The first part of the project I was optimizing

Go To Last Post
21 posts / 0 new
Author
Message
#1
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Hi Guys,

Here is the the first part of the project I was trying to optimize with mixing ASM with GCC.

http://www.instructables.com/id/...

Obviously it was not rock-blaster or quad-tris that needed the super-fast line draw routines but game #4 I am working on

AVR-Type

Which will be a vector-ized implementation of a side scrolling shooter with power-ups, crazy weapons and giant end of level bosses.

In the end by using hand crafted ASM and forcing C to use register variables and forcing C to use specific addresses with a new "section" I was able to increase the line/scene draw routine a little bit over 5x as fast as the C-Only version was.

The "API" is still all nice an C callable so the game logic can be done in nice easy to program C. That was the initial goal of my first ever C project on a microcontroller.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Quote:

I was able to increase the line/scene draw routine a little bit over 5x as fast as the C-Only version was.

You can usually get back 10%+ by hand crafting asm. A 500% increase suggests you weren't writing the C optimally in the first place.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Cliff,

You saw yourself here

https://www.avrfreaks.net/index.p...

That in the C-Structure being read from flash part alone that ASM was more than 2x faster than C. So thats more than a 10% saving.

Maybe it was sub optimal way of doing it, but I did ASK if there was a better way and no one offered more optimal C (that is still readable).

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Quote:
You can usually get back 10%+ by hand crafting asm. A 500% increase suggests you weren't writing the C optimally in the first place.

It depends on so much.
If it's neat C, then there often are 2x speed gained by "ugly" C with tricks.
But pure ASM offers from nothing (0 to 20% just by optimate the C output) to 2x-10x if it's things that can be done much better because you can make a better structure, or use the HW better if you count the exact clock cycles

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Sparrow2,

Yes on the "get layout" part of the scene draw routine (the on simplified in that other thread) I could get C to within 30% of the ASM by making it ugly and un-readable.

I think if it is going to be ugly and unreadable you may as well use ASM.

One of the biggest savings in that respect was 16 vs 8 bit.

The X and Y variables are 16 bit as they come in. In clean readable C they get treated as 16 bits the whole way through.

In the ASM version after the very first CP instruction of the first quadrant comparison I can branch off to two separate paths one 8 bit and one 16 bit. >90% of all calls are doable 8 bit.

The other big saving as you suggested was layout/structure related. By carefully placing sturctures in memory I was able to nix the use of MUL and also reduce register usage.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I don't know how you draw a line, but the fastest (I know of) is to make a DDS like thing where you all ways move x or y one, and add a number to a counter for the other one , if it overflow add one else keep it.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

That is called "Bresenham's algorithm".

Regards,
Steve A.

The Board helps those that help themselves.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Sparrow,

Yes I use Bresenhams.

It is optimized for 4 quadrants each broken into steep and shallow as well as the non-angled cases H+ H- V+ V-.

For the ILI932x LCDs that is especially important as your address can be set to auto inc. after a write.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

If you really need speed for all, (I used it on a C64 ), then have a table where each bit indicate the carry (for the slow side), so you just make a bit rotate thru carry, and then adc 0. (and load a byte each 8. time, no counter needed set carry after load, then a brance on zero can be used for next byte load).
This was absolutly the fastest way to make circles! (but had to have a table for each size), but for lines it works aswell (but need a table for each slope).

Quote:
Bresenhams

I can see that it is using x1 x2 and y1 y2 distance, if you use overflow on 8 or 16 bit, it will be faster. (if slope is known!)
I made a :
plotline(x,y,arc, length), it was faster than
plotline(x1,x2,y1,y2)

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Quote:

A 500% increase suggests you weren't writing the C optimally in the first place.


Quote:

You saw yourself here

https://www.avrfreaks.net/index.p... ... p;t=131534

That in the C-Structure being read from flash part alone that ASM was more than 2x faster than C. So thats more than a 10% saving.


That thread was more a "work in progress".

I'm always interested in these "you can't do this [nearly as well] in C" discussions. I'd be interested in seeing your end result--the C, the ASM that is 5x as fast, and info on how you measured and the measurement results.

Are you "cunning_fellow"?

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

theusch wrote:
Are you "cunning_fellow"?

I live close to him and am helping with code on that project.

Cunning is very much hardcore ASM and hardware hacking. Although he was taught C at uni he doesn't actually think it is "fun".

I convinced him to use C for the asteroids game so it would be easy for other people to hop on board.

I myself don't get involved in my own projects these days as health issues may cause me to pull out with no notice.

theusch wrote:

I'd be interested in seeing your end result--the C, the ASM that is 5x as fast, and info on how you measured and the measurement results.

The measurement of speed was done with a tek scope looking at a spare GPIO pin. At the start of a screen-draw I would send the GPIO high. Then at the end of screen-draw I would send it low.

I wrote the complex parts of the code for AVR-Type on the desktop in C long before cunning had built the hardware. The desktop version could save the state of the screen objects to a text file.

I played the game on the desktop to get to a typical but "busy" screen and hit save.

I then used this as the "test scene".

The "test scene" is not just a BMP or SVG like file. It is actually the state of game objects to get passed to the renderer.

Each change/optimization I made along the way I took note of how much time the test scene took to draw.

The original GCC 3.4.1 version was 26.47mS

My most recent ASM version is 5.13mS

The biggest improvements where

1, Dropping 16 bit maths when I could

2, In-line-ing 8 versions of the inner bresemham loop

3, Aligning the main array of objects (a 16byte structure) to a boundary.

4, Accessing sequential structures with LPM Z+

5, Using "register" variables for X0, Y0

6, Throwing away "register call conventions" for a specialized version of drawline that can't be called by C. I have two drawline routines now drawline() and drawline_asm_only:

The first two could have been done in ugly looking C and they alone where responsible for about a 3x improvement.

Ugly looking C or normal looking ASM are about on par for readability. Most end users will never have to look at the ASM though.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Quote:

Each change/optimization I made along the way I took note of how much time the test scene took to draw.


Fair enough.

Quote:

The original GCC 3.4.1 version was 26.47mS

My most recent ASM version is 5.13mS

The biggest improvements where

1, Dropping 16 bit maths when I could

2, In-line-ing 8 versions of the inner bresemham loop

3, Aligning the main array of objects (a 16byte structure) to a boundary.

4, Accessing sequential structures with LPM Z+

5, Using "register" variables for X0, Y0

6, Throwing away "register call conventions" for a specialized version of drawline that can't be called by C. I have two drawline routines now drawline() and drawline_asm_only:


So, now I'd repeat:

Quote:

I'd be interested in seeing your end result--the C, the ASM that is 5x as fast ...

But perhaps the entire thing is a bit [pun intended] too extensive for an analysis.

Now, I'm not an expert on GCC for the AVR, and making it do tricks. The numbered list is intriguing, however, with toolchains I'm more familiar with. I'd say that each on its own could be addressed in C. But indeed, when the entire "task" is viewed then I wouldn't disagree that an expert ASM code could use their own model of code generation and get some efficiency for this repetitive job.

I've lost track of what display is being painted with this frame. 320x160 monochrome? That would be 52000 pixels in ~5.2ms, 5200us, or about 100 pixels per microsecond. Sounds pretty impressive.

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Quote:
about 100 pixels per microsecond.
Not necessarily. Likely the entire frame is not being drawn each time, only the parts that need to be drawn. For something sparse like the asteroid game, there is really very little to draw on each frame.

Regards,
Steve A.

The Board helps those that help themselves.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Quote:
Now, I'm not an expert on GCC for the AVR, and making it do tricks. The numbered list is intriguing, however, with toolchains I'm more familiar with. I'd say that each on its own could be addressed in C. But indeed, when the entire "task" is viewed then I wouldn't disagree that an expert ASM code could use their own model of code generation and get some efficiency for this repetitive job.

Read the thread :
Quote:
I could get C to within 30% of the ASM by making it ugly and un-readable.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

The LCD screen is 320x240 16 bit colour. I am only plotting single colour lines per objects.

As Koschi said - Asteroids (or AVR-Type) are sparse screens that only have a few lines on them. This draw screen routine is not calculating the clear screen time which is

320x240 = 76800 pixels
9.8uS to clear
8 million pixels / second (or very close too)

The nice thing about clear screen though is that it is "pseudo DMA". Where TOC1 does the clearing for me and the CPU is free to X-sort the objects so I can draw them most optimally

The Scene itself is 1641 plotted pixels in total (some pixels over-plotted for fewer than that visable on screen). 1641 Pixels in 5.13mS is 320 pixels per mS or ~50 clock ticks per pixel.

Quote:
each on its own could be addressed in C

#3, Was implemented in C (with "section") but the C compiler would not take advantage of the easy maths it affords. Human can now swap/shift/AND pointers rather than using MUL.

#1, and #2 could be done in C. #2 would look a little bit ugly but #1 would look horrible.

I have 16 bit X and 8 bit Y values. If either X0 or X1 are greater than 256 then all the maths has to be done 16bit. (dX, dY, err), In ASM the first thing I do is calculate dX and if I can use 8bit maths I keep going with an 8bit only routine. If I need 16bits I branch off to a fully 16bit routine (8bits is most common so I branch to 16bit)

#3, #5 are both anti-C optimizations. I am doing things in ASM which the high priests say

"you should never do that - the compiler is smart and can figure it out better than you can"

#4 was just a classic case of a human spotting an elegant optimization the compiler is trying to brute force better. (see other thread for explanation)

#6 is Human using all 32 register fully.

The "scene" has 390 lines. There are 10 constants that have to be LDI to registers to output them to the data port at least once per line. These are "commands" to the LCD screen

SET UPDATE DIRECTION
SET X
SET Y
SET DATA

and so on.

The ASM only version pre-loads these into spare registers at the start of scene draw and never has to do an LDI again for all of its 1641 pixels.

Quote:
But perhaps the entire thing is a bit [pun intended] too extensive for an analysis.

The code is more than a BIT crazy. It should all be available to download on the instructables project page as far as I know.

I can post the vanilla C bresenham routine here if you like, but it is such a common thing (it's on wikipedia) that hardly seems needed.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

If you run @16MHz it will be about 50 clk pr. dot, it sounds like a lot!
unless "SET UPDATE DIRECTION" take a lot of time.
How long does it take to plot a line that is 100 dot's long?

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

You can plot a horizontal or vertical line at 2 cycles per pixel. The preamble and postamble have an overhead though.

Of course, most lines are at an angle. Hence you need to calculate which pixels to plot. A shallow or steep line can be approximated by short horiz or vertical lines.

In practice, you have to calculate where each pixel is going to be. This often requires maths.
Then you have to send the specific drawPixel(x, y, colour) command. This involves writing some 6 16-bit values for each pixel. I can't see how you can do this in less than 24 cycles.

If you need the speed, go for a different MCU.

David.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Sparrow, As David says, for a single pixel you have to write out 6 x 16 bit values

0x0020 = Set X coordinate Command
0x0XXX = X coordinate (where XXX is 0..319)
0x0021 = Set Y coordinate Command
0x00YY = Y coordinate (where YY is 0..239)
0x0022 = Set Pixel Colour
0xCCCC = Colour (where the 16b are bbbbbggggggrrrrr)

Luckily you don't have to do that every pixel and only when "err" causes a step.

Then there is Bresenham around the outside of that.

Then there is the object fetching from flash around that.

David, This was an exersize in showing what you could do with an AVR :)

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

but you wrote

Quote:
For the ILI932x LCDs that is especially important as your address can be set to auto inc. after a write.
so you don't need 6 writes or does that just count where dX or dY is 0?

For a prototype I have had a run a similar display but with an other chip (1663 I think it was called), but because it mostly was for debugging I did not optimize the rutines.
But I wrote a plot lib for the c64, and because of the odd memory layout I had to make some lookup tables (math is not the 6502's strong side but pointers are).

Add:
And I guess that you only have set the colour if you actually want to change it.

Edit it was a SSD1963

Last Edited: Tue. May 7, 2013 - 12:39 PM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

If you are writing a single pixel then you have to write out the 6 words.

In a straight line, to write the next pixel right next to the first, you only have to toggle the clock.

If it is a diagonal line and there has been a step, then you have to write out all 6 words again.

At the start of the line you also have to write out 2 words

0x0003 = SET AUTO INCREMENT DIRECTION
0x.... = H+ or H- or V+ or V-

So when you do just toggle clock it does the next pixel in the correct direction.

This means that short lines are not effecient and that lines close to 45 degree are bad.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Yes, dX or dY being 0 is how the FAST drawFastxLine() works.
As Andrew has shown, you still have to write the colour to each pixel. However, if the colour is already in registers, you only need to OUT those registers to your PORTs.

I can't think of how any hardware LCD controller could work any better. The only possible way would be to set up a combined X and Y auto-increment. This would mean the LCD controller needs to perform maths. A more likely 'enhancement' would be to set an auto X-increment so that you only need to set Y and colour. Or another might be to remember the colour.

David.