GCC vs ASM code size - Help me close the gap

Go To Last Post
104 posts / 0 new
Author
Message
#1
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Now - before anyone claims this is a set up - you can look on the uzebox forums. This code was posted on the 27th Jan which was 2 days before I first spoke on the lazarus troll thread.

Also I HAVE asked here several times before for help in writing better C so the compiler knows how to optimize what I am doing. So this is in form with what I have asked in the past.

The troll thread DID prompt me to ask this question though. I am curious if GCC can get close and if the way I am writing the argument is sub optimal.

Again - I do write C code. I like to write C code. I was formally taught to write C code in a university environment. All that said I know I am a pretty average C coder.

void SetRamTile(uint8_t x, uint8_t y, uint8_t tile)
{
  x = x & 0x1f;
  y = y & 0x1f;
  vram[((y>>3)*256)+x<<3+(y&7)] = tile;
}

becomes

226:      {
+000060BB:   716F        ANDI      R22,0x1F       Logical AND with immediate
229:        vram[((y>>3)*256)+x<<3+(y&7)] = tile;
+000060BC:   2FF6        MOV       R31,R22        Copy register
+000060BD:   95F6        LSR       R31            Logical shift right
+000060BE:   95F6        LSR       R31            Logical shift right
+000060BF:   95F6        LSR       R31            Logical shift right
+000060C0:   E0E0        LDI       R30,0x00       Load immediate
+000060C1:   718F        ANDI      R24,0x1F       Logical AND with immediate
+000060C2:   0FE8        ADD       R30,R24        Add without carry
+000060C3:   1DF1        ADC       R31,R1         Add with carry
+000060C4:   E070        LDI       R23,0x00       Load immediate
+000060C5:   7067        ANDI      R22,0x07       Logical AND with immediate
+000060C6:   7070        ANDI      R23,0x00       Logical AND with immediate
+000060C7:   5F6D        SUBI      R22,0xFD       Subtract immediate
+000060C8:   4F7F        SBCI      R23,0xFF       Subtract immediate with carry
+000060C9:   C002        RJMP      PC+0x0003      Relative jump
+000060CA:   0FEE        LSL       R30            Logical Shift Left
+000060CB:   1FFF        ROL       R31            Rotate Left Through Carry
+000060CC:   956A        DEC       R22            Decrement
+000060CD:   F7E2        BRPL      PC-0x03        Branch if plus
+000060CE:   5EE0        SUBI      R30,0xE0       Subtract immediate
+000060CF:   4FFE        SBCI      R31,0xFE       Subtract immediate with carry
+000060D0:   8340        STD       Z+0,R20        Store indirect with displacement
230:      }
+000060D1:   9508        RET                      Subroutine return

Now I have tried replacing the

(y>>3)*256

part with several other variations including a struct/union that has 2x uint8 sharing with a uint16. None of them ended up better, only different.

In fact the original version was

vram[((y>>3)*256)+8*x+(y&7)]

which I changed the (8*x) to (X<<3) and that did yield an improvement.

Anyways, the ASM code that gets the same result is

;***********************************
; SET TILE 8bit mode
; C-callable
; r24=X pos (8 bit)
; r22=Y pos (8 bit)
; r20=Tile No (8 bit)
;************************************
.section .text.SetTile
SetTile:

#if SCROLLING == 1
;index formula is vram[((y>>3)*256)+8x+(y&7)]

;                         r22               r24               XH               XL                T
                        ; y7y6y5y4y3y2y1y0  x7x6x5x4x3x2x1x0  - - - - - - - -  - - - - - - - - 

lsl   r24               ; y7y6y5y4y3y2y1y0  x6x5x4x3x2x1x0 0
lsl   r24               ; y7y6y5y4y3y2y1y0  x5x4x3x2x1x0 0 0
lsl   r24               ; y7y6y5y4y3y2y1y0  x4x3x2x1x0 0 0 0

mov   XH,r22            ;                                     y7y6y5y4y3y2y1y0 
lsl   XH                ;                                     y6y5y4y3y2y1y0 0
swap  XH                ;                                     y2y1y0 0y6y5y4y3
andi  XH,3              ;                                     0 0 0 0 0 0 y4y3

andi  r22, 7            ; 0 0 0 0 0 y2y1y0
or    r24, r22          ;                   x4x3x2x1x0y2y1y0
mov   XL, r24           ;                                     0 0 0 0 0 0 y4y3 x4x3x2x1x0y2y1y0
subi  XL,lo8(-(vram))
sbci  XH,hi8(-(vram))

st   X,r20

ret

#else

Now I would be surprised if any compiler did the LSL/SWAP bit as that normally would not save you any time if not for the fact that bits5..7 are don't cares.

But is there a way I can hint the compiler to get it better than it is?

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Quote:
x = x & 0x1f;
y = y & 0x1f;

No fair; your assembly code doesn't do that (having noticed that it's unneeded.

But yeah; if you look at the code, and follow some of the C discussions, you'll notice that some of the inefficiencies are due to this rule that C has about making intermediate results be "int" in complex expressions. So by eliminating complex expressions:

void SetRamTile2(uint8_t x, uint8_t y, uint8_t tile)
{
//  x &= 0x1F;
//  y &= 0x1F;
  uint8_t xt = x<<3;
  uint8_t yt = y>>3;
  xt = xt + (y & 7);
  vram[(yt*256)+xt] = tile;
} 

You get more compact code:

00000000 <_Z11SetRamTile2hhh>:
   0:   88 0f           add     r24, r24
   2:   88 0f           add     r24, r24
   4:   88 0f           add     r24, r24   ;; x<<3
   6:   26 2f           mov     r18, r22
   8:   27 70           andi    r18, 0x07   ;; y & 7
   a:   28 0f           add     r18, r24   ;; y&7 + x<<3
   c:   96 2f           mov     r25, r22  ;; y again
   e:   96 95           lsr     r25
  10:   96 95           lsr     r25
  12:   96 95           lsr     r25    ;; y >> 3
  14:   80 e0           ldi     r24, 0x00  ;; *256
  16:   82 0f           add     r24, r18 ;; add (now 16b)
  18:   91 1d           adc     r25, r1
  1a:   e0 91 00 00     lds     r30, 0x0000  ; vram
  1e:   f0 91 00 00     lds     r31, 0x0000
  22:   e8 0f           add     r30, r24
  24:   f9 1f           adc     r31, r25
  26:   40 83           st      Z, r20
  28:   08 95           ret
Last Edited: Mon. Feb 3, 2014 - 02:59 AM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Westfw,

Thanks for that. It makes sense why my all my attempts to help the compiler out failed - knowing it was doing int anyways.

BTW - I would still need to keep the

y &= 0x1F;

at least.

It is there for safety. The end user _should_ not pass in X or Y values greater than 31. If they do though - then some random bit of RAM is going to get trashed.

The ASM version I posted will always leave XH as

0 0 0 0 0 0 Y4Y3

The C code there will leave XH as

0 0 0 y7y6y5y4y3

Thats gotten it down from 23 to 20 words and a LOT of clock cycles faster (I'm not going to add them up)

Down from 40% smaller to 30% smaller and even bigger gap closing in the speed stakes.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Seems like you're imposing a mathematical approach when you've acknowledged that at its root it's really a bit mapping problem:

0 0 0 0 0 0 y4y3 x4x3x2x1x0y2y1y0

So why hobble the compiler with the math?

void SetRamTile(uint8_t x, uint8_t y, uint8_t tile) {
  uint8_t hi = (y>>3) & 3;
  uint8_t lo = (x<<3) | (y&7);
  vram[(hi<<8) | lo] = tile;
}
00000000 :
  vram[(yt*256) + xt] = tile;
}
#else
void SetRamTile(uint8_t x, uint8_t y, uint8_t tile) {
  uint8_t hi = (y>>3) & 3;
  uint8_t lo = (x<<3) | (y&7);
   0:   96 2f           mov     r25, r22
   2:   97 70           andi    r25, 0x07       ; 7
   4:   88 0f           add     r24, r24
   6:   88 0f           add     r24, r24
   8:   88 0f           add     r24, r24
   a:   89 2b           or      r24, r25
  yt = (y>>3) & 3;
  vram[(yt*256) + xt] = tile;
}
#else
void SetRamTile(uint8_t x, uint8_t y, uint8_t tile) {
  uint8_t hi = (y>>3) & 3;
   c:   66 95           lsr     r22
   e:   66 95           lsr     r22
  10:   66 95           lsr     r22
  12:   63 70           andi    r22, 0x03       ; 3
  uint8_t lo = (x<<3) | (y&7);
  vram[(hi<<8) | lo] = tile;
  14:   e8 2f           mov     r30, r24
  16:   f0 e0           ldi     r31, 0x00       ; 0
  18:   f6 2b           or      r31, r22
  1a:   e0 50           subi    r30, 0x00       ; 0
  1c:   f0 40           sbci    r31, 0x00       ; 0
  1e:   40 83           st      Z, r20
  20:   08 95           ret

andrewm1973 wrote:
Thats gotten it down from 23 to 20 words and a LOT of clock cycles faster (I'm not going to add them up)
Your assembler version takes 14 words and 18 cycles, the above C takes 17 words and 21 cycles, both including the ret.

So C is 3 words/cycles larger/slower, or about 21% larger and 17% slower, than your assembler.

Where do those 3 come from?

- The C version uses ldi/or an where a mov would do, making for 1 wasted word/cycle due to the 16-bit nature of the final expression. Type punning could likely get rid of that, although that's a bad idea of course.

- There's an unnecessary mov to r25 when it could have gone to r30 straight away.

- The lsl/swap technique buys you one cycle.

I've seen the compiler do some pretty clever things with swap, not sure why it's not doing it here.

JJ

"Experience is what enables you to recognise a mistake the second time you make it."

"Good judgement comes from experience.  Experience comes from bad judgement."

"Wisdom is always wont to arrive late, and to be a little approximate on first possession."

"When you hear hoofbeats, think horses, not unicorns."

"Fast.  Cheap.  Good.  Pick two."

"We see a lot of arses on handlebars around here." - [J Ekdahl]

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Thank you also JJ.

Quote:
So why hobble the compiler with the math?

((y>>3)*256)+8x+(y&7)

was what I was given by someone else.

I instantly realised it was a bit mapping problem when I converted it to ASM.

I did experiment in the C though and I did see a difference between *8 and <<3 in the long complex equation.

Before I knew that breaking it into smaller parts helped I did try both *256 and <<8. Both where equally bad when it was treating it as an int16.

So it ended up as

((y>>3)*256)+x<<3+(y&7)

rather than

((y>>3)*256)+8*x+(y&7)

Quote:
I've seen the compiler do some pretty clever things with swap, not sure why it's not doing it here

The SWAP/LSL only saves the one clock because the higher bits are don't-cares. I didn't think the compiler would get that even after you guys have shown me where I was going wrong.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

andrewm1973 wrote:
I instantly realised it was a bit mapping problem when I converted it to ASM.
You performed two independent tasks:
- redefining the problem into more suitable terms
- (re)writing the solution in another language

These are actually two separate activities. You can perform one without performing the other. Performing both and pointing at one language's success over another is stacking the deck (I know that's not what you're saying... Thank you by the way for forking the debate... It was getting a bit hot in there :) )

While thinking about and programming solutions to problems is invariably (and to some degree inextricably) related to the choice of language, the best approach in one language is generally the best approach in another.

Even more so when the two languages share so much in common. C has been referred to as a collection of fancy assembler macros. If memory serves, that statement has been made both by ASM and C programmers. It has also been refuted by both.

While I see the reasoning behind the statement (and refutations), I think it underscores the root problem: Humans don't know how to think 'outside the box'. Yes, that's a broad statement. Intentionally made for it's irony ;) (ask Alanis).

Too often it seems these debates (like the one from which you forked) miss the point before they start. Never mind the language. What's the best solution to the problem? Indeed, what is the problem?

Once you have an idea how best to tackle the problem, asking yourself in which language the solution should be implemented is often almost as irrelevant as asking what colour hammer you should use to pound in this nail. I like the blue ones.

Granted that's an oversimplification. And there are real differences that can't be ignored. Lisp isn't great for expressing procedural solutions, C can't access the carry bit, and ASM has no notion of type. But doesn't that just mean that when I want to put in an upholstery tack I use an upholsterer's hammer, and when I want to put in a finishing nail I use a trim hammer and a nail set?

Too late to say I don't want to get too philosophical... but I think if people spent more time framing the problem and less time implementing the solution there would be fewer flame wars (and real ones?).

Quote:
I did experiment in the C though and I did see a difference between *8 and <<3
Why did you stop there? You saw an improvement, and you'd already "instantly realised it was a bit mapping problem" when you wrote the ASM.

This is a pretty clear example of how a language (programming and human) or our perception of it can sometimes blind us. I'm just as guilty, by the way ;)

I have now done perhaps exactly what you'd hoped to avoid: brought your thread back around to the 'debate' instead of dealing with 'the problem' and discussing the solutions. If you'd rather I had not done that, let me know and I'll carve the philosophy out and post it into the original troll-thread. I think I'll use one of these...

Quote:
I didn't think the compiler would get that even after you guys have shown me where I was going wrong.
So give it the hint:
  uint8_t hi = y<<1;
  hi = (hi>>4) | (hi<<4);
  hi = hi&3;
  uint8_t hi = y<<1;
   0:   e6 2f           mov     r30, r22
   2:   ee 0f           add     r30, r30
  hi = (hi>>4) | (hi<<4);
   4:   e2 95           swap    r30
  hi = hi&3;
   6:   e3 70           andi    r30, 0x03       ; 3

I'm still pretty sure I've seen it do this on it's own... but of course I can't make it happen now :?

In any case this doesn't result in a smaller/faster function, the compiler does an additional mov later that nullifies the gain. I haven't dug any deeper to see if it can be dispatched.

JJ

"Experience is what enables you to recognise a mistake the second time you make it."

"Good judgement comes from experience.  Experience comes from bad judgement."

"Wisdom is always wont to arrive late, and to be a little approximate on first possession."

"When you hear hoofbeats, think horses, not unicorns."

"Fast.  Cheap.  Good.  Pick two."

"We see a lot of arses on handlebars around here." - [J Ekdahl]

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Quote:

BTW - I would still need to keep the

Code:
y &= 0x1F;

at least.

It is there for safety.


Just a side note, because it often comes up in this type of discussion:

I certainly won't disagree to write protective/robust code. But it is often found, as noted above, that when we get "you can't do this in C; see this much shorter ASM sequence" that the same problem is not being solved.

When working on the ASM sequence, the guru that thinks in machine code is going to use brain processes "... this reg can only be this big a value at this time so I can now apply the next thing in an unsigned manner and I know it is only a few bits and won't overflow and ...". The compiler needs to best do this by applying rules and heuristics and such.

So indeed when you have an example such as yours, and it is an "important" piece (as yours is) then you have to use the same processes in your head to give the compiler a hint, or tell it to override a "rule" or two.

Referring back to the other thread, using this type of process has gotten C "close". Not e.g. x2.

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

This is what I got as a fairly straight-forward hand assembly of the C code.

#define x  R24
#define y  R22
#define tile R20
    .global SetTile
SetTile:
    ANDI x, 0x1F
    ANDI y, 0x1F
    LDI  R30, 7
    AND  R30, y
    LSL x
    LSL x
    LSL x
    ADD  R30, x  ; OR would have same result
    LSR y
    LSR y
    LSR y
    LDI R31, hi8(vram)
    ADDI R30, lo8(vram)
    ADC  R31, y
    ST Z, tile
    RET

16 words and 20 cycles counting the return.
I could have saved a cycle and a word using swap.
OTOH my selection of registers targets might be regarded as mildly clever.
OYAH I didn't use any information not given the compiler.

"Demons after money.
Whatever happened to the still beating heart of a virgin?
No one has any standards anymore." -- Giles

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

joeymorin wrote:

You performed two independent tasks:
- redefining the problem into more suitable terms
- (re)writing the solution in another language

Yes. Although in my head and to me that always looks like one task. I am guilty of not spotting them as two tasks because of my own familiarity with working with bits.

I guess a bit of my problem there also is that I am expecting too much of the compiler. It shows I don't know enough about compilers to write my own even though I can write quite acceptable ASM code :)

I though that the compiler would know that bits 5,6 and 7 are don't cares because of the "AND 0x1F" the same way I just know it.

joeymorin wrote:
Quote:

I did experiment in the C though and I did see a difference between *8 and <<3

Why did you stop there? You saw an improvement, and you'd already "instantly realised it was a bit mapping problem" when you wrote the ASM.

When I tried <<8 vs *256 the compiled code got no smaller and got more confusing. Now I know this was an artifact of doing it all in one line rather than breaking up the lines. I think I probably would have gotten pretty close in the end if I had of not choosen the wrong path (complex equation) to start with.

joeymorin wrote:

hi = (hi>>4) | (hi<<4);

For <<4 or >>4 swap is always an option. When you have to shift by 3 - swap/lsl or lsr won't always win.

theusc wrote:

So indeed when you have an example such as yours, and it is an "important" piece (as yours is) then you have to use the same processes in your head to give the compiler a hint, or tell it to override a "rule" or two.

Well to me it is not that important code. It is part of the Uzebox library for doing tile based games like mario/zelda and I have always thought those games a bit naff.

It _IS_ library code though so it deserves to be as fast as possible.

It does demonstrate a point along lines similar to what joeymorin made above.

Uze/Alec wrote the original ASM code used in the library. His ASM was faster than his C.

I rewrote the thing in ASM based off his C. My ASM was faster than his ASM.

People here that know more about the compiler than I do have rewritten the same function in C.

The new C code from here is not as fast as the minimal ASM, but it is exactly the same (21 cycles) as the ASM code that was in the library.

Thank you again all. I am learning to think like the compiler a bit better.

If no one minds I may continue to post in this thread other examples of C vs ASM I have done recently. It can help prove how bad I am with AVR-C but more importantly it might be a good reference to point future C vs ASM flame wars at (as long as we can manage to keep this thread flame free)

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

andrewm1973 wrote:
I guess a bit of my problem there also is that I am expecting too much of the compiler.
Perhaps. A (any) compiler is after all merely (rather, impressively) a collection of rules and when to apply them. A human (or several) had to come up with those rules, decide when to apply them, and write code to implement those rules and whens. A daunting task. I'm often surprised at how much it gets right.

In this respect a compiler is a knowledge based expert system built upon the experience of other programmers, most of whom are experts at C and other languages, including assembly, often on multiple platforms. Thanks to their efforts you may wield an impressively destructive hammer with relative ease ;)

We should remember though that we have only been talking about AVR GCC, itself based on the multi-platform, multi-target, multi-language GCC. A generalist in a room full of specialists.

I'd be interested to see how other compilers deal with the various versions of your code. I have no others on my machine, but I expect some would fair better, some would fair worse.

As is, GCC is only 3 cycles away from the currently best ASM solution offered, and some or all of them may yet disappear if an actual GCC guru chimes in. The lsl/swap technique is reachable with the right hints (as shown above), if only the later (and likely unrelated) mov can be suppressed. It might be a simple matter of the right command-line option to the compiler.

Quote:
I though that the compiler would know that bits 5,6 and 7 are don't cares because of the "AND 0x1F" the same way I just know it.
So did I. I know I've seen something similarly clever, but can't conjure it up.

Quote:
I am learning to think like the compiler a bit better.
That is a colourful crayon in anyone's box, but a case could be made that you should't have to think like the compiler. HLLs are meant to free us from such restrictions. On embedded systems, certainly bare-metal systems like AVR8, we may never be able to fully enjoy that freedom. That doesn't bother me ;)

Quote:
as long as we can manage to keep this thread flame free)
I promise :)

"Experience is what enables you to recognise a mistake the second time you make it."

"Good judgement comes from experience.  Experience comes from bad judgement."

"Wisdom is always wont to arrive late, and to be a little approximate on first possession."

"When you hear hoofbeats, think horses, not unicorns."

"Fast.  Cheap.  Good.  Pick two."

"We see a lot of arses on handlebars around here." - [J Ekdahl]

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

joeymorin wrote:
andrewm1973 wrote:
as long as we can manage to keep this thread flame free)
I promise :)

OK - well here is my next question to the smart C people in the "closing the gap" series.

The ASM detractors will love this one. They can say "Look at all the effort the ASM monkey had to go to to save one clock cycle"

The previous routine to get an X/Y tile was from Uze/Alecs original library code for a TILE video mode.

This next bit is from the video mode that was modified to get Asteroids going.

One of the things done for that video mode was to locate the VRAM array at a specific location.

in the Makefile

LDFLAGS += -Wl,--section-start=.ramtiles=0x00800400
LDFLAGS += -Wl,--section-start=.vram=0x00800C00

declaring the variables

unsigned char ramTiles[256*8]  __attribute__ ((section (".ramtiles")));
unsigned char vram[32*28] __attribute__ ((section (".vram")));

and here is my macro that converts an X/Y pixel location to a X/Y tile location.

; fast_line_convert_x0_y0_into_VRAM_address
;
; converts the X0 and Y0 address passed in r24 and r22 into a VRAM memory
; location and leaves this result in R26:27 (VRAM_Address)
;
; Inputs 
;           Y0 address = R22
;           X0 address = R24
; Outputs
;           VRAM_Address = R26:27 (X)
; 
; Requires that the constants 4 and 32 are in R28, R29
;
; Trashes R0:1

.macro fast_line_convert_x0_y0_into_VRAM_address
    mul  r22, r28           ; Multiply Y0 by 4   y7y6y5y4y3y2y1y0      => .0.0.0.0.0.0y7y6:y5y4y3y2y1y0.0.0
    movw r26, r0            ; move the 16 bit result into VRAM_Address
    andi r26, 0b11100000    ; clear out the bits that are used for Xn     .0.0.0.0.0.0y7y6:y5y4y3.0.0.0.0.0
    mul  r24, r29           ; Multiply X0 by 32  x7x6x5x4x3x2x1x0      => .0.0.0x7x6x5x4x3:x2x1x0.0.0.0.0.0
    or   r26, r1            ; OR X7..3 into low byte of VRAM_Address      .0.0.0.0.0.0y7y6:y5y4y3x7x6x5x4x3
    subi r27, hi8(-(vram))  ; Add the base address of VRAM                .0.0.0.0.1.1y7y6:y5y4y3x7x6x5x4x3
.endm

This is effectively

VRAM[(((Y&0xF8)>>3)<<8|(X>>3)]

Not asking about optimizing C bit manipulations here. You guys have shown me the light there.

This is all about the line

    subi r27, hi8(-(vram))  ; Add the base address of VRAM                .0.0.0.0.1.1y7y6:y5y4y3x7x6x5x4x3

Because I placed VRAM where I wanted it on a 10 bit boundary. I can add the base address of VRAM to the calculated array address with just one subi. This is as opposed to two subtracts as the previous example.

subi  XL,lo8(-(vram)) 
sbci  XH,hi8(-(vram))

So now the obvious question. How do I hint/convince/trick GCC into making

VRAM[Address]

only bother to add/sub the top 8 bits?

It is again something I thought C would optimize away itself automagically. The base address is 0x0c00. The bottom 8 bits is 0x00. That is a constant. It should not need to add 0x00 to something to get the result.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Quote:

It is again something I thought C would optimize away itself automagically

But the C compiler doesn't know the address. That is only resolved when the linker operates. Now it used to be the case that all the optimisation happened at compile time but recently the linker got Link Time Optimisation (LTO) so you may want to explore what this offers.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

clawson wrote:
But the C compiler doesn't know the address. That is only resolved when the linker operates.
Quote:

I get in the normal situation the compiler doesn't doesn't know the address so outputs symbols and the linker joins the dots.

I thought that I was telling linker and the compiler what address to use up front though.

Does this mean there is no way to get the compiler to take advantage of the memory alignments I have done?

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Quote:

I thought that I was telling linker and the compiler what address to use up front though.

Why would the compiler be interested?
Quote:

Does this mean there is no way to get the compiler to take advantage of the memory alignments I have done?

What did your exploration of LTO turn up? I don't know if it will help or not as I don't know what link time optimisations are done for AVR but it is probably the best hope.

I guess the LTO journey probably starts here:

http://gcc.gnu.org/wiki/LinkTime...

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Quote:
What did your exploration of LTO turn up?

I read the page you linked and at least two levels deep linking away from any blue text on that page.

I understood nearly none of it.

Remember I am the person that isn't smart enough to know that writing

unsigned char vram[32*28] __attribute__ ((section (".vram")));

In the source file IS NOT actually telling the compiler anything.

I shan't give up though. I will try google some of the words I didn't understand on that wiki page and see if I can't bang some knowledge into my head. I might need to borrow one of joeys hammers.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Quote:

In the source file IS NOT actually telling the compiler anything.

It is telling the compiler something. It is saying "when you create this object" add a .section directive to the Asm source file - nothing more, nothing less.

If I use that very line then the Asm generated is:

.global	vram

	.section	.vram,"aw",@progbits
	.type	vram, @object
	.size	vram, 896
vram:
	.zero	896

If I remove the section attribute from the C source then the Asm code is:

	.comm	vram,896,1

As you can see neither conveys any information about alignment or position in memory. That decision doesn't come until later when either the linker script or a -section-start tells the linker where this will be placed. The C compiler has already generated Asm source to access it by this point (assuming you wrote some C to do so). Now there is a chance that LTO may recognise some aspect of the positioning and access that allows it to make assumptions about index register loads or whatever. Then again LTO is principally for the gcc "big iron" and it's also fairly new so it may not do a great job for AVR just yet until an AVR linker expert works on it a bit.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

How many places call this code?
If not to many force the compiler to make it inline, then you will save at least 7 clk.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0
unsigned char *vram2=(unsigned char*)(0xFF00 & (int)vram);

"Demons after money.
Whatever happened to the still beating heart of a virgin?
No one has any standards anymore." -- Giles

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

If the need for speed is great enough,
one might convert the function into a macro.
If that is not good enough, one might dispense with optimizing single tile settings and optimize loops instead.

for(unsigned char y=...) {
    // y in 0..0x1F
    for(unsigned char x=...) {
        tile=...;
        vram[(y>>3)*0x100 + (y&7) + (unsigned char)(x<<3)]=tile;
    }  // x
} // y


for(unsigned char x=...) {
    for(unsigned char yhi=...) {
        // yhi in 0..3
        for(unsigned char ylo=...) {
            unsigned char y=yhi*8+ylo;
            tile=...;
            vram[(unsigned char)8*x+yhi*0x100+ylo]=tile;
        } // ylo
    }  // yhi
} // x

"Demons after money.
Whatever happened to the still beating heart of a virgin?
No one has any standards anymore." -- Giles

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

clawson wrote:
Then again LTO is principally for the gcc "big iron" and it's also fairly new so it may not do a great job for AVR just yet until an AVR linker expert works on it a bit.
I've fiddled with LTO, although not extensively.

In 4.7.2 it would in some cases make code significantly less efficient.

4.8.1 appears to have improved, but I haven't spent a lot of time with it.

"Experience is what enables you to recognise a mistake the second time you make it."

"Good judgement comes from experience.  Experience comes from bad judgement."

"Wisdom is always wont to arrive late, and to be a little approximate on first possession."

"When you hear hoofbeats, think horses, not unicorns."

"Fast.  Cheap.  Good.  Pick two."

"We see a lot of arses on handlebars around here." - [J Ekdahl]

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

sparrow2 wrote:
How many places call this code?
If not to many force the compiler to make it inline, then you will save at least 7 clk.

The code I posted is at the heart of the SetPixel routine that is within the the bresenham code. So there are 18+ places that can call it. As I am going for speed over size in-lining is assumed :) There is 64 kilobytes so I can afford to waste 500 duplicating that code 20 times.

It was not the whole code block I was interested in though. It was just that one little part. Saving one clock cycle from VRAM access. That one construct to access the VRAM memory

VRAM[address];

is something I could not figure out how to hint/trick the compiler to do the obvious thing that the ASM does. In fact it is 1/2 the reason for me aligning the VRAM on a 10 bit boundary.*

I am asking here to "close the gap". I have it about as fast as (I think) it's ever going to happen on an AVR8.

I am hoping to learn how to make my C better/closer to ASM.

skeeve wrote:

unsigned char *vram2=(unsigned char*)(0xFF00 & (int)vram);

OK. I hardly understand that.

A temporary variable vram2
vram2 is a pointer to an u_char
it becomes the vram address masked with 0xFF00

I will try it in the code to see how it works out.

* The other 1/2 reason for the 10 bit alignment is the code that puts the pixel stream out to the TV. The pixel render code also gets the same 1 clock advantage. It is one of those ASM only - C could never do it things. It only has 5 clocks per pixel. A one clock advantage to it is significant.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

andrewm1973 wrote:
How do I hint/convince/trick GCC into making VRAM[Address] only bother to add/sub the top 8 bits?
When you have reached that level, you want to use assembler (likely) or badly hack GCC (unlikely).

avrfreaks does not support Opera. Profile inactive.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

andrewm1973 wrote:
The code I posted is at the heart of the SetPixel routine that is within the the bresenham code. So there are 18+ places that can call it. As I am going for speed over size in-lining is assumed :) There is 64 kilobytes so I can afford to waste 500 duplicating that code 20 times.
Is this the case for SetRamTile() as well? If so, GCC does a better job when inlining, because arguments are likely to be moved into the correct register immediately instead of comforming to the calling convention:
void SetRamTile(uint8_t x, uint8_t y, uint8_t tile) {
  uint8_t hi = (y>>3) & 3;
  uint8_t lo = (x<<3) | (y&7);
  vram[(hi<<8) | lo] = tile;
}

int main (void) {
  while (1) {
    SetRamTile(GPIOR0, GPIOR1, GPIOR2);
  }
}
00000090 
: int main (void) { while (1) { SetRamTile(GPIOR0, GPIOR1, GPIOR2); 90: 9b b5 in r25, 0x2b ; 43 92: 8a b5 in r24, 0x2a ; 42 94: ee b3 in r30, 0x1e ; 30 void SetRamTile(uint8_t x, uint8_t y, uint8_t tile) { uint8_t hi = (y>>3) & 3; uint8_t lo = (x<<3) | (y&7); 96: ee 0f add r30, r30 98: ee 0f add r30, r30 9a: ee 0f add r30, r30 9c: 28 2f mov r18, r24 9e: 27 70 andi r18, 0x07 ; 7 a0: e2 2b or r30, r18 void SetRamTile(uint8_t x, uint8_t y, uint8_t tile) { uint8_t hi = (y>>3) & 3; a2: 86 95 lsr r24 a4: 86 95 lsr r24 a6: 86 95 lsr r24 a8: 83 70 andi r24, 0x03 ; 3 uint8_t lo = (x<<3) | (y&7); vram[(hi<<8) | lo] = tile; aa: f0 e0 ldi r31, 0x00 ; 0 ac: f8 2b or r31, r24 ae: e0 50 subi r30, 0x00 ; 0 b0: ff 4f sbci r31, 0xFF ; 255 b2: 90 83 st Z, r25 b4: ed cf rjmp .-38 ; 0x90

The meat is now down to 15 words and 16 cycles. There is still a word/cycle wasted by an ldi/or where a mov would suffice, and the lsl/swap trick is unsurprisingly still absent, but the margin is now down to 2 words/cycles over the optimal ASM.

Quote:
A temporary variable vram2
vram2 is a pointer to an u_char
it becomes the vram address masked with 0xFF00

I will try it in the code to see how it works out.

I haven't tried it, but I expect care will be required in order to not incur additional overhead in the form of mov/movw instructions, and you'll likely still need to break the complex expression down as before.

Quote:
It is one of those ASM only - C could never do it things.
Never say never ;) ...

"Experience is what enables you to recognise a mistake the second time you make it."

"Good judgement comes from experience.  Experience comes from bad judgement."

"Wisdom is always wont to arrive late, and to be a little approximate on first possession."

"When you hear hoofbeats, think horses, not unicorns."

"Fast.  Cheap.  Good.  Pick two."

"We see a lot of arses on handlebars around here." - [J Ekdahl]

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

SprinterSB wrote:
or badly hack GCC
Type-punning and the like? Yeah, best avoided.

Any other hacking you can think of?

"Experience is what enables you to recognise a mistake the second time you make it."

"Good judgement comes from experience.  Experience comes from bad judgement."

"Wisdom is always wont to arrive late, and to be a little approximate on first possession."

"When you hear hoofbeats, think horses, not unicorns."

"Fast.  Cheap.  Good.  Pick two."

"We see a lot of arses on handlebars around here." - [J Ekdahl]

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

joeymorin wrote:
andrewm1973 wrote:
A temporary variable vram2
vram2 is a pointer to an u_char
it becomes the vram address masked with 0xFF00

I will try it in the code to see how it works out.

I haven't tried it, but I expect care will be required in order to not incur additional overhead in the form of mov/movw instructions, and you'll likely still need to break the complex expression down as before.

I could not get the suggested pointer tricks to come in any better than my baseline shown below

uint8_t eXL = PINB;     // IN something from PINB and PINC
uint8_t eXH = PINC;     // so the compiler doesn't optimize

eXH &= 3;

vram[eXH<<8|eXL] = 0x22;

Skeeve - Any tips to how to change the above with your pointer suggestion?

joeymorin wrote:
Quote:
It is one of those ASM only - C could never do it things.
Never say never ;) ...

I feel pretty confident I can say never here.

Willing to be proven wrong though.

I can start up another thread about how the 4.1 colour video mode ASM works. I probably would even offer a prize/reward for any C guru that can duplicate it without using any ASM.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

andrewm1973 wrote:
I can start up another thread about how the 4.1 colour video mode ASM works. I probably would even offer a prize/reward for any C guru that can duplicate it without using any ASM.

Now your talking. :wink: Many here like a good challenge. :)

"I may make you feel but I can't make you think" - Jethro Tull - Thick As A Brick

"void transmigratus(void) {transmigratus();} // recursio infinitus" - larryvc

"It's much more practical to rely on the processing powers of the real debugger, i.e. the one between the keyboard and chair." - JW wek3

"When you arise in the morning think of what a privilege it is to be alive: to breathe, to think, to enjoy, to love." -  Marcus Aurelius

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

andrewm1973 wrote:

uint8_t eXL = PINB;     // IN something from PINB and PINC
uint8_t eXH = PINC;     // so the compiler doesn't optimize

eXH &= 3;

vram[eXH<<8|eXL] = 0x22;

Skeeve - Any tips to how to change the above with your pointer suggestion?


unsigned char *vram2=(unsigned char*)(0xFF00 & (int)vram);
uint8_t eXL = PINB;     // IN something from PINB and PINC
uint8_t eXH = PINC;     // so the compiler doesn't optimize

eXH &= 3;

vram2[eXH<<8|eXL] = 0x22;

vram2 is a pointer whose high byte can be found with "LDI , hi8(vram)"
and whose low byte does not need finding because it is 0.

"Demons after money.
Whatever happened to the still beating heart of a virgin?
No one has any standards anymore." -- Giles

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Skeeve,

When I tried that I ended up with

	unsigned char *vram2=(unsigned char*)(0xFF00 & (int)vram);
	 uint8_t eXL = PINB;     // IN something from PINB and PINC
    9e02:	23 b1       	in	r18, 0x03	; 3
	 uint8_t eXH = PINC;     // so the compiler doesn't optimize
    9e04:	46 b1       	in	r20, 0x06	; 6

	 eXH &= 3;

	 vram2[eXH<<8|eXL] = 0x22;
    9e06:	e0 e0       	ldi	r30, 0x00	; 0
    9e08:	fc e0       	ldi	r31, 0x0C	; 12
    9e0a:	e0 70       	andi	r30, 0x00	; 0
    9e0c:	94 2f       	mov	r25, r20
    9e0e:	93 70       	andi	r25, 0x03	; 3
    9e10:	80 e0       	ldi	r24, 0x00	; 0
    9e12:	30 e0       	ldi	r19, 0x00	; 0
    9e14:	82 2b       	or	r24, r18
    9e16:	93 2b       	or	r25, r19
    9e18:	e8 0f       	add	r30, r24
    9e1a:	f9 1f       	adc	r31, r25
    9e1c:	82 e2       	ldi	r24, 0x22	; 34
    9e1e:	80 83       	st	Z, r24

Similar with either -Os or -O3

Also tried with the 4.7.2 in AVR studio and a 4.8.x from somewhere early last year.

All of them don't drop the 16 bit math. Some of them are just strange (as above).

Do I need a newer GCC?

If it's temperamental about it, probably just stick to SprinterSB advice that this one is best left in the "just better of with ASM" catergory.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

avr-gcc seems to have about the right idea regarding vram2,
but it explicitly loads the lower byte and does some strange things.
In the worst case, there should have been 5 instructions.
Perhaps casting eXH<<8 to 8 bits would help.
The best case would be 3 instructions.

What does the .s file look like?
I'd like to know what symbols some of those numbers came from.

Edit: (unsigned char)(eXH<<8)==0 .

"Demons after money.
Whatever happened to the still beating heart of a virgin?
No one has any standards anymore." -- Giles

Last Edited: Wed. Feb 5, 2014 - 04:52 PM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Skeeve, that is a cut and paste of the .lss file.

What else can I post to help?

BTW if I type cast to uchar it gets better

	 vram2[(unsigned char)(eXH<<8)|eXL] = 0x22;
    9e06:	e0 e0       	ldi	r30, 0x00	; 0
    9e08:	fc e0       	ldi	r31, 0x0C	; 12
    9e0a:	e0 70       	andi	r30, 0x00	; 0
    9e0c:	e8 0f       	add	r30, r24
    9e0e:	f1 1d       	adc	r31, r1
    9e10:	82 e2       	ldi	r24, 0x22	; 34
    9e12:	80 83       	st	Z, r24
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

OK found the -S switch

main:
.LFB2:
.LSM0:
	push r16
/* prologue: function */
/* frame size = 0 */
.LSM1:
	in r24,35-32
.LVL0:
.LSM2:
	in r25,38-32
.LVL1:
.LSM3:
	ldi r30,lo8(vram)
	ldi r31,hi8(vram)
	andi r30,lo8(-256)
	add r30,r24
	adc r31,__zero_reg__
	ldi r24,lo8(34)
.LVL2:
	st Z,r24
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

joeymorin wrote:
SprinterSB wrote:
or badly hack GCC
Type-punning and the like? Yeah, best avoided.

Any other hacking you can think of?

I meant hacking the GCC sources. Just the fact that an object is aligned does not justify to skip the high bits of address computation.

avrfreaks does not support Opera. Profile inactive.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

SprinterSB wrote:
I meant hacking the GCC sources.

Dear lord that sounds drastic.

Also I installed 4.8.2 from the MHV tools set. No joy there

I think we put this one in the ASM is marginally ahead basket.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Not real relevant for the last part of this but.
I downloaded studio 6.1 sunday, (upto now I have used a winAVR 10.10.10 or something like that), and the code it generate use a MUL by 8 for the <<3 (both for -OS and -O2) how do you get it to make "your" output code ?

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

skeeve wrote:
Perhaps casting eXH<<8 to 8 bits would help.
Oops.
Casting to 8 bits makes it 0.
code semicolon comments by skeeve, andrewm1973 wrote:
OK found the -S switch

main:
.LFB2:
.LSM0:
	push r16
/* prologue: function */
/* frame size = 0 */
.LSM1:
	in r24,35-32    ; could have been r30
.LVL0:
.LSM2:
	in r25,38-32
.LVL1:
.LSM3:
	ldi r30,lo8(vram)  ; what I was looking for
	ldi r31,hi8(vram)
	andi r30,lo8(-256)
	add r30,r24   ; could have been a mov or nothing
	adc r31,__zero_reg__  ; could have been nothing
	ldi r24,lo8(34)
.LVL2:
	st Z,r24

BTW I fairly easily get 6 instructions and 7 cycles for the whole works without casting away eXH.
    in r30, PINB
    in r31, PINC
    andi r31, 3
    addi r31, hi8(vram)
    ldi r24, 0x22
    st Z, r24

"Demons after money.
Whatever happened to the still beating heart of a virgin?
No one has any standards anymore." -- Giles

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

sparrow2 wrote:
Not real relevant for the last part of this but.
I downloaded studio 6.1 sunday, (upto now I have used a winAVR 10.10.10 or something like that), and the code it generate use a MUL by 8 for the <<3 (both for -OS and -O2) how do you get it to make "your" output code ?
It wasn't on AS (I use the plain old avr-gcc toolchain from Atmel with 4.8.1) but I saw that behaviour as well.

However, it was in a specific circumstance. I had compiled and linked (a test program with a skeletal main() which called SetRamTile()) in one step, but had not used -ffunction-sections -Wl,--gc-sections.

SetRamTile() was present as a callable function with a ret, but wasn't ever called. It was also inlined into main. The dead function contained the mul, but the live in-lined code used the 3 lsl. Don't know why. Didn't dig deeper.

Curiously, the mul approach has the same flash word cost, but a higher (by 1) cycle cost and greater register pressure since the results end up in r0/r1.

I expect there is an optimisation rule that catches this kind of thing, but not for dead code (why would it?)

JJ

"Experience is what enables you to recognise a mistake the second time you make it."

"Good judgement comes from experience.  Experience comes from bad judgement."

"Wisdom is always wont to arrive late, and to be a little approximate on first possession."

"When you hear hoofbeats, think horses, not unicorns."

"Fast.  Cheap.  Good.  Pick two."

"We see a lot of arses on handlebars around here." - [J Ekdahl]

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

R0/R1 does not add up to the register pressure because these registers are used implicitly only.

avrfreaks does not support Opera. Profile inactive.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

SprinterSB wrote:
I meant hacking the GCC sources
Yeesh.
Quote:
Just the fact that an object is aligned does not justify to skip the high bits of address computation.
I was thinking (inadvisably) of:
  unsigned char *vram2=(unsigned char*)(0xFF00 & (int)vram);
  uint16_t eX;

  *((uint8_t*)(&eX)+0) = PINB; // low
  *((uint8_t*)(&eX)+1) = PINC; // high

  vram2[eX] = 0x22; 

But I don't know how to keep the stack frame from getting involved:

int main (void) {
  90:	cf 93       	push	r28
  92:	df 93       	push	r29
  94:	00 d0       	rcall	.+0      	; 0x96 
  96:	cd b7       	in	r28, 0x3d	; 61
  98:	de b7       	in	r29, 0x3e	; 62

  unsigned char *vram2=(unsigned char*)(0xFF00 & (int)vram);
  uint16_t eX;

  *((uint8_t*)(&eX)+0) = PINB; // low
  9a:	83 b1       	in	r24, 0x03	; 3
  9c:	89 83       	std	Y+1, r24	; 0x01
  *((uint8_t*)(&eX)+1) = PINC; // high
  9e:	86 b1       	in	r24, 0x06	; 6
  a0:	8a 83       	std	Y+2, r24	; 0x02

uint8_t vram[256] __attribute__ ((__aligned__ (1024))) ;

int main (void) {

  unsigned char *vram2=(unsigned char*)(0xFF00 & (int)vram);
  a2:	e0 e0       	ldi	r30, 0x00	; 0
  a4:	f4 e0       	ldi	r31, 0x04	; 4
  a6:	ee 27       	eor	r30, r30
  uint16_t eX;

  *((uint8_t*)(&eX)+0) = PINB; // low
  *((uint8_t*)(&eX)+1) = PINC; // high

  vram2[eX] = 0x22; 
  a8:	89 81       	ldd	r24, Y+1	; 0x01
  aa:	9a 81       	ldd	r25, Y+2	; 0x02
  ac:	e8 0f       	add	r30, r24
  ae:	f9 1f       	adc	r31, r25
  b0:	82 e2       	ldi	r24, 0x22	; 34
  b2:	80 83       	st	Z, r24

}
  b4:	0f 90       	pop	r0
  b6:	0f 90       	pop	r0
  b8:	df 91       	pop	r29
  ba:	cf 91       	pop	r28

Yeesh.

"Experience is what enables you to recognise a mistake the second time you make it."

"Good judgement comes from experience.  Experience comes from bad judgement."

"Wisdom is always wont to arrive late, and to be a little approximate on first possession."

"When you hear hoofbeats, think horses, not unicorns."

"Fast.  Cheap.  Good.  Pick two."

"We see a lot of arses on handlebars around here." - [J Ekdahl]

 

Last Edited: Wed. Feb 5, 2014 - 05:44 PM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

SprinterSB wrote:
R0/R1 does not add up to the register pressure because these registers are used implicitly only.
Does it not require that r1 be zeroed again? I suppose that's not 'register pressure' though...

"Experience is what enables you to recognise a mistake the second time you make it."

"Good judgement comes from experience.  Experience comes from bad judgement."

"Wisdom is always wont to arrive late, and to be a little approximate on first possession."

"When you hear hoofbeats, think horses, not unicorns."

"Fast.  Cheap.  Good.  Pick two."

"We see a lot of arses on handlebars around here." - [J Ekdahl]

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Correct. "Register pressure" would mean "take away the register(s) from other code so it cannot hold values in these registers".

The need to clear R1 adds to consumed time and code size.

avrfreaks does not support Opera. Profile inactive.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

One might try replacing vram with ((unsigned char*)0xC00).

In any case, I suspect that SetPixelTile should be to made into a macro.
That would allow speedups through common sub-expression elimination.

"Demons after money.
Whatever happened to the still beating heart of a virgin?
No one has any standards anymore." -- Giles

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Such hacks won't help. GCC does not perform such optimizations.

avrfreaks does not support Opera. Profile inactive.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

SprinterSB wrote:
Such hacks won't help. GCC does not perform such optimizations.
Wait... mine or skeeve's? Mine I expect...

"Experience is what enables you to recognise a mistake the second time you make it."

"Good judgement comes from experience.  Experience comes from bad judgement."

"Wisdom is always wont to arrive late, and to be a little approximate on first possession."

"When you hear hoofbeats, think horses, not unicorns."

"Fast.  Cheap.  Good.  Pick two."

"We see a lot of arses on handlebars around here." - [J Ekdahl]

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Closest I've got it is using 0xC0. If I mention vram it does the 16bit maths.

	unsigned char eXL = PINB;     // IN something from PINB and PINC
	unsigned char eXH = PINC;     // so the compiler doesn't optimize

	 eXH &= 3;
	 eXH |= 0x0C;

	 unsigned char *vram2 = (unsigned char *)((eXH<<8)|(eXL));
	 vram2[0] = 0x22;

Now if it could just not load into R24 and then move it into R31.

	unsigned char eXL = PINB;     // IN something from PINB and PINC
    9900:	e3 b1       	in	r30, 0x03	; 3
	unsigned char eXH = PINC;     // so the compiler doesn't optimize
    9902:	86 b1       	in	r24, 0x06	; 6

	 eXH &= 3;
    9904:	83 70       	andi	r24, 0x03	; 3
	 eXH |= 0x0C;
    9906:	8c 60       	ori	r24, 0x0C	; 12

	 unsigned char *vram2 = (unsigned char *)((eXH<<8)|(eXL));
    9908:	f0 e0       	ldi	r31, 0x00	; 0
    990a:	f8 2b       	or	r31, r24
	 vram2[0] = 0x22;
    990c:	82 e2       	ldi	r24, 0x22	; 34
    990e:	80 83       	st	Z, r24
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Mind the implicit integer promotions. Does -mint8 help?

avrfreaks does not support Opera. Profile inactive.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

-mint8 does not stop the r24 -> r31 waste.

It does however seem to break just about everything in the uzebox kernel so prob would not have been an option even if it helped.

Now that I have 4.8.2 set up - it seems to do better in some other things than 4.7.x

I will go through my list of things I did in ASM now and try pick out the next simplest/smallest example I would like to learn how to make C do as well or get close.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I've got the candles, I've got the chalk pentagram.  Let the necromancy begin.

 

Howdy all, I've been gone over a year doing stuff.  I'm back to optimizing C code again now.  My next one is about multiplying an 8 bit unsigned number with an 8 bit signed number and then right shifting the result by 8 bits (its fixed point).

 

	xAdd = (Z * xAdd)>>8;

The code in C works.  The little aliens fly around the screen exactly as I expected my fixed point maths would make them.  It however makes this code

 

int8_t  xAdd;
uint8_t Z;
.
.
.
xAdd = (Z * xAdd)>>8;
    mov	    r18, r20
    ldi	    r19, 0x00	; 0
    adiw    r24, 0x01	; 1
    add	    r24, r24
    adc	    r25, r25
    add	    r30, r24
    adc	    r31, r25
    ld	    r24, Z
    eor	    r25, r25
    sbrc    r24, 7
    com	    r25
    mul	    r18, r24
    movw    r20, r0
    mul	    r18, r25
    add	    r21, r0
    mul     r19, r24
    add	    r21, r0
    eor	    r1, r1
    mov	    r20, r21
    add	    r21, r21
    sbc	    r21, r21

I was doing this in a .S file but just yesterday I learnt how to do inline ASM (to save some MOV/RCALL/RETs) so will show my version here in that form

 

asm(   "mulsu	%r0, %r1 \n\t"
       "mov	%r0, r1\n\t"
       "clr	r1"
       : "+a" (xAdd)
       : "a" (Z)
);

Which predictably looks like this in the LSS file

 

asm(	"mulsu	%r0, %r2 \n\t"

    mulsu   r18, r23
    mov	    r18, r1
    eor     r1, r1

	"clr	r1"
	: "+a" (yAdd)
	: "a" (Z)

So - what am I doing wrong.  How do I hint to the compiler to make my (signed 8:0) * (unsigned 0:8) be close to 4 clock cycles ?  I know I am a pretty average C coder, but I am trying to improve that so I have to resort to ASM less often.
 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Have you tried using any optimization level besides -O0? With this test program:

 

extern int8_t x;
extern uint8_t y;
int8_t mul(void)
{
    return (x * y) >> 8;
}

and compiled with -O1 I get this assembly output:

  10                    mul:
  11                    /* prologue: function */
  12                    /* frame size = 0 */
  13                    /* stack size = 0 */
  14                    .L__stack_usage = 0
  15 0000 3091 0000             lds r19,x
  16 0004 2091 0000             lds r18,y
  17 0008 3203                  mulsu r19,r18
  18 000a C001                  movw r24,r0
  19 000c 1124                  clr __zero_reg__
  20 000e 892F                  mov r24,r25
  21 0010 0895                  ret

Curiously, -O2, -O3, and -Os swap r18 and r19, but it ends up with the same results.

 

This is with avr-gcc version 4.8.2.

 

I say avr-gcc does a decent job with this task!

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

The code I pasted above was with 4.8.2 with -Os.

 

Yes it does do a semi-reasonable effort if it is in a subroutine and operating on globals. That is 20 clocks when you include the RCALL, RET and the two ST to get the local/automatics into a global RAM address. So thats 4 clocks faster than just

 

result = (scale * size)>>8;

 

So so far I have

 

24 clocks = C code operating on variables already in register
20 clocks = C code calling a subroutine working on globals (include call/ret/shuffling)
15 clocks = ASM in a .S file being called as a subroutine (include call/ret/shuffling)
4 clocks = inline ASM using the registers the variables are already in

 

Is there not a way to get the compiler closer?  When I do a subroutine as Christop showed I get the same result as he did.  When I do it inline it blows out.

 

Joey Morin and others gave me some good pointers in the previous examples I had. In this one apart from it being a return() value straight away, I can not work out how to get the compiler to not promote the whole thing to 16 bits.

 

I guess the hint to look at is why does

 

return(X * Y) >> 8;

 

spot that the result is 8 bits in r1 but

 

result = (X * Y) >> 8;

 

tries to work on 16 bit values ?
 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

If it helps here is the code I am working on for this.  The MulSU function is called by around 10 different routines with different variables - not just here.

 

void moveXYCommon(ObjectDescStruct *objData){

	uint8_t Z;
	uint8_t lane;

	Z = pgm_read_byte(&zTable[objData->z]);					// Perspective transform the objects Z value

	objData->s = pgm_read_byte(&scaleTable[objData->z]);	// Get the scale of the object at given Z value

	lane = objData->laneHi;

	int8_t x = laneData[lane].xStart;						// Get the X/Y start position of the current lane
	int8_t y = laneData[lane].yStart;

	int8_t xAdd = laneData[lane].xGrad;						// Get the gradient of the lane
	int8_t yAdd = laneData[lane].yGrad;

	xAdd = (Z * xAdd)>>8;									// Work out how far the object is from the start of the lane
	yAdd = (Z * yAdd)>>8;

	//xAdd = MulSU(Z, xAdd);
	//yAdd = MulSU(Z, yAdd);

	//MulSU2(Z, xAdd, yAdd);

	objData->x = x + xAdd;									// Add the delta X/Y to the lane start position
	objData->y = y + yAdd;
}

 

The commented out lines are the equivelant ASM routine from a .S file and the below #define

 

#define MulSU1(Z, a) \
		asm(	"mulsu	%r0, %r1 \n\t"	\
				"mov	%r0, r1\n\t"	\
				"clr	r1"				\
				: "+a" (a)				\
				: "a" (Z)				\
		);

#define MulSU2(Z, a, b) \
		asm(	"mulsu	%r0, %r2 \n\t"	\
				"mov	%r0, r1\n\t"	\
				"mulsu	%r1, %r2 \n\t"	\
				"mov	%r1, r1\n\t"	\
				"clr	r1"				\
				: "+a" (a)				\
				, "+a" (b)				\
				: "a" (Z)				\
		);

 

As I said - I only just worked out how to do inline ASM yesterday, but it seems to be useful.

 

Maybe I have done something dumb in one of the lines above the multiplication?  I've tried a few different things to see if I can help it.  Splitting the (Z * xAdd) and the >>8 up only made it worse.
 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Before using time on a other way, this question, does the compiler make better code if both are unsigned ?(Then the mul can be from any register, so perhaps it does it different).

 

And a side question to the experts are there a way to inline an ASM function?  

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

As you are fussing with this construct and cycle-counting, have you tried 8 bits times 8 bits to get a 16-bit result -- and then just force taking the high byte with a cast or union?  Perhaps you can trick the compiler this way.  (Actually, I call it "giving the compiler a hint" and imply your ASM mindset.)

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

sparrow2 wrote:
And a side question to the experts are there a way to inline an ASM function?  

You can accomplish that either by using a #define or by defining a static inline function.

 

#define zbDsblInt()                               \
({                                                \
    zb_CPU_stat_t zb_flag;                        \
    __asm__ __volatile__                          \
    (                                             \
        "in %0, 0x3f"       "\n\t"                \
        "cli"               "\n\t"                \
        : "=r" (zb_flag)                          \
        :                                         \
        : "memory"                                \
    );                                            \
    zb_flag;                                      \
})

 

or

 

static inline uint8_t zbDsblInt(void) __attribute__((always_inline));
static inline uint8_t zbDsblInt(void)
{
    uint8_t zb_flag;
    __asm__ __volatile__
    (
        "in %0, 0x3f"       "\n\t"
        "cli"               "\n\t"
        : "=r" (zb_flag)
        :
        : "memory"
    );
    zb_flag;
}

 

Don Kinzer
ZBasic Microcontrollers
http://www.zbasic.net

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Just for fun, I poked mul() into a CodeVision test program as an inline function.

 

The results came fairly close -- except for the call to the utility routine to do the signed/unsigned mixed multiply adjustment.

 

[I had to use CV's "Promote char to int?" option. ;)  ]

 

Now, if you go through the possibilities and don't force it to do the adjustment, then it might come out "clean"?

 

 

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I copied your latest code to a standalone C source file (with a few missing definitions added to make it compile). That code is at http://pastebin.com/DKin9rVS.

 

Try as I might, I couldn't get avr-gcc to produce the longer multiplication code sequence with anything but -O0. -Os/-O1/-O2/-O3 all produce only two mulsu instructions, one for each multiplication.

 

Can you copy and paste the exact avr-gcc command that is used to compile your code? I suspect that this source file is not getting the right optimization flag for some reason.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Theusch, it only does the crazy long thing when one of the arguments is unsigned.

Christop, thanks - I will download your example from pastebin and investigate my makefile to see if I am breaking it there somehow.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Thanks Christop, It's defintly something in the makefile or environment for that project.

 

In a totally clean environment I get this

xAdd = (Z * xAdd)>>8;									// Work out how far the object is from the start of the lane
  mulsu   r21, r19
  movw    r24, r0
  eor     r1, r1
    
yAdd = (Z * yAdd)>>8;
  mulsu   r20, r19
  movw    r30, r0
  eor     r1, r1

Which is only one clock behind the inline ASM for the same thing - close enough for this part of the code that it will stay as C.

 

I have looked further and no where in my code that is doing similar fixed math is FMUL being used.  However in the kernel code FMUL is being used.  They are both being compiled with the same flags

 

CFLAGS = -mmcu=$(MCU) -nostartfiles
CFLAGS += -Wall -gdwarf-2 -std=gnu99 -DF_CPU=28636360UL -Os -fsigned-char -ffunction-sections -fno-toplevel-reorder
CFLAGS += -MD -MP -MT $(*F).o -MF dep/$(@F).d 
CFLAGS += -g3 -gdwarf-2

In fact in a clean environment even -O0 gives the almost optimal code.

 

The only difference I can see with my code and the kernel is that mine is chock full of things that look like this.

 

 __attribute__ ((section (".renderlinesa")))

 __attribute__ ((section (".freeflash")))

 __attribute__ ((section (".renderlinesb")))

 

They are both using

avr-gcc (GCC) 4.8.2 20131010

that came with MHVTools.

 

I'll get back here once I solve the mystery.
 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Straight into another question.

 

I have a table in flash that is aligned to a 256 byte boundary for speed of access in my ASM

 

LDFLAGS += -Wl,--section-start=.trigtableflash=0x00002100
const uint8_t SinCosTable[] __attribute__ ((section (".trigtableflash"))) =  {

and the ASM for sin() and cos() looks like this

 

; int8_t CosFastC(uint8_t angle)
;
; Returns the cosine of the angle
;
; Inputs
;		Angle (0..255) in r24
; Returns
;		cos(angle) as signed 8 bit value in r24
; Trashes
;		R24
;		R26:27

CosFastC:

     subi    r24, (-(64))			; COS is 90 degrees out of phase with SIN

SinFastC:

     mov     r30, r24				; Get the offset in the SIN table
     ldi     r31, hi8(SinCosTable)
     lpm     r24, Z				; Read value from table into r24
     ret

I don't think this one is likely - but is there are a way to make the C code not do

 

R = pgm_read_byte(&SinCosTable[objData->r]);
     adiw    r26, 0x05	; 5
     ld      r30, X
     sbiw    r26, 0x05	; 5
     ldi     r31, 0x00	; 0
     subi    r30, 0x00	; 0
     sbci    r31, 0xDF	; 223
     lpm     r30, Z

It's the LDI/SUBI/SBCI that the alignment was intended to get rid of.

 

The compiler KNOWs the table is 8 bit boundary, can you convince it that it does not need to subtract ZERO from the address?

 

Last Edited: Fri. Apr 17, 2015 - 02:49 AM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Don't expect C to make much better code. 

But perhaps try a pointer that point to SinCosTable and add the index, (perhaps place it in a register pair.)

 

But why do you want to change it to C if it's already running in ASM ?

 

If you really hunt speed then make some inline code where speed really matter to save the call/ret.

In ASM change to use IJUMP (I know that might change the structure away form what C use, and I guess that your ASM does aswell) 

 

Last Edited: Fri. Apr 17, 2015 - 07:37 AM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

sparrow2 wrote:

But why do you want to change it to C if it's already running in ASM ?

 

It's a fairly big game project for the Uzebox, I am trying to get as much of the non video rendering stuff into C as possible for other people to be able to read and learn from it.

 

The section of the game that renders all the video to screen and draws all the lines is in ASM as C won't be able to do that.

 

However the sections of the game that do

 

Object management

Object movement

Collision detection

Game logic

 

Are not so time critical and are mostly in C already.

 

The Sin/Cos thing is called often enough that I will code it as inline ASM though.  The inline ASM thing is not as intimidating as it first looked now I have bothered to work it out.

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

It's a fairly big game project for the Uzebox, I am trying to get as much of the non video rendering stuff into C as possible for other people to be able to read and learn from it.

 

If you make a good documentation of how it works, I would prefer clean ASM, than some C with all kind of compiler "hacks". 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Yes - anything that looks uglier in C to make it optimal won't be used.

 

For example the above code where #defining an inline ASM macro only saved one clock over

 

xAdd = (xAdd * Z) >> 8;
yAdd = (yAdd * Z) >> 8;

 

is out.  But the inline ASM to replace

 

xAdd = (SinFastC(thisLaneAngle + edgeAngle) * laneWidth) >> 8;
yAdd = (CosFastC(thisLaneAngle + edgeAngle) * laneWidth) >> 8;

is in.  Because not only is it a lot of clocks, it also is called quite often.

 

Also any ASM code is pretty well explained when people do want to look at it - as an example

 

; fast_line_convert_x0_y0_into_VRAM_address
;
; converts the X0 and Y0 address passed in r24 and r22 into a VRAM memory
; location and leaves this result in R26:27 (VRAM_Address)
;
; Inputs
;            Y0 address = R22
;            X0 address = R24
; Outputs
;            VRAM_Address = R26:27 (X)
;
; Requires that the constants 4 and 32 are pre-loaded in R10, R11
;
; Trashes R0:1

.macro fast_line_convert_x0_y0_into_VRAM_address
     mul     r22, r10                               ; Multiply Y0 by 4   y7y6y5y4y3y2y1y0      => .0.0.0.0.0.0y7y6:y5y4y3y2y1y0.0.0
     movw    r26, r0                                ; move the 16 bit result into VRAM_Address
     andi    r26, 0b11100000                        ; clear out the bits that are used for Xn     .0.0.0.0.0.0y7y6:y5y4y3.0.0.0.0.0
     mul     r24, r11                               ; Multiply X0 by 32  x7x6x5x4x3x2x1x0      => .0.0.0x7x6x5x4x3:x2x1x0.0.0.0.0.0
     or      r26, r1                                ; OR X7..3 into low byte of VRAM_Address      .0.0.0.0.0.0y7y6:y5y4y3x7x6x5x4x3
     subi    r27, hi8(-(vram))                      ; Add the base address of VRAM                .0.0.0.0.1.1y7y6:y5y4y3x7x6x5x4x3
.endm

And any code that is almost a direct equivalent of something from C has the C code above it in comments.

 

/*
void renderObjects(void){
     uint8_t i;
     ObjectDescStruct *Current;

     drawFunctionPointer_t drawFunction;

     for(i = 0; i < MAX_OBJS; i++) {
          Current = (ObjectDescStruct*)&ObjectStore[i];
          if(Current->obType != OBJ_EMPTY) {
               drawFunction = (drawFunctionPointer_t)pgm_read_word(&drawFunctionPointers[Current->obType]);
               drawFunction(Current);
          }
     }
}
*/

renderObjects:
     fast_line_entry_C                               ; save all the registers C does not want trashed
     fast_line_entry                                 ; set up all the registers for the draw line routines

     ldi		r28, lo8(ObjectStore)        ; Get the base address of the ObjectStore[] array
     ldi		r29, hi8(ObjectStore)
renderObjectsLoop:
     ld      r30, Y                                  ; Get the Object Type of the current object (r30 = ObjectStore[i].ObjType)
     cpi     r30, 0x00                               ; If the object type is <zero>
     breq    renderObjectsSkip                       ; we don't need to draw anything
     add     r30, r30                                ; Multiply the object number by 2 (flash addressing)
     clr     r31                                     ; and clear ZH (NB: Object Type can not be > 128 for this to work)
     subi    r30, lo8(-(drawFunctionPointers))       ; Add the base address to the function pointer table in PROGMEM
     sbci    r31, hi8(-(drawFunctionPointers))
     lpm     r26, Z+                                 ; Get the address of the code/routine that draws the given
     lpm     r27, Z+                                 ; object type into Register X
     movw    r30, r26                                ; move that value into Z ready for the ICALL
     icall                                           ; Call the routine that draws the object

renderObjectsSkip:
     adiw    r28, 0x08                               ; add 8 to Y to point to the next element of ObjectStore[]
     cpi     r29, 0x04                               ; if the address of Y hits 0x0400 we are at the end of the object store
     brne    renderObjectsLoop                       ; else loop back

     fast_line_exit_C                                ; restore the registers C needs to not be trashed
     ret

It is a fairly famous game I am porting to the Uzebox and I suspect a few people to be interested in how I managed it.
 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

  xAdd = (Z * xAdd)>>8;                                    // Work out how far the object is from the start of the lane
  mulsu   r21, r19
  movw    r24, r0
  eor     r1, r1

 I'm actually pretty impressed.  I thought that mixed-size arithmetic (multiplying two bytes to get 16 bits) was one of the last holdout advantages for assembler. (that, and direct use of the carry bit.)

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

But the biggest problem : you never know when the compiler change the code to something bigger!

And remember for the compiler you haven't done the >>8 yet! (it should have been something like mov r24,r1) , I guess it will add: mov r24,r25

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

sparrow2,

 

straight after the

movw r24, r0

if does

add    r25, r19
std    Y+3, r25

so there is no extra move to get the >>8.

 

The only real waste in those few lines is the extra unnecessary eor r1,r1 it has to do again a few clocks later.

 

BTW - I reinstalled 4.8.2, cleaned up my project and removed a lot of orphaned code and files and I could get the compiler to come up with the mulsu code above.

 

HOWEVER the compiler compiler completly screwed up a pointer in another part of the code and had me chasing a phantom bug for 2 days.

 

Installed 4.9.2 20140912 and the compiler now gets the ASM correct.  I suspect it was this bug that has been fixed as at 4.9.0

 

https://gcc.gnu.org/bugzilla/sho...

 

In 4.8.2 the code loaded a memory address for a pointer into X, did some additions to X to access members of the struct, then called another routine with the trashed version of X.

 

In 4.9.2 the address is loaded into r16, copied to X, access the members trashing X, reload X from r16, then call other routine.
 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I'm not sure that I get the hole thing. But as see it you don't need the movw r24, r0!

In ASM it should be done direct by storing  r1 (and I don't get the add r25,r19 it's not a +=)

 

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

sparrow2,

 

As shown in post #50 the next C statement straight after them is an =

	xAdd = (Z * xAdd)>>8;		// Work out how far the object is from the start of the lane
	yAdd = (Z * yAdd)>>8;

	//xAdd = MulSU(Z, xAdd);
	//yAdd = MulSU(Z, yAdd);

	//MulSU2(Z, xAdd, yAdd);

	objData->x = x + xAdd;		// Add the delta X/Y to the lane start position
	objData->y = y + yAdd;
}

Now if I was optimizing the group of statements in ASM I just would have added R1 to R25.  I was only looking at the single

xAdd = (xAdd * Z)>>8;

here though.  My inline ASM version moved R1 somewhere first because it was an optimization of a single line.  The C version using MOVW is the same speed as MOV.  Just trashes R24 needlessly.

 

I know that if I did everything in ASM I would make faster code, but in this section of the game were speed is not as critical and readability is, I accept a clock here and a clock there slower.

 

In the bits of code that need to be fast

 

  • renderObjects
  • drawXXXX
  • bresenhamLine
  • setPixel

 

which are called hierarchically - I have been able to globally optimize register usage and very aggressivly inline stuff with assembly macros.  Having the carry flag and not having to comply with the ABI needing R24 always being used for arguments has allowed me to make that stuff very fast.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I think this will be the last one.  I have picked most of the low hanging fruit.  In fact even with this one I am probably going to leave it in C.  I am curious if there is a way to get C to use MUL or ignore the MSB.

 

I have a 16 bit pointer that is the address of a structure in an array.  I need to go back to the index of the array

 

In effect I am doing

 

i = ( (uint16_t)(objectPointer) - (uint16_t)(&objectArray[0]) ) / sizeOf(object);

 

Now I know

  • the result of this answer can be at most 59 (a 6 bit number)
  • The sizeOf() the structure is 8 bytes
  • The base address of the array is 0x0220 (fixed address from a section)
  • Therefore the highest address the pointer could be is 0x03F8 (0x220 + 59*8)
  • Therefore the highest value the result of the /8 can be is 127 (a 7 bit number)

 

So I know if I do the divide by 8 first, I can just use the lowest 8 bits and minus 0x44 (0x0220 / 8 = 0x44) hence the most optimal C I can make is

 

i = ((uint8_t)((uint16_t)(ob2) / 8)) - 0x44;

which gets compiled to

 

ldi   r24, 0x03	; 3          Do the lsr/ror loop 3 times
lsr   r23
ror   r22
dec   r24
brne  .-8
subi  r22, 0x44	; 68

Which is 16 clocks.

 

In ASM I can

 

ldi   r21, 32        ; value to multiply by
mul   r23, r21
mov   r22, r1        ; get the lowest 5 bits of result into destination
mul   r24, r21
or    r22, r0        ; or in the highest 2 useful bits into destination
eor   r1,  r1
subi  r22, 0x44 ; 68 

 

that keeps the full 8 bits of the result in 9 clocks OR because I know the answer is only 7 bits I can

 

lsr   r23
ror   r22
lsr   r23
ror   r22
lsr   r22    ; last itteration does not need ror as result is 7 bits
subi  r22, 0x44

which is the answer I need in this specific case in 6 clocks.

 

Any way of getting C to understand I only have 7 bits and the short LSR/ROR trick can be taken?  Or even make it do the full 3 LSR/ROR (no BRNE.-8) and blow the code size out by 1 word and still get the answer in 7 clocks?
 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Can any one please tell me code vision avr is compatible with which usb programmer?its urgent

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Does -O3 unroll ?

If yes find which compiler flag that make the change.

Last Edited: Thu. Apr 23, 2015 - 08:05 AM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

perhaps only /4  and then /2 (but it's dangers next version perhaps see that it can be done as one operation )

because you only need 7 bit you know that /4 still can be hold the byte. 

 

 

Last Edited: Thu. Apr 23, 2015 - 09:36 AM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

And I don't understand that the compiler make the loop that way, instead of using the carry flag for the brance, it will be same size but save a clk for each loop, and no use of a register.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Sparrow2, -03 does unroll it but I have to use -0s to fit all my game into the mega644.  The difference between -O3 and -Os is about 4K in size saving.  When I include the music assets in the game I have about 300 bytes left.

 

I have just been reading some more about GCC and optimizing.  Looks like I have to put some of these functions in a seperate C file (fast.c) I can run -O3 on them, but leave the non critical stuff in slow.c and then link them.  There seems to be no way to just ask the compiler to switch optimize levels around bits of code.

 

But your idea of a /4 and /2 does work

 

ob1->animation = ((uint8_t)((uint8_t)((uint16_t)(ob2) / 4)) / 2) - 68;

becomes

 

    96e8:	76 95       	lsr	r23
    96ea:	67 95       	ror	r22
    96ec:	76 95       	lsr	r23
    96ee:	67 95       	ror	r22
    96f0:	66 95       	lsr	r22
    96f2:	64 54       	subi	r22, 0x44	; 68

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

sparrow2 wrote:

And I don't understand that the compiler make the loop that way, instead of using the carry flag for the brance, it will be same size but save a clk for each loop, and no use of a register.

 

you mean to

 

andi   r22, 0x03
ori    r22, 0x04
lsr    r23
ror    r22
brcc   .-6
subi   r22, 0x44

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

correct instructions wrong numbers! (AND to keep data and then the OR the "counter")

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

OK - another one.

extern uint8_t spibyte_ff(void);

int mmcGetInt(void){
    return( spibyte_ff() | spibyte_ff()<<8 );
}

comes out as

int mmcGetInt(void){
    push    r28

return( spibyte_ff() | spibyte_ff()<<8 );

    call    0xa9c4      ; 0xa9c4 <spibyte_ff>
    mov     r28, r24
    call    0xa9c4      ; 0xa9c4 <spibyte_ff>
    mov     r18, r28
    ldi     r19, 0x00   ; 0

}
    movw    r20, r18
    or      r21, r24
    movw    r24, r20
    pop     r28
    ret

and I would like it to look more like

 

    call    spibyte_ff
    push    r24
    call    spibyte_ff
    pop     r25
    ret

Technically in the ASM version I don't need to push/pop as I know R25 is untouched by spibyte_ff, but I am keeping it C/ABI compatible and assuming R25 could get trashed.

 

 

I'm using GCC 4.9.2 with -Os.  I have an almost full Mega644 and am going through C code and trying to save an extra few K to fit in another song now.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I don't expect that it can come all the way down to that.

But first try (I don't have your compiler here) to replace the | with + (with no overlap it's the same)

And try to make a local U16 you assign to.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Had already tried

 

uint16_t i;

((char*)&i)[0] = mmc_get_byte();
((char*)&i)[1] = mmc_get_byte();
return i;

and

 

return( mmc_get_byte() + mmc_get_byte()<<8 );

and

 

return( mmc_get_byte() + mmc_get_byte()*256 );

The one with OR ended up the best.

 

Just wanted to check here that there was not something else I didn't know about to help.  I have learned a LOT about how the C compiler thinks here, but I still know very little.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I would still try with a local variable, and spread it over 2 lines.

that way you should at least avoid the push and pop.  
 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 1

I have learned a LOT about how the C compiler thinks here

That's the problem a compiler don't think wink

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I'm using GCC 4.9.2 with -Os.  I have an almost full Mega644 and am going through C code and trying to save an extra few K to fit in another song now.

That seems pretty sensible.   You can always change to an ATmega1284 if you run out of space.

 

I have no idea what your "song" might be.    Any .WAV or .MP3 is going to use more flash than any AVR could supply.

Adding a microSD or even an external Flash memory will provide extra memory.

 

I can quite understand your desire to minimise program size.    I was expecting you to be struggling with a 2kB or 8kB device that had no bigger brothers.

 

Yes,  you can save a few bytes by altering some function usage.    The general approach is:

1. list functions by size.   look at the greedy ones first.

2. use appropriate width of variables and scope.

3. avoid f-p maths and printf() functions.

4. use smaller algorithms.

 

I suspect that those 4 points would give you a significant saving.   Minor tweaks to one function call will only save tens of bytes.

 

David.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Note that code like this has undefined (or implementation defined at best) behavior:

return( mmc_get_byte() + mmc_get_byte()<<8 );

 

C doesn't specify the order in which the functions are called so the compiler might call the second function first.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

christop wrote:

Note that code like this has undefined (or implementation defined at best) behavior:

return( mmc_get_byte() + mmc_get_byte()<<8 );

C doesn't specify the order in which the functions are called so the compiler might call the second function first.

 

Thanks for the reminder.  I have read this before.

 

sparrow2 wrote:

I would still try with a local variable, and spread it over 2 lines.

that way you should at least avoid the push and pop.  

 

OK - just tried splitting it to two lines.  Slightly different result, but same size.

 

david.prentice wrote:
You can always change to an ATmega1284 if you run out of space.

 

Sadly no.  It is a game for the Uzebox.  The user base already all have Mega644s.  Squeezing the last drop of blood from the stone is the only option.

 

david.prentice wrote:
I have no idea what your "song" might be.    Any .WAV or .MP3 is going to use more flash than any AVR could supply.

Adding a microSD or even an external Flash memory will provide extra memory.

 

It is a "MOD" like song with the notes being synthesised rather than being WAVs. There is only one WAV which is sampled speech of the word "Play"

 

I think I have moved all the "graphics assets" I can to SD card.  Some of them can not be moved to SD card - for example the message saying "SD card not found" :)

 

david.prentice wrote:
1. list functions by size.   look at the greedy ones first.

 

Done.  Partially at least.  The greediest functions are giant 12,000 line long unrolled ASM fucntions, but they exist so the game can be done at all.

 

david.prentice wrote:
2. use appropriate width of variables and scope.

 

I think I have been programming space constrained systems long enough to have this one down pat :)

 

david.prentice wrote:
avoid f-p maths and printf() functions.

 

All fixed point and integer using as few as bits as can get the task done.

 

No printf or atoi() stuff.  My font table is even non-ascii to save time/space printing my BCD and HEX values to screen.

 

david.prentice wrote:
 Minor tweaks to one function call will only save tens of bytes.

 

Have all the usual major tweaks you mention in hand.

 

So far with the minor tweaks like this one - I have saved 400ish bytes.  That is one third of the way to being able to include a 3rd song in the game.

 

Uze is trying to see if he can modify the kernel to make the sampled speech 4 bits rather than 8 bits.  That would save another 800 bytes.

 

So if I keep plodding along I might get there.

 

For reference my memory budget is

 

  • 4096 Bootloader (I can't touch)
  • 16436 the 12 thousand lines of ASM code that draws pixels to the screen
  • 7137 Music assets including the sampled speech
  • 4918 Fast line drawing routines and vector objects
  • 1280 3D point and vector data
  • 780 Object behaviour data
  • 768 trig, scaling and zoom tables
  • 512 font data that can not be stored on SD card
  • 226 Text that can not be stored on SD card
  • 474 free space (and counting)

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

What do you have in the eeprom?

And why does the bootloader need to be that big?

Last Edited: Sat. Aug 1, 2015 - 10:36 PM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Bootloader is written by someone else and contains

 

Video and sound generation (NTSC + mono 15Khz)

SD Card and FatFS reading of games stored on SD card

Menu system to select which new game to play

Flash programming stuff as normal for a bootloader

 

The EEPROM is shared space that all games written for the uzebox have to share.  It is "formatted" and arbitrated by the "kernel" code you have to include/link to your own game.

 

I myself use 64 bytes of the EEPROM for a highscore table.  Its generally considered poor form to use more than 32 bytes, but my game is impressive enough I think it warrants 2 "blocks" :)

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

OK - this one has just moved to a (dot)s file.  Using the knowledge that mmc_get_byte only touches r24, r30, r31 AND the fact that GetInt is a subset of GetLong let me shrink it a bit.

 

.section .text.mmcGetLong
mmcGetLong:
    rcall   mmc_get_byte     ; First byte from SD card straight to R22
    mov     r22, r24
    rcall   mmc_get_byte     ; Second byte from SD card straight to R23
    mov     r23, r24
                             ; Fall through to GetInt: to receive 3rd and 4th bytes
.section .text.mmcGetInt
mmcGetInt:
    rcall   mmc_get_byte     ; First byte from SD card to temp location in R20 (3rd byte of GetLong)
    mov     r20, r24
    rcall   mmc_get_byte     ; Second byte from SD card to temp location in R21 (4th byte of GetLong)
    mov     r21, r24    
    movw    r24, r20         ; Move R20:21 to the R24:25 location C is expecting it
    ret

.section .text.mmcGetChar    ; Simple fall through to get_byte
mmcGetChar:

.section .text.mmc_get_byte
mmc_get_byte:
    rcall   spibyte_ff
    .
    .
    .

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

It is still going to stay in ASM now, because I have made the effort and the ASM is 1/2 the size when considering getInt and getLong together.

 

But it was bugging the hell out of me that I could not get it to just OR two bytes together into a i16 without all the mess.

 

This si the best I have come up with.

 

uint16_t mmcGetInt(void){

union{
    struct {
        uint8_t i1;
        uint8_t i2;
    };
    uint16_t i16;
} ii;

ii.i1 = getbyte();
ii.i2 = getbyte();

return(ii.i16);

}

And that comes out as

 

push    r28
call    0xd900   ; 0xd900 <getbyte>
mov     r28, r24

call    0xd900   ; 0xd900 <getbyte>
mov     r25, r24

mov     r24, r28
pop     r28
ret

It is only one push/pop behind the ASM (considering getInt alone)

 

Does anyone know if there is a way to hint the compiler that "getByte" only touches r24 so it can avoid the push/pop and just use some register pair like R22:23 ?

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Looking at your "budget",   I would start by optimising the 16kB of ASM.    That is clearly the greediest part.

 

Quite honestly,  there is little point in optimising mmcGetInt(void).    It is the calling sequence that matters.     e.g. RCALL mmcGetInt / MOV r24 / MOV r25 i.e. 3 words.

Even that could be reduced to 1 word if the subsequent code "knows" it is going to use r24, r25.

Even if the actual mmcGetInt is in an external file (so the linker wants CALL) you can put in a local trampoline.

 

You are an experienced ASM programmer.    You must know all the standard techniques for reducing code.    You also know that 95% of code space is not speed critical.   It is only the 5% that is used for 90% of the time e.g. inside loops,  common subroutines.

 

David.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

david.prentice wrote:
I would start by optimising the 16kB of ASM.    That is clearly the greediest part.

 

The 16K is not optimizable at all and they whole game hinges on it.  Without it there is no game and the rest of the exersize is pointless.  I am willing to be proven wrong here if anyone thinks they can save any space on it but I think that 16K can't have a single BYTE saved from it.

 

david.prentice wrote:
Quite honestly,  there is little point in optimising mmcGetInt(void).

 

I know any individual routine is never going to save heaps of space.  But when you have done ALL the rest you can think of you have to try them.

 

I have done a lot of optimizing so far that I have not posted questions here about because I figured them out myself.  So far I have now saved 600 bytes from optimizing all the little things like GetInt.  They ADD UPP.

 

I managed to reduce getInt and getLong from 106 bytes (first un optimized C) down to 20 bytes in ASM. (My best C got down to 46 bytes I think)

 

The other nice thing about my "optimizing the little things" like the SD card reading functions here, is that when I push all my little things back into the "kernel" I will save space on everyone else Uzebox game that is written in the future. 

 

david.prentice wrote:
You also know that 95% of code space is not speed critical.

 

This is actually an unusual case that more than %50 of my flash space is taken up with speed critical things.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

And no one picked me up on the obvious saving of another 4 bytes I missed. (I noticed it myself when I was cleaning up comments)

 

.section .text.mmcGetLong
mmcGetLong:
  rcall  mmcGetInt       ; First two bytes from SD card and move to R22:23
  movw   r22, r24
                         ; Fall through to GetInt to receive 3rd and 4th bytes
.section .text.mmcGetInt
mmcGetInt:
  rcall  mmc_get_byte    ; First byte from SD card to temp location in R20 (3rd byte of GetLong)
  mov    r20, r24
  rcall  mmc_get_byte    ; Second byte from SD card to temp location in R21 (4th byte of GetLong)
  mov    r21, r24
  movw   r24, r20        ; Move R20:21 to the R24:25 location C is expecting it
  ret

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

andrewm1973 wrote:
Uze is trying to see if he can modify the kernel to make the sampled speech 4 bits rather than 8 bits.  That would save another 800 bytes.
I've done that for this:

https://www.avrfreaks.net/comment/1628051#comment-1628051

Also supports 2-bit, but it sounds awful ;-)

"Experience is what enables you to recognise a mistake the second time you make it."

"Good judgement comes from experience.  Experience comes from bad judgement."

"Wisdom is always wont to arrive late, and to be a little approximate on first possession."

"When you hear hoofbeats, think horses, not unicorns."

"Fast.  Cheap.  Good.  Pick two."

"We see a lot of arses on handlebars around here." - [J Ekdahl]

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Thanks joey,

 

I am pretty sure that 4 bit will sound fine for the sample as it is played at the same time as 3 other channels of 8 bit generated sound in the music.

 

The problem for Uze will be if he can fit the 3 note channels, the one noise channel and the 4 bit WAV channel into the 130 clocks that HSync runs for.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Should be easily done in assembler.  Currently you have:

8-bit:

LPM: 3 cycles

 

You'll want:

4-bit:

low nibble:

LPM: 3 cycles

MOV: 1 cycle

AND: 1 cycle

high nibble:

SWAP: 1 cycle

AND: 1 cycle

You'd also need to track which nibble you're working on, so some extra cycles would be spent setting, testing, and clearing a flag somewhere.  I assume all the low-hanging fruit like GPIOR0 are already spoken for, but perhaps there's something else available.

"Experience is what enables you to recognise a mistake the second time you make it."

"Good judgement comes from experience.  Experience comes from bad judgement."

"Wisdom is always wont to arrive late, and to be a little approximate on first possession."

"When you hear hoofbeats, think horses, not unicorns."

"Fast.  Cheap.  Good.  Pick two."

"We see a lot of arses on handlebars around here." - [J Ekdahl]

 

Last Edited: Thu. Aug 6, 2015 - 12:00 AM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

<Thread Necromancy>

 

Hi again all,

 

Need help giving the compiler hints again that are human friendly.

 

The following code compiles and works fine.  Just a bit slow.  As I have said previously I like to keep code in C rather than moving it to ASM if I can keep performance within 5% or so.  Unless of course the C is more confusing and less readable than the ASM version.

 

typedef struct {
	uint8_t next;
	uint8_t col;
	uint8_t y1;
	uint8_t x1Lo;
	uint8_t x1;
	int16_t m;
	uint8_t y2;
} GradLineStruct;

void Add_Line(uint8_t col, uint8_t y1, uint8_t x1, int16_t m, uint8_t y2){

	if (linkNext == 255) { return; }                                        // If we have run out of space to add new lines then fail

	if (linkNext != 0) { gradLines[linkNext].next = linkNext + 1;	};      // If we are pointing to array index[0] then don't update "next"
                                                                            //   index[1] is the first valid entry that can be used.
																			
	linkNext++;                                                             // Update the "Next" to point to the current index (it was pointing to last)
	
	gradLines[linkNext].next = 0;                                           // Insert the values
	gradLines[linkNext].col  = col;
	gradLines[linkNext].y1   = y1;
	gradLines[linkNext].x1   = x1;
	gradLines[linkNext].x1Lo = 0;
	gradLines[linkNext].m    = m;
	gradLines[linkNext].y2   = y2;
}

And the LSS is

 

void Add_Line(uint8_t col, uint8_t y1, uint8_t x1, int16_t m, uint8_t y2){
    2a18:	0f 93       	push	r16

	if (linkNext == 255) { return; }                                        // If we have run out of space to add new lines then fail
    2a1a:	a0 91 c7 01 	lds	r26, 0x01C7	; 0x8001c7 <linkNext>
    2a1e:	af 3f       	cpi	r26, 0xFF	; 255
    2a20:	d9 f0       	breq	.+54     	; 0x2a58 <Add_Line+0x40>
    2a22:	e1 e0       	ldi	r30, 0x01	; 1
    2a24:	ea 0f       	add	r30, r26

	if (linkNext != 0) { gradLines[linkNext].next = linkNext + 1;	};      // If we are pointing to array index[0] then don't update "next"
    2a26:	aa 23       	and	r26, r26
    2a28:	39 f0       	breq	.+14     	; 0x2a38 <Add_Line+0x20>
    2a2a:	98 e0       	ldi	r25, 0x08	; 8
    2a2c:	a9 9f       	mul	r26, r25
    2a2e:	d0 01       	movw	r26, r0
    2a30:	11 24       	eor	r1, r1
    2a32:	a0 50       	subi	r26, 0x00	; 0
    2a34:	bc 4f       	sbci	r27, 0xFC	; 252
    2a36:	ec 93       	st	X, r30
                                                                            //   index[1] is the first valid entry that can be used.
																			
	linkNext++;                                                             // Update the "Next" to point to the current index (it was pointing to last)
    2a38:	e0 93 c7 01 	sts	0x01C7, r30	; 0x8001c7 <linkNext>
	
	gradLines[linkNext].next = 0;                                           // Insert the values
    2a3c:	98 e0       	ldi	r25, 0x08	; 8
    2a3e:	e9 9f       	mul	r30, r25
    2a40:	f0 01       	movw	r30, r0
    2a42:	11 24       	eor	r1, r1
    2a44:	e0 50       	subi	r30, 0x00	; 0
    2a46:	fc 4f       	sbci	r31, 0xFC	; 252
    2a48:	10 82       	st	Z, r1
	gradLines[linkNext].col  = col;
    2a4a:	81 83       	std	Z+1, r24	; 0x01
	gradLines[linkNext].y1   = y1;
    2a4c:	62 83       	std	Z+2, r22	; 0x02
	gradLines[linkNext].x1   = x1;
    2a4e:	44 83       	std	Z+4, r20	; 0x04
	gradLines[linkNext].x1Lo = 0;
    2a50:	13 82       	std	Z+3, r1	; 0x03
	gradLines[linkNext].m    = m;
    2a52:	36 83       	std	Z+6, r19	; 0x06
    2a54:	25 83       	std	Z+5, r18	; 0x05
	gradLines[linkNext].y2   = y2;
    2a56:	07 83       	std	Z+7, r16	; 0x07
}
    2a58:	0f 91       	pop	r16
    2a5a:	08 95       	ret

 

I have no idea WHY the PUSH R16 at the start of the routine and the POP R16 at the end OR the superfluous SUBI R31, 0x00 and OER R1,R1 but that's not the question here.

 

If you look at the C code - you can see the two array access locations are sequential.  I have been able to trick the compiler into using the larger indexes to access the second one like this

 

typedef struct {
	uint8_t next;
	uint8_t col;
	uint8_t y1;
	uint8_t x1Lo;
	uint8_t x1;
	int16_t m;
	uint8_t y2;
} GradLineStruct;

typedef struct {
	uint8_t last;
	uint8_t dummy1;
	uint8_t dummy2;
	uint8_t dummy3;
	uint8_t dummy4;
	int16_t dummy5;
	uint8_t dummy6;
	uint8_t next;
	uint8_t col;
	uint8_t y1;
	uint8_t x1Lo;
	uint8_t x1;
	int16_t m;
	uint8_t y2;
} GradLineStructX2;

void Add_Line(uint8_t col, uint8_t y1, uint8_t x1, int16_t m, uint8_t y2){

	GradLineStructX2 *KludgePointer;
	
	if (linkNext == 255) { return; }                                        // If we have run out of space to add new lines then fail

	KludgePointer = (GradLineStructX2*)&gradLines[linkNext];
	
	if (linkNext != 0) { KludgePointer->last = linkNext + 1;	};          // If we are pointing to array index[0] then don't update "next"
                                                                            //   index[1] is the first valid entry that can be used.
																			
	linkNext++;                                                             // Update the "Next" to point to the current index (it was pointing to last)
	
	KludgePointer->next = 0;                                                // Insert the values
	KludgePointer->col  = col;
	KludgePointer->y1   = y1;
	KludgePointer->x1   = x1;
	KludgePointer->x1Lo = 0;
	KludgePointer->m    = m;
	KludgePointer->y2   = y2;
}

Which compiles to this

 

void Add_Line(uint8_t col, uint8_t y1, uint8_t x1, int16_t m, uint8_t y2){
    2a18:	0f 93       	push	r16
    2a1a:	d9 01       	movw	r26, r18

	GradLineStructX2 *KludgePointer;
	
	if (linkNext == 255) { return; }                                        // If we have run out of space to add new lines then fail
    2a1c:	90 91 c7 01 	lds	r25, 0x01C7	; 0x8001c7 <linkNext>
    2a20:	9f 3f       	cpi	r25, 0xFF	; 255
    2a22:	a1 f0       	breq	.+40     	; 0x2a4c <Add_Line+0x34>

	KludgePointer = (GradLineStructX2*)&gradLines[linkNext];
    2a24:	28 e0       	ldi	r18, 0x08	; 8
    2a26:	92 9f       	mul	r25, r18
    2a28:	f0 01       	movw	r30, r0
    2a2a:	11 24       	eor	r1, r1
    2a2c:	e0 50       	subi	r30, 0x00	; 0
    2a2e:	fc 4f       	sbci	r31, 0xFC	; 252
    2a30:	31 e0       	ldi	r19, 0x01	; 1
    2a32:	39 0f       	add	r19, r25
	
	if (linkNext != 0) { KludgePointer->last = linkNext + 1;	};              // If we are pointing to array index[0] then don't update "next"
    2a34:	91 11       	cpse	r25, r1
    2a36:	30 83       	st	Z, r19
                                                                            //   index[1] is the first valid entry that can be used.
																			
	linkNext++;                                                             // Update the "Next" to point to the current index (it was pointing to last)
    2a38:	30 93 c7 01 	sts	0x01C7, r19	; 0x8001c7 <linkNext>
	
	KludgePointer->next = 0;                                           // Insert the values
    2a3c:	10 86       	std	Z+8, r1	; 0x08
	KludgePointer->col  = col;
    2a3e:	81 87       	std	Z+9, r24	; 0x09
	KludgePointer->y1   = y1;
    2a40:	62 87       	std	Z+10, r22	; 0x0a
	KludgePointer->x1   = x1;
    2a42:	44 87       	std	Z+12, r20	; 0x0c
	KludgePointer->x1Lo = 0;
    2a44:	13 86       	std	Z+11, r1	; 0x0b
	KludgePointer->m    = m;
    2a46:	b6 87       	std	Z+14, r27	; 0x0e
    2a48:	a5 87       	std	Z+13, r26	; 0x0d
	KludgePointer->y2   = y2;
    2a4a:	07 87       	std	Z+15, r16	; 0x0f
}
    2a4c:	0f 91       	pop	r16
    2a4e:	08 95       	ret

Now it still has the superfluous PUSH/POP/SUBI - but has gotten down in size by 10%.

 

But it does look kludgy.

 

Can a Compiler+ASM guru show me a better way to make the 2nd array access not do the expensive re-calculate of the array

?

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

BTW - here is my ASM version

; void Add_Line(uint8_t col, uint8_t y1, uint8_t x1, int16_t m, uint8_t y2){
;
;   if (linkNext == 255) { return; }                                        // If we have run out of space to add new lines then fail
;
;   if (linkNext != 0) { gradLines[linkNext].next = linkNext + 1;	};      // If we are pointing to array index[0] then don't update "next"
;                                                                           //   index[1] is the first valid entry that can be used.
;
;   linkNext++;                                                             // Update the "Next" to point to the current index (it was pointing to last)
;
;   gradLines[linkNext].next = 0;                                           // Insert the values
;   gradLines[linkNext].col  = col;
;   gradLines[linkNext].y1   = y1;
;   gradLines[linkNext].x1   = x1;
;   gradLines[linkNext].x1Lo = 0;
;   gradLines[linkNext].m    = m;
;   gradLines[linkNext].y2   = y2;
; }
;
; Notes:
;   The  "if (linkNext == 255) { return; }"  test is done quite late.  After address conversion and linkNext++.  This makes that
;   case slower than could be if the comparison was done early.  This however would make the more common and important case
;   when linkNext != 255 one clock slower.  You don't really care how slow the ==255 case is as you have run out of memory to
;   add more lines at thise stage anyway.
	
AddLine:	
	lds     r23, linkNext              ; Fetch the global variable "linkNext" that is the current head pointer to gradient_lines list
	ldi     r21, 0x08
	mul     r23, r21                   ; Multiply "linkNext" by sizeof() the structure for the address conversion step
	movw    R30, R0                    ; Move the result of the MUL into Z early so we can restore r1 to <zero>
	eor     r1, r1                     ; r1 = 0
	inc     r23                        ; INC "linkNext".  The result of this is to be stored AND used for the ==255 check
	breq    Add_Line_Exit              ; If "linkNext" was ==255 prior to the above INC then the result would be ZERO so we return()
	sts     linkNext, r23              ; Save the value of the global variable "linkNext"
	subi    r31, 0xFC                  ; Finish of the array
address conversion by adding the base address of the array ; only the high byte needs to be added as the array is 256 byte aligned. ldi r21, 0x01 cpse r23, r21 ; Compare "linkNext + 1" to 0x01. If this is true then the value of "linkNext" before the ; inc was 0x00 and we should skip over the next instruction st Z, r23 ; array[linkNext].next = linkNext + 1 std Z+8, r1 ; array[linkNext + 1].next = 0 std Z+9, r24 ; array[linkNext + 1].col = col std Z+10, r22 ; array[linkNext + 1].y1 = y1 std Z+11, r1 ; array[linkNext + 1].x1lo = 0 std Z+12, r20 ; array[linkNext + 1].x1 = x1 std Z+13, r18 ; array[linkNext + 1].m = m std Z+14, r19 ; 2nd byte of 16 bit value m std Z+15, r16 ; array[linkNext + 1].y2 = y2 Add_Line_Exit: ret

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

for the hole picture:

you pass something close to the structure  <GradLineStruct> that must generate a lot of code before the call !

 

so perhaps pass a pointer to GradLineStruct variable that hold the values you want!

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

sparrow2 - thanks for the reply.

 

This function is the most likely delineation between what would be considered "user code" and "Kernel code"

 

At a lower level than this I have code that sorts the polygons into a Y sorted list, converts that list of lines in the form Y= mX+c, uses that Y sorted list to create a Run Length Encoded RAM buffer while doing some interrupt ended code that "races the beam"

 

I want to hide all that incredibly complex stuff from anyone wanting to use this RLE mode for writing their own game. 

 

So "AddPolyLine" is the point at which I stop being responsible for data structures and how the code works and how the user wants to write their code.  Think of it as an API call.

 

I have no control, nor do I want to impose upon the user, how they are going to store/treat col, y1, x1, m and y2.

 

The video mode itself is not as amazing as T2K, but has the potential to be the 2nd most amazing video output code done on the AVR if I have not made any mistakes on my spreadsheet.

 

(Help with optimizing stuff for T2K was probably how this thread started BTW)

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

The reason for the PUSH/POP of R16 is because of this..

 

https://gcc.gnu.org/wiki/avr-gcc...

 

The ABI dictates that R2-R17 must be saved. Usually the compiler sticks to the ...

 

https://gcc.gnu.org/wiki/avr-gcc...

 

But if the code requires it to use SO many registers it will start to use call saved and then, as you see, it has to save them.

 

If you pass in a struct rather than all the individual dimensions you will reduce the register usage.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Clawson,

 

I am well aware of call-saved and call-clobbered.

 

R16 is not clobbered at all.  It is passed in.  Pushed to the stack.  Used in an Store and then popped off the stack.  A pointless waste of a push and pop because it is not altered between the push and pop.

 

Kind of like the first EOR r1,r1 is pointless because r1 is not used as <zero> between then and its next trashing by the second MUL.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Yes but the compiler will just follow internal rules such as "if R2-R17 used then PUSH/POP"

 

BTW the compiler is an open project, all improvements/fixes welcome from all who paid to use it.

 

(Oh, wait a minute, no one actually pays do they?)

Last Edited: Sun. Apr 22, 2018 - 11:08 AM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

So, r16 gets the y2 variable due to the calling convention. This variable is not written to, but still the compiler sees fit to save it.

Sure , the compiler is being dumb here, nothing I've never seen before.

 

I don't see any elegant way to solve this. Just an inelegant  way, which is to fuse 2 uint8_t variables into a uint16_t, so that r16 is not used, y2 will go to r18 instead which is not call saved.

 

void Add_Line(uint8_t col, uint16_t y1_x1, int16_t m, uint8_t y2){
    ...
    gradLines[linkNext].y1   = y1_x1 >> 8;
    gradLines[linkNext].x1   = y1_x1 | 0xFF;
    ...
}

 

 

Then you would have to call it with

Add_Line(col, y1<<8 | x1, m, y2);

 

Ugly, right? But maybe in this case the compiler will be able to avoid the push/pop.

 

edit: you can make it less ugly with a macro, I guess.

 

#define ADD_LINE(col,y1,x1,m,y2) Add_Line(col, y1<<8 | x1, m, y2)

 

edit 2: and when I say the compiler is being dumb, I don't intend to aim it at the developers. GCC is immensely complex and sometimes, there is no easy way to fix things like this. In fact, those guys are heroes in my opinion, I fear the day when SprinterSB can/will no longer develop avr-gcc...

Last Edited: Sun. Apr 22, 2018 - 11:32 AM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

El Tangas,

 

Thanks or the suggestion.  I can live with the push pop.  I was grumbling about it, but it was not the actual question I asked.  That was a derailment and its my own fault as I should learn to ignore posts of negative usefulness.

 

The original question was if there is a more elegant way to get the compiler to access two sequential elements of an array with compile time maths (adding to the displacement).

 

I worked out a kludgy way but I would like to know if there is a better way.  This is about self improvement and me trying to make myself a better C programmer.

 

Sprinter, Sparrow, Skeeve, Joey, Westf and several others have been very helpful in the past.  As I have learnt more things from them I have had to resort to ASM less often in the games/projects I am doing.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Ok, I see the problem. I don't think I have a better solution, maybe I would just write it differently:

 

typedef struct {
    GradLineStruct last_line;
    GradLineStruct next_line;
} GradLineStructX2;
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Logical pairs of 1-byte variables can be placed in a struct without too much kludginess.

With macros and gcc's ({...}) mechanism

one could define a syntax without too much user pain.

"Demons after money.
Whatever happened to the still beating heart of a virgin?
No one has any standards anymore." -- Giles