GCC vs ASM code size - Help me close the gap

Go To Last Post
104 posts / 0 new

Pages

Author
Message
#1
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Now - before anyone claims this is a set up - you can look on the uzebox forums. This code was posted on the 27th Jan which was 2 days before I first spoke on the lazarus troll thread.

Also I HAVE asked here several times before for help in writing better C so the compiler knows how to optimize what I am doing. So this is in form with what I have asked in the past.

The troll thread DID prompt me to ask this question though. I am curious if GCC can get close and if the way I am writing the argument is sub optimal.

Again - I do write C code. I like to write C code. I was formally taught to write C code in a university environment. All that said I know I am a pretty average C coder.

void SetRamTile(uint8_t x, uint8_t y, uint8_t tile)
{
  x = x & 0x1f;
  y = y & 0x1f;
  vram[((y>>3)*256)+x<<3+(y&7)] = tile;
}

becomes

226:      {
+000060BB:   716F        ANDI      R22,0x1F       Logical AND with immediate
229:        vram[((y>>3)*256)+x<<3+(y&7)] = tile;
+000060BC:   2FF6        MOV       R31,R22        Copy register
+000060BD:   95F6        LSR       R31            Logical shift right
+000060BE:   95F6        LSR       R31            Logical shift right
+000060BF:   95F6        LSR       R31            Logical shift right
+000060C0:   E0E0        LDI       R30,0x00       Load immediate
+000060C1:   718F        ANDI      R24,0x1F       Logical AND with immediate
+000060C2:   0FE8        ADD       R30,R24        Add without carry
+000060C3:   1DF1        ADC       R31,R1         Add with carry
+000060C4:   E070        LDI       R23,0x00       Load immediate
+000060C5:   7067        ANDI      R22,0x07       Logical AND with immediate
+000060C6:   7070        ANDI      R23,0x00       Logical AND with immediate
+000060C7:   5F6D        SUBI      R22,0xFD       Subtract immediate
+000060C8:   4F7F        SBCI      R23,0xFF       Subtract immediate with carry
+000060C9:   C002        RJMP      PC+0x0003      Relative jump
+000060CA:   0FEE        LSL       R30            Logical Shift Left
+000060CB:   1FFF        ROL       R31            Rotate Left Through Carry
+000060CC:   956A        DEC       R22            Decrement
+000060CD:   F7E2        BRPL      PC-0x03        Branch if plus
+000060CE:   5EE0        SUBI      R30,0xE0       Subtract immediate
+000060CF:   4FFE        SBCI      R31,0xFE       Subtract immediate with carry
+000060D0:   8340        STD       Z+0,R20        Store indirect with displacement
230:      }
+000060D1:   9508        RET                      Subroutine return

Now I have tried replacing the

(y>>3)*256

part with several other variations including a struct/union that has 2x uint8 sharing with a uint16. None of them ended up better, only different.

In fact the original version was

vram[((y>>3)*256)+8*x+(y&7)]

which I changed the (8*x) to (X<<3) and that did yield an improvement.

Anyways, the ASM code that gets the same result is

;***********************************
; SET TILE 8bit mode
; C-callable
; r24=X pos (8 bit)
; r22=Y pos (8 bit)
; r20=Tile No (8 bit)
;************************************
.section .text.SetTile
SetTile:

#if SCROLLING == 1
;index formula is vram[((y>>3)*256)+8x+(y&7)]

;                         r22               r24               XH               XL                T
                        ; y7y6y5y4y3y2y1y0  x7x6x5x4x3x2x1x0  - - - - - - - -  - - - - - - - - 

lsl   r24               ; y7y6y5y4y3y2y1y0  x6x5x4x3x2x1x0 0
lsl   r24               ; y7y6y5y4y3y2y1y0  x5x4x3x2x1x0 0 0
lsl   r24               ; y7y6y5y4y3y2y1y0  x4x3x2x1x0 0 0 0

mov   XH,r22            ;                                     y7y6y5y4y3y2y1y0 
lsl   XH                ;                                     y6y5y4y3y2y1y0 0
swap  XH                ;                                     y2y1y0 0y6y5y4y3
andi  XH,3              ;                                     0 0 0 0 0 0 y4y3

andi  r22, 7            ; 0 0 0 0 0 y2y1y0
or    r24, r22          ;                   x4x3x2x1x0y2y1y0
mov   XL, r24           ;                                     0 0 0 0 0 0 y4y3 x4x3x2x1x0y2y1y0
subi  XL,lo8(-(vram))
sbci  XH,hi8(-(vram))

st   X,r20

ret

#else

Now I would be surprised if any compiler did the LSL/SWAP bit as that normally would not save you any time if not for the fact that bits5..7 are don't cares.

But is there a way I can hint the compiler to get it better than it is?

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Quote:
x = x & 0x1f;
y = y & 0x1f;

No fair; your assembly code doesn't do that (having noticed that it's unneeded.

But yeah; if you look at the code, and follow some of the C discussions, you'll notice that some of the inefficiencies are due to this rule that C has about making intermediate results be "int" in complex expressions. So by eliminating complex expressions:

void SetRamTile2(uint8_t x, uint8_t y, uint8_t tile)
{
//  x &= 0x1F;
//  y &= 0x1F;
  uint8_t xt = x<<3;
  uint8_t yt = y>>3;
  xt = xt + (y & 7);
  vram[(yt*256)+xt] = tile;
} 

You get more compact code:

00000000 <_Z11SetRamTile2hhh>:
   0:   88 0f           add     r24, r24
   2:   88 0f           add     r24, r24
   4:   88 0f           add     r24, r24   ;; x<<3
   6:   26 2f           mov     r18, r22
   8:   27 70           andi    r18, 0x07   ;; y & 7
   a:   28 0f           add     r18, r24   ;; y&7 + x<<3
   c:   96 2f           mov     r25, r22  ;; y again
   e:   96 95           lsr     r25
  10:   96 95           lsr     r25
  12:   96 95           lsr     r25    ;; y >> 3
  14:   80 e0           ldi     r24, 0x00  ;; *256
  16:   82 0f           add     r24, r18 ;; add (now 16b)
  18:   91 1d           adc     r25, r1
  1a:   e0 91 00 00     lds     r30, 0x0000  ; vram
  1e:   f0 91 00 00     lds     r31, 0x0000
  22:   e8 0f           add     r30, r24
  24:   f9 1f           adc     r31, r25
  26:   40 83           st      Z, r20
  28:   08 95           ret
Last Edited: Mon. Feb 3, 2014 - 02:59 AM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Westfw,

Thanks for that. It makes sense why my all my attempts to help the compiler out failed - knowing it was doing int anyways.

BTW - I would still need to keep the

y &= 0x1F;

at least.

It is there for safety. The end user _should_ not pass in X or Y values greater than 31. If they do though - then some random bit of RAM is going to get trashed.

The ASM version I posted will always leave XH as

0 0 0 0 0 0 Y4Y3

The C code there will leave XH as

0 0 0 y7y6y5y4y3

Thats gotten it down from 23 to 20 words and a LOT of clock cycles faster (I'm not going to add them up)

Down from 40% smaller to 30% smaller and even bigger gap closing in the speed stakes.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Seems like you're imposing a mathematical approach when you've acknowledged that at its root it's really a bit mapping problem:

0 0 0 0 0 0 y4y3 x4x3x2x1x0y2y1y0

So why hobble the compiler with the math?

void SetRamTile(uint8_t x, uint8_t y, uint8_t tile) {
  uint8_t hi = (y>>3) & 3;
  uint8_t lo = (x<<3) | (y&7);
  vram[(hi<<8) | lo] = tile;
}
00000000 :
  vram[(yt*256) + xt] = tile;
}
#else
void SetRamTile(uint8_t x, uint8_t y, uint8_t tile) {
  uint8_t hi = (y>>3) & 3;
  uint8_t lo = (x<<3) | (y&7);
   0:   96 2f           mov     r25, r22
   2:   97 70           andi    r25, 0x07       ; 7
   4:   88 0f           add     r24, r24
   6:   88 0f           add     r24, r24
   8:   88 0f           add     r24, r24
   a:   89 2b           or      r24, r25
  yt = (y>>3) & 3;
  vram[(yt*256) + xt] = tile;
}
#else
void SetRamTile(uint8_t x, uint8_t y, uint8_t tile) {
  uint8_t hi = (y>>3) & 3;
   c:   66 95           lsr     r22
   e:   66 95           lsr     r22
  10:   66 95           lsr     r22
  12:   63 70           andi    r22, 0x03       ; 3
  uint8_t lo = (x<<3) | (y&7);
  vram[(hi<<8) | lo] = tile;
  14:   e8 2f           mov     r30, r24
  16:   f0 e0           ldi     r31, 0x00       ; 0
  18:   f6 2b           or      r31, r22
  1a:   e0 50           subi    r30, 0x00       ; 0
  1c:   f0 40           sbci    r31, 0x00       ; 0
  1e:   40 83           st      Z, r20
  20:   08 95           ret

andrewm1973 wrote:
Thats gotten it down from 23 to 20 words and a LOT of clock cycles faster (I'm not going to add them up)
Your assembler version takes 14 words and 18 cycles, the above C takes 17 words and 21 cycles, both including the ret.

So C is 3 words/cycles larger/slower, or about 21% larger and 17% slower, than your assembler.

Where do those 3 come from?

- The C version uses ldi/or an where a mov would do, making for 1 wasted word/cycle due to the 16-bit nature of the final expression. Type punning could likely get rid of that, although that's a bad idea of course.

- There's an unnecessary mov to r25 when it could have gone to r30 straight away.

- The lsl/swap technique buys you one cycle.

I've seen the compiler do some pretty clever things with swap, not sure why it's not doing it here.

JJ

"Experience is what enables you to recognise a mistake the second time you make it."

"Good judgement comes from experience.  Experience comes from bad judgement."

"Wisdom is always wont to arrive late, and to be a little approximate on first possession."

"When you hear hoofbeats, think horses, not unicorns."

"Fast.  Cheap.  Good.  Pick two."

"We see a lot of arses on handlebars around here." - [J Ekdahl]

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Thank you also JJ.

Quote:
So why hobble the compiler with the math?

((y>>3)*256)+8x+(y&7)

was what I was given by someone else.

I instantly realised it was a bit mapping problem when I converted it to ASM.

I did experiment in the C though and I did see a difference between *8 and <<3 in the long complex equation.

Before I knew that breaking it into smaller parts helped I did try both *256 and <<8. Both where equally bad when it was treating it as an int16.

So it ended up as

((y>>3)*256)+x<<3+(y&7)

rather than

((y>>3)*256)+8*x+(y&7)

Quote:
I've seen the compiler do some pretty clever things with swap, not sure why it's not doing it here

The SWAP/LSL only saves the one clock because the higher bits are don't-cares. I didn't think the compiler would get that even after you guys have shown me where I was going wrong.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

andrewm1973 wrote:
I instantly realised it was a bit mapping problem when I converted it to ASM.
You performed two independent tasks:
- redefining the problem into more suitable terms
- (re)writing the solution in another language

These are actually two separate activities. You can perform one without performing the other. Performing both and pointing at one language's success over another is stacking the deck (I know that's not what you're saying... Thank you by the way for forking the debate... It was getting a bit hot in there :) )

While thinking about and programming solutions to problems is invariably (and to some degree inextricably) related to the choice of language, the best approach in one language is generally the best approach in another.

Even more so when the two languages share so much in common. C has been referred to as a collection of fancy assembler macros. If memory serves, that statement has been made both by ASM and C programmers. It has also been refuted by both.

While I see the reasoning behind the statement (and refutations), I think it underscores the root problem: Humans don't know how to think 'outside the box'. Yes, that's a broad statement. Intentionally made for it's irony ;) (ask Alanis).

Too often it seems these debates (like the one from which you forked) miss the point before they start. Never mind the language. What's the best solution to the problem? Indeed, what is the problem?

Once you have an idea how best to tackle the problem, asking yourself in which language the solution should be implemented is often almost as irrelevant as asking what colour hammer you should use to pound in this nail. I like the blue ones.

Granted that's an oversimplification. And there are real differences that can't be ignored. Lisp isn't great for expressing procedural solutions, C can't access the carry bit, and ASM has no notion of type. But doesn't that just mean that when I want to put in an upholstery tack I use an upholsterer's hammer, and when I want to put in a finishing nail I use a trim hammer and a nail set?

Too late to say I don't want to get too philosophical... but I think if people spent more time framing the problem and less time implementing the solution there would be fewer flame wars (and real ones?).

Quote:
I did experiment in the C though and I did see a difference between *8 and <<3
Why did you stop there? You saw an improvement, and you'd already "instantly realised it was a bit mapping problem" when you wrote the ASM.

This is a pretty clear example of how a language (programming and human) or our perception of it can sometimes blind us. I'm just as guilty, by the way ;)

I have now done perhaps exactly what you'd hoped to avoid: brought your thread back around to the 'debate' instead of dealing with 'the problem' and discussing the solutions. If you'd rather I had not done that, let me know and I'll carve the philosophy out and post it into the original troll-thread. I think I'll use one of these...

Quote:
I didn't think the compiler would get that even after you guys have shown me where I was going wrong.
So give it the hint:
  uint8_t hi = y<<1;
  hi = (hi>>4) | (hi<<4);
  hi = hi&3;
  uint8_t hi = y<<1;
   0:   e6 2f           mov     r30, r22
   2:   ee 0f           add     r30, r30
  hi = (hi>>4) | (hi<<4);
   4:   e2 95           swap    r30
  hi = hi&3;
   6:   e3 70           andi    r30, 0x03       ; 3

I'm still pretty sure I've seen it do this on it's own... but of course I can't make it happen now :?

In any case this doesn't result in a smaller/faster function, the compiler does an additional mov later that nullifies the gain. I haven't dug any deeper to see if it can be dispatched.

JJ

"Experience is what enables you to recognise a mistake the second time you make it."

"Good judgement comes from experience.  Experience comes from bad judgement."

"Wisdom is always wont to arrive late, and to be a little approximate on first possession."

"When you hear hoofbeats, think horses, not unicorns."

"Fast.  Cheap.  Good.  Pick two."

"We see a lot of arses on handlebars around here." - [J Ekdahl]

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Quote:

BTW - I would still need to keep the

Code:
y &= 0x1F;

at least.

It is there for safety.


Just a side note, because it often comes up in this type of discussion:

I certainly won't disagree to write protective/robust code. But it is often found, as noted above, that when we get "you can't do this in C; see this much shorter ASM sequence" that the same problem is not being solved.

When working on the ASM sequence, the guru that thinks in machine code is going to use brain processes "... this reg can only be this big a value at this time so I can now apply the next thing in an unsigned manner and I know it is only a few bits and won't overflow and ...". The compiler needs to best do this by applying rules and heuristics and such.

So indeed when you have an example such as yours, and it is an "important" piece (as yours is) then you have to use the same processes in your head to give the compiler a hint, or tell it to override a "rule" or two.

Referring back to the other thread, using this type of process has gotten C "close". Not e.g. x2.

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

This is what I got as a fairly straight-forward hand assembly of the C code.

#define x  R24
#define y  R22
#define tile R20
    .global SetTile
SetTile:
    ANDI x, 0x1F
    ANDI y, 0x1F
    LDI  R30, 7
    AND  R30, y
    LSL x
    LSL x
    LSL x
    ADD  R30, x  ; OR would have same result
    LSR y
    LSR y
    LSR y
    LDI R31, hi8(vram)
    ADDI R30, lo8(vram)
    ADC  R31, y
    ST Z, tile
    RET

16 words and 20 cycles counting the return.
I could have saved a cycle and a word using swap.
OTOH my selection of registers targets might be regarded as mildly clever.
OYAH I didn't use any information not given the compiler.

"Demons after money.
Whatever happened to the still beating heart of a virgin?
No one has any standards anymore." -- Giles

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

joeymorin wrote:

You performed two independent tasks:
- redefining the problem into more suitable terms
- (re)writing the solution in another language

Yes. Although in my head and to me that always looks like one task. I am guilty of not spotting them as two tasks because of my own familiarity with working with bits.

I guess a bit of my problem there also is that I am expecting too much of the compiler. It shows I don't know enough about compilers to write my own even though I can write quite acceptable ASM code :)

I though that the compiler would know that bits 5,6 and 7 are don't cares because of the "AND 0x1F" the same way I just know it.

joeymorin wrote:
Quote:

I did experiment in the C though and I did see a difference between *8 and <<3

Why did you stop there? You saw an improvement, and you'd already "instantly realised it was a bit mapping problem" when you wrote the ASM.

When I tried <<8 vs *256 the compiled code got no smaller and got more confusing. Now I know this was an artifact of doing it all in one line rather than breaking up the lines. I think I probably would have gotten pretty close in the end if I had of not choosen the wrong path (complex equation) to start with.

joeymorin wrote:

hi = (hi>>4) | (hi<<4);

For <<4 or >>4 swap is always an option. When you have to shift by 3 - swap/lsl or lsr won't always win.

theusc wrote:

So indeed when you have an example such as yours, and it is an "important" piece (as yours is) then you have to use the same processes in your head to give the compiler a hint, or tell it to override a "rule" or two.

Well to me it is not that important code. It is part of the Uzebox library for doing tile based games like mario/zelda and I have always thought those games a bit naff.

It _IS_ library code though so it deserves to be as fast as possible.

It does demonstrate a point along lines similar to what joeymorin made above.

Uze/Alec wrote the original ASM code used in the library. His ASM was faster than his C.

I rewrote the thing in ASM based off his C. My ASM was faster than his ASM.

People here that know more about the compiler than I do have rewritten the same function in C.

The new C code from here is not as fast as the minimal ASM, but it is exactly the same (21 cycles) as the ASM code that was in the library.

Thank you again all. I am learning to think like the compiler a bit better.

If no one minds I may continue to post in this thread other examples of C vs ASM I have done recently. It can help prove how bad I am with AVR-C but more importantly it might be a good reference to point future C vs ASM flame wars at (as long as we can manage to keep this thread flame free)

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

andrewm1973 wrote:
I guess a bit of my problem there also is that I am expecting too much of the compiler.
Perhaps. A (any) compiler is after all merely (rather, impressively) a collection of rules and when to apply them. A human (or several) had to come up with those rules, decide when to apply them, and write code to implement those rules and whens. A daunting task. I'm often surprised at how much it gets right.

In this respect a compiler is a knowledge based expert system built upon the experience of other programmers, most of whom are experts at C and other languages, including assembly, often on multiple platforms. Thanks to their efforts you may wield an impressively destructive hammer with relative ease ;)

We should remember though that we have only been talking about AVR GCC, itself based on the multi-platform, multi-target, multi-language GCC. A generalist in a room full of specialists.

I'd be interested to see how other compilers deal with the various versions of your code. I have no others on my machine, but I expect some would fair better, some would fair worse.

As is, GCC is only 3 cycles away from the currently best ASM solution offered, and some or all of them may yet disappear if an actual GCC guru chimes in. The lsl/swap technique is reachable with the right hints (as shown above), if only the later (and likely unrelated) mov can be suppressed. It might be a simple matter of the right command-line option to the compiler.

Quote:
I though that the compiler would know that bits 5,6 and 7 are don't cares because of the "AND 0x1F" the same way I just know it.
So did I. I know I've seen something similarly clever, but can't conjure it up.

Quote:
I am learning to think like the compiler a bit better.
That is a colourful crayon in anyone's box, but a case could be made that you should't have to think like the compiler. HLLs are meant to free us from such restrictions. On embedded systems, certainly bare-metal systems like AVR8, we may never be able to fully enjoy that freedom. That doesn't bother me ;)

Quote:
as long as we can manage to keep this thread flame free)
I promise :)

"Experience is what enables you to recognise a mistake the second time you make it."

"Good judgement comes from experience.  Experience comes from bad judgement."

"Wisdom is always wont to arrive late, and to be a little approximate on first possession."

"When you hear hoofbeats, think horses, not unicorns."

"Fast.  Cheap.  Good.  Pick two."

"We see a lot of arses on handlebars around here." - [J Ekdahl]

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

joeymorin wrote:
andrewm1973 wrote:
as long as we can manage to keep this thread flame free)
I promise :)

OK - well here is my next question to the smart C people in the "closing the gap" series.

The ASM detractors will love this one. They can say "Look at all the effort the ASM monkey had to go to to save one clock cycle"

The previous routine to get an X/Y tile was from Uze/Alecs original library code for a TILE video mode.

This next bit is from the video mode that was modified to get Asteroids going.

One of the things done for that video mode was to locate the VRAM array at a specific location.

in the Makefile

LDFLAGS += -Wl,--section-start=.ramtiles=0x00800400
LDFLAGS += -Wl,--section-start=.vram=0x00800C00

declaring the variables

unsigned char ramTiles[256*8]  __attribute__ ((section (".ramtiles")));
unsigned char vram[32*28] __attribute__ ((section (".vram")));

and here is my macro that converts an X/Y pixel location to a X/Y tile location.

; fast_line_convert_x0_y0_into_VRAM_address
;
; converts the X0 and Y0 address passed in r24 and r22 into a VRAM memory
; location and leaves this result in R26:27 (VRAM_Address)
;
; Inputs 
;           Y0 address = R22
;           X0 address = R24
; Outputs
;           VRAM_Address = R26:27 (X)
; 
; Requires that the constants 4 and 32 are in R28, R29
;
; Trashes R0:1

.macro fast_line_convert_x0_y0_into_VRAM_address
    mul  r22, r28           ; Multiply Y0 by 4   y7y6y5y4y3y2y1y0      => .0.0.0.0.0.0y7y6:y5y4y3y2y1y0.0.0
    movw r26, r0            ; move the 16 bit result into VRAM_Address
    andi r26, 0b11100000    ; clear out the bits that are used for Xn     .0.0.0.0.0.0y7y6:y5y4y3.0.0.0.0.0
    mul  r24, r29           ; Multiply X0 by 32  x7x6x5x4x3x2x1x0      => .0.0.0x7x6x5x4x3:x2x1x0.0.0.0.0.0
    or   r26, r1            ; OR X7..3 into low byte of VRAM_Address      .0.0.0.0.0.0y7y6:y5y4y3x7x6x5x4x3
    subi r27, hi8(-(vram))  ; Add the base address of VRAM                .0.0.0.0.1.1y7y6:y5y4y3x7x6x5x4x3
.endm

This is effectively

VRAM[(((Y&0xF8)>>3)<<8|(X>>3)]

Not asking about optimizing C bit manipulations here. You guys have shown me the light there.

This is all about the line

    subi r27, hi8(-(vram))  ; Add the base address of VRAM                .0.0.0.0.1.1y7y6:y5y4y3x7x6x5x4x3

Because I placed VRAM where I wanted it on a 10 bit boundary. I can add the base address of VRAM to the calculated array address with just one subi. This is as opposed to two subtracts as the previous example.

subi  XL,lo8(-(vram)) 
sbci  XH,hi8(-(vram))

So now the obvious question. How do I hint/convince/trick GCC into making

VRAM[Address]

only bother to add/sub the top 8 bits?

It is again something I thought C would optimize away itself automagically. The base address is 0x0c00. The bottom 8 bits is 0x00. That is a constant. It should not need to add 0x00 to something to get the result.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Quote:

It is again something I thought C would optimize away itself automagically

But the C compiler doesn't know the address. That is only resolved when the linker operates. Now it used to be the case that all the optimisation happened at compile time but recently the linker got Link Time Optimisation (LTO) so you may want to explore what this offers.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

clawson wrote:
But the C compiler doesn't know the address. That is only resolved when the linker operates.
Quote:

I get in the normal situation the compiler doesn't doesn't know the address so outputs symbols and the linker joins the dots.

I thought that I was telling linker and the compiler what address to use up front though.

Does this mean there is no way to get the compiler to take advantage of the memory alignments I have done?

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Quote:

I thought that I was telling linker and the compiler what address to use up front though.

Why would the compiler be interested?
Quote:

Does this mean there is no way to get the compiler to take advantage of the memory alignments I have done?

What did your exploration of LTO turn up? I don't know if it will help or not as I don't know what link time optimisations are done for AVR but it is probably the best hope.

I guess the LTO journey probably starts here:

http://gcc.gnu.org/wiki/LinkTime...

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Quote:
What did your exploration of LTO turn up?

I read the page you linked and at least two levels deep linking away from any blue text on that page.

I understood nearly none of it.

Remember I am the person that isn't smart enough to know that writing

unsigned char vram[32*28] __attribute__ ((section (".vram")));

In the source file IS NOT actually telling the compiler anything.

I shan't give up though. I will try google some of the words I didn't understand on that wiki page and see if I can't bang some knowledge into my head. I might need to borrow one of joeys hammers.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Quote:

In the source file IS NOT actually telling the compiler anything.

It is telling the compiler something. It is saying "when you create this object" add a .section directive to the Asm source file - nothing more, nothing less.

If I use that very line then the Asm generated is:

.global	vram

	.section	.vram,"aw",@progbits
	.type	vram, @object
	.size	vram, 896
vram:
	.zero	896

If I remove the section attribute from the C source then the Asm code is:

	.comm	vram,896,1

As you can see neither conveys any information about alignment or position in memory. That decision doesn't come until later when either the linker script or a -section-start tells the linker where this will be placed. The C compiler has already generated Asm source to access it by this point (assuming you wrote some C to do so). Now there is a chance that LTO may recognise some aspect of the positioning and access that allows it to make assumptions about index register loads or whatever. Then again LTO is principally for the gcc "big iron" and it's also fairly new so it may not do a great job for AVR just yet until an AVR linker expert works on it a bit.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

How many places call this code?
If not to many force the compiler to make it inline, then you will save at least 7 clk.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0
unsigned char *vram2=(unsigned char*)(0xFF00 & (int)vram);

"Demons after money.
Whatever happened to the still beating heart of a virgin?
No one has any standards anymore." -- Giles

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

If the need for speed is great enough,
one might convert the function into a macro.
If that is not good enough, one might dispense with optimizing single tile settings and optimize loops instead.

for(unsigned char y=...) {
    // y in 0..0x1F
    for(unsigned char x=...) {
        tile=...;
        vram[(y>>3)*0x100 + (y&7) + (unsigned char)(x<<3)]=tile;
    }  // x
} // y


for(unsigned char x=...) {
    for(unsigned char yhi=...) {
        // yhi in 0..3
        for(unsigned char ylo=...) {
            unsigned char y=yhi*8+ylo;
            tile=...;
            vram[(unsigned char)8*x+yhi*0x100+ylo]=tile;
        } // ylo
    }  // yhi
} // x

"Demons after money.
Whatever happened to the still beating heart of a virgin?
No one has any standards anymore." -- Giles

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

clawson wrote:
Then again LTO is principally for the gcc "big iron" and it's also fairly new so it may not do a great job for AVR just yet until an AVR linker expert works on it a bit.
I've fiddled with LTO, although not extensively.

In 4.7.2 it would in some cases make code significantly less efficient.

4.8.1 appears to have improved, but I haven't spent a lot of time with it.

"Experience is what enables you to recognise a mistake the second time you make it."

"Good judgement comes from experience.  Experience comes from bad judgement."

"Wisdom is always wont to arrive late, and to be a little approximate on first possession."

"When you hear hoofbeats, think horses, not unicorns."

"Fast.  Cheap.  Good.  Pick two."

"We see a lot of arses on handlebars around here." - [J Ekdahl]

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

sparrow2 wrote:
How many places call this code?
If not to many force the compiler to make it inline, then you will save at least 7 clk.

The code I posted is at the heart of the SetPixel routine that is within the the bresenham code. So there are 18+ places that can call it. As I am going for speed over size in-lining is assumed :) There is 64 kilobytes so I can afford to waste 500 duplicating that code 20 times.

It was not the whole code block I was interested in though. It was just that one little part. Saving one clock cycle from VRAM access. That one construct to access the VRAM memory

VRAM[address];

is something I could not figure out how to hint/trick the compiler to do the obvious thing that the ASM does. In fact it is 1/2 the reason for me aligning the VRAM on a 10 bit boundary.*

I am asking here to "close the gap". I have it about as fast as (I think) it's ever going to happen on an AVR8.

I am hoping to learn how to make my C better/closer to ASM.

skeeve wrote:

unsigned char *vram2=(unsigned char*)(0xFF00 & (int)vram);

OK. I hardly understand that.

A temporary variable vram2
vram2 is a pointer to an u_char
it becomes the vram address masked with 0xFF00

I will try it in the code to see how it works out.

* The other 1/2 reason for the 10 bit alignment is the code that puts the pixel stream out to the TV. The pixel render code also gets the same 1 clock advantage. It is one of those ASM only - C could never do it things. It only has 5 clocks per pixel. A one clock advantage to it is significant.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

andrewm1973 wrote:
How do I hint/convince/trick GCC into making VRAM[Address] only bother to add/sub the top 8 bits?
When you have reached that level, you want to use assembler (likely) or badly hack GCC (unlikely).

avrfreaks does not support Opera. Profile inactive.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

andrewm1973 wrote:
The code I posted is at the heart of the SetPixel routine that is within the the bresenham code. So there are 18+ places that can call it. As I am going for speed over size in-lining is assumed :) There is 64 kilobytes so I can afford to waste 500 duplicating that code 20 times.
Is this the case for SetRamTile() as well? If so, GCC does a better job when inlining, because arguments are likely to be moved into the correct register immediately instead of comforming to the calling convention:
void SetRamTile(uint8_t x, uint8_t y, uint8_t tile) {
  uint8_t hi = (y>>3) & 3;
  uint8_t lo = (x<<3) | (y&7);
  vram[(hi<<8) | lo] = tile;
}

int main (void) {
  while (1) {
    SetRamTile(GPIOR0, GPIOR1, GPIOR2);
  }
}
00000090 
: int main (void) { while (1) { SetRamTile(GPIOR0, GPIOR1, GPIOR2); 90: 9b b5 in r25, 0x2b ; 43 92: 8a b5 in r24, 0x2a ; 42 94: ee b3 in r30, 0x1e ; 30 void SetRamTile(uint8_t x, uint8_t y, uint8_t tile) { uint8_t hi = (y>>3) & 3; uint8_t lo = (x<<3) | (y&7); 96: ee 0f add r30, r30 98: ee 0f add r30, r30 9a: ee 0f add r30, r30 9c: 28 2f mov r18, r24 9e: 27 70 andi r18, 0x07 ; 7 a0: e2 2b or r30, r18 void SetRamTile(uint8_t x, uint8_t y, uint8_t tile) { uint8_t hi = (y>>3) & 3; a2: 86 95 lsr r24 a4: 86 95 lsr r24 a6: 86 95 lsr r24 a8: 83 70 andi r24, 0x03 ; 3 uint8_t lo = (x<<3) | (y&7); vram[(hi<<8) | lo] = tile; aa: f0 e0 ldi r31, 0x00 ; 0 ac: f8 2b or r31, r24 ae: e0 50 subi r30, 0x00 ; 0 b0: ff 4f sbci r31, 0xFF ; 255 b2: 90 83 st Z, r25 b4: ed cf rjmp .-38 ; 0x90

The meat is now down to 15 words and 16 cycles. There is still a word/cycle wasted by an ldi/or where a mov would suffice, and the lsl/swap trick is unsurprisingly still absent, but the margin is now down to 2 words/cycles over the optimal ASM.

Quote:
A temporary variable vram2
vram2 is a pointer to an u_char
it becomes the vram address masked with 0xFF00

I will try it in the code to see how it works out.

I haven't tried it, but I expect care will be required in order to not incur additional overhead in the form of mov/movw instructions, and you'll likely still need to break the complex expression down as before.

Quote:
It is one of those ASM only - C could never do it things.
Never say never ;) ...

"Experience is what enables you to recognise a mistake the second time you make it."

"Good judgement comes from experience.  Experience comes from bad judgement."

"Wisdom is always wont to arrive late, and to be a little approximate on first possession."

"When you hear hoofbeats, think horses, not unicorns."

"Fast.  Cheap.  Good.  Pick two."

"We see a lot of arses on handlebars around here." - [J Ekdahl]

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

SprinterSB wrote:
or badly hack GCC
Type-punning and the like? Yeah, best avoided.

Any other hacking you can think of?

"Experience is what enables you to recognise a mistake the second time you make it."

"Good judgement comes from experience.  Experience comes from bad judgement."

"Wisdom is always wont to arrive late, and to be a little approximate on first possession."

"When you hear hoofbeats, think horses, not unicorns."

"Fast.  Cheap.  Good.  Pick two."

"We see a lot of arses on handlebars around here." - [J Ekdahl]

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

joeymorin wrote:
andrewm1973 wrote:
A temporary variable vram2
vram2 is a pointer to an u_char
it becomes the vram address masked with 0xFF00

I will try it in the code to see how it works out.

I haven't tried it, but I expect care will be required in order to not incur additional overhead in the form of mov/movw instructions, and you'll likely still need to break the complex expression down as before.

I could not get the suggested pointer tricks to come in any better than my baseline shown below

uint8_t eXL = PINB;     // IN something from PINB and PINC
uint8_t eXH = PINC;     // so the compiler doesn't optimize

eXH &= 3;

vram[eXH<<8|eXL] = 0x22;

Skeeve - Any tips to how to change the above with your pointer suggestion?

joeymorin wrote:
Quote:
It is one of those ASM only - C could never do it things.
Never say never ;) ...

I feel pretty confident I can say never here.

Willing to be proven wrong though.

I can start up another thread about how the 4.1 colour video mode ASM works. I probably would even offer a prize/reward for any C guru that can duplicate it without using any ASM.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

andrewm1973 wrote:
I can start up another thread about how the 4.1 colour video mode ASM works. I probably would even offer a prize/reward for any C guru that can duplicate it without using any ASM.

Now your talking. :wink: Many here like a good challenge. :)

"I may make you feel but I can't make you think" - Jethro Tull - Thick As A Brick

"void transmigratus(void) {transmigratus();} // recursio infinitus" - larryvc

"It's much more practical to rely on the processing powers of the real debugger, i.e. the one between the keyboard and chair." - JW wek3

"When you arise in the morning think of what a privilege it is to be alive: to breathe, to think, to enjoy, to love." -  Marcus Aurelius

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

andrewm1973 wrote:

uint8_t eXL = PINB;     // IN something from PINB and PINC
uint8_t eXH = PINC;     // so the compiler doesn't optimize

eXH &= 3;

vram[eXH<<8|eXL] = 0x22;

Skeeve - Any tips to how to change the above with your pointer suggestion?


unsigned char *vram2=(unsigned char*)(0xFF00 & (int)vram);
uint8_t eXL = PINB;     // IN something from PINB and PINC
uint8_t eXH = PINC;     // so the compiler doesn't optimize

eXH &= 3;

vram2[eXH<<8|eXL] = 0x22;

vram2 is a pointer whose high byte can be found with "LDI , hi8(vram)"
and whose low byte does not need finding because it is 0.

"Demons after money.
Whatever happened to the still beating heart of a virgin?
No one has any standards anymore." -- Giles

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Skeeve,

When I tried that I ended up with

	unsigned char *vram2=(unsigned char*)(0xFF00 & (int)vram);
	 uint8_t eXL = PINB;     // IN something from PINB and PINC
    9e02:	23 b1       	in	r18, 0x03	; 3
	 uint8_t eXH = PINC;     // so the compiler doesn't optimize
    9e04:	46 b1       	in	r20, 0x06	; 6

	 eXH &= 3;

	 vram2[eXH<<8|eXL] = 0x22;
    9e06:	e0 e0       	ldi	r30, 0x00	; 0
    9e08:	fc e0       	ldi	r31, 0x0C	; 12
    9e0a:	e0 70       	andi	r30, 0x00	; 0
    9e0c:	94 2f       	mov	r25, r20
    9e0e:	93 70       	andi	r25, 0x03	; 3
    9e10:	80 e0       	ldi	r24, 0x00	; 0
    9e12:	30 e0       	ldi	r19, 0x00	; 0
    9e14:	82 2b       	or	r24, r18
    9e16:	93 2b       	or	r25, r19
    9e18:	e8 0f       	add	r30, r24
    9e1a:	f9 1f       	adc	r31, r25
    9e1c:	82 e2       	ldi	r24, 0x22	; 34
    9e1e:	80 83       	st	Z, r24

Similar with either -Os or -O3

Also tried with the 4.7.2 in AVR studio and a 4.8.x from somewhere early last year.

All of them don't drop the 16 bit math. Some of them are just strange (as above).

Do I need a newer GCC?

If it's temperamental about it, probably just stick to SprinterSB advice that this one is best left in the "just better of with ASM" catergory.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

avr-gcc seems to have about the right idea regarding vram2,
but it explicitly loads the lower byte and does some strange things.
In the worst case, there should have been 5 instructions.
Perhaps casting eXH<<8 to 8 bits would help.
The best case would be 3 instructions.

What does the .s file look like?
I'd like to know what symbols some of those numbers came from.

Edit: (unsigned char)(eXH<<8)==0 .

"Demons after money.
Whatever happened to the still beating heart of a virgin?
No one has any standards anymore." -- Giles

Last Edited: Wed. Feb 5, 2014 - 04:52 PM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Skeeve, that is a cut and paste of the .lss file.

What else can I post to help?

BTW if I type cast to uchar it gets better

	 vram2[(unsigned char)(eXH<<8)|eXL] = 0x22;
    9e06:	e0 e0       	ldi	r30, 0x00	; 0
    9e08:	fc e0       	ldi	r31, 0x0C	; 12
    9e0a:	e0 70       	andi	r30, 0x00	; 0
    9e0c:	e8 0f       	add	r30, r24
    9e0e:	f1 1d       	adc	r31, r1
    9e10:	82 e2       	ldi	r24, 0x22	; 34
    9e12:	80 83       	st	Z, r24
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

OK found the -S switch

main:
.LFB2:
.LSM0:
	push r16
/* prologue: function */
/* frame size = 0 */
.LSM1:
	in r24,35-32
.LVL0:
.LSM2:
	in r25,38-32
.LVL1:
.LSM3:
	ldi r30,lo8(vram)
	ldi r31,hi8(vram)
	andi r30,lo8(-256)
	add r30,r24
	adc r31,__zero_reg__
	ldi r24,lo8(34)
.LVL2:
	st Z,r24
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

joeymorin wrote:
SprinterSB wrote:
or badly hack GCC
Type-punning and the like? Yeah, best avoided.

Any other hacking you can think of?

I meant hacking the GCC sources. Just the fact that an object is aligned does not justify to skip the high bits of address computation.

avrfreaks does not support Opera. Profile inactive.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

SprinterSB wrote:
I meant hacking the GCC sources.

Dear lord that sounds drastic.

Also I installed 4.8.2 from the MHV tools set. No joy there

I think we put this one in the ASM is marginally ahead basket.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Not real relevant for the last part of this but.
I downloaded studio 6.1 sunday, (upto now I have used a winAVR 10.10.10 or something like that), and the code it generate use a MUL by 8 for the <<3 (both for -OS and -O2) how do you get it to make "your" output code ?

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

skeeve wrote:
Perhaps casting eXH<<8 to 8 bits would help.
Oops.
Casting to 8 bits makes it 0.
code semicolon comments by skeeve, andrewm1973 wrote:
OK found the -S switch

main:
.LFB2:
.LSM0:
	push r16
/* prologue: function */
/* frame size = 0 */
.LSM1:
	in r24,35-32    ; could have been r30
.LVL0:
.LSM2:
	in r25,38-32
.LVL1:
.LSM3:
	ldi r30,lo8(vram)  ; what I was looking for
	ldi r31,hi8(vram)
	andi r30,lo8(-256)
	add r30,r24   ; could have been a mov or nothing
	adc r31,__zero_reg__  ; could have been nothing
	ldi r24,lo8(34)
.LVL2:
	st Z,r24

BTW I fairly easily get 6 instructions and 7 cycles for the whole works without casting away eXH.
    in r30, PINB
    in r31, PINC
    andi r31, 3
    addi r31, hi8(vram)
    ldi r24, 0x22
    st Z, r24

"Demons after money.
Whatever happened to the still beating heart of a virgin?
No one has any standards anymore." -- Giles

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

sparrow2 wrote:
Not real relevant for the last part of this but.
I downloaded studio 6.1 sunday, (upto now I have used a winAVR 10.10.10 or something like that), and the code it generate use a MUL by 8 for the <<3 (both for -OS and -O2) how do you get it to make "your" output code ?
It wasn't on AS (I use the plain old avr-gcc toolchain from Atmel with 4.8.1) but I saw that behaviour as well.

However, it was in a specific circumstance. I had compiled and linked (a test program with a skeletal main() which called SetRamTile()) in one step, but had not used -ffunction-sections -Wl,--gc-sections.

SetRamTile() was present as a callable function with a ret, but wasn't ever called. It was also inlined into main. The dead function contained the mul, but the live in-lined code used the 3 lsl. Don't know why. Didn't dig deeper.

Curiously, the mul approach has the same flash word cost, but a higher (by 1) cycle cost and greater register pressure since the results end up in r0/r1.

I expect there is an optimisation rule that catches this kind of thing, but not for dead code (why would it?)

JJ

"Experience is what enables you to recognise a mistake the second time you make it."

"Good judgement comes from experience.  Experience comes from bad judgement."

"Wisdom is always wont to arrive late, and to be a little approximate on first possession."

"When you hear hoofbeats, think horses, not unicorns."

"Fast.  Cheap.  Good.  Pick two."

"We see a lot of arses on handlebars around here." - [J Ekdahl]

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

R0/R1 does not add up to the register pressure because these registers are used implicitly only.

avrfreaks does not support Opera. Profile inactive.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

SprinterSB wrote:
I meant hacking the GCC sources
Yeesh.
Quote:
Just the fact that an object is aligned does not justify to skip the high bits of address computation.
I was thinking (inadvisably) of:
  unsigned char *vram2=(unsigned char*)(0xFF00 & (int)vram);
  uint16_t eX;

  *((uint8_t*)(&eX)+0) = PINB; // low
  *((uint8_t*)(&eX)+1) = PINC; // high

  vram2[eX] = 0x22; 

But I don't know how to keep the stack frame from getting involved:

int main (void) {
  90:	cf 93       	push	r28
  92:	df 93       	push	r29
  94:	00 d0       	rcall	.+0      	; 0x96 
  96:	cd b7       	in	r28, 0x3d	; 61
  98:	de b7       	in	r29, 0x3e	; 62

  unsigned char *vram2=(unsigned char*)(0xFF00 & (int)vram);
  uint16_t eX;

  *((uint8_t*)(&eX)+0) = PINB; // low
  9a:	83 b1       	in	r24, 0x03	; 3
  9c:	89 83       	std	Y+1, r24	; 0x01
  *((uint8_t*)(&eX)+1) = PINC; // high
  9e:	86 b1       	in	r24, 0x06	; 6
  a0:	8a 83       	std	Y+2, r24	; 0x02

uint8_t vram[256] __attribute__ ((__aligned__ (1024))) ;

int main (void) {

  unsigned char *vram2=(unsigned char*)(0xFF00 & (int)vram);
  a2:	e0 e0       	ldi	r30, 0x00	; 0
  a4:	f4 e0       	ldi	r31, 0x04	; 4
  a6:	ee 27       	eor	r30, r30
  uint16_t eX;

  *((uint8_t*)(&eX)+0) = PINB; // low
  *((uint8_t*)(&eX)+1) = PINC; // high

  vram2[eX] = 0x22; 
  a8:	89 81       	ldd	r24, Y+1	; 0x01
  aa:	9a 81       	ldd	r25, Y+2	; 0x02
  ac:	e8 0f       	add	r30, r24
  ae:	f9 1f       	adc	r31, r25
  b0:	82 e2       	ldi	r24, 0x22	; 34
  b2:	80 83       	st	Z, r24

}
  b4:	0f 90       	pop	r0
  b6:	0f 90       	pop	r0
  b8:	df 91       	pop	r29
  ba:	cf 91       	pop	r28

Yeesh.

"Experience is what enables you to recognise a mistake the second time you make it."

"Good judgement comes from experience.  Experience comes from bad judgement."

"Wisdom is always wont to arrive late, and to be a little approximate on first possession."

"When you hear hoofbeats, think horses, not unicorns."

"Fast.  Cheap.  Good.  Pick two."

"We see a lot of arses on handlebars around here." - [J Ekdahl]

 

Last Edited: Wed. Feb 5, 2014 - 05:44 PM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

SprinterSB wrote:
R0/R1 does not add up to the register pressure because these registers are used implicitly only.
Does it not require that r1 be zeroed again? I suppose that's not 'register pressure' though...

"Experience is what enables you to recognise a mistake the second time you make it."

"Good judgement comes from experience.  Experience comes from bad judgement."

"Wisdom is always wont to arrive late, and to be a little approximate on first possession."

"When you hear hoofbeats, think horses, not unicorns."

"Fast.  Cheap.  Good.  Pick two."

"We see a lot of arses on handlebars around here." - [J Ekdahl]

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Correct. "Register pressure" would mean "take away the register(s) from other code so it cannot hold values in these registers".

The need to clear R1 adds to consumed time and code size.

avrfreaks does not support Opera. Profile inactive.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

One might try replacing vram with ((unsigned char*)0xC00).

In any case, I suspect that SetPixelTile should be to made into a macro.
That would allow speedups through common sub-expression elimination.

"Demons after money.
Whatever happened to the still beating heart of a virgin?
No one has any standards anymore." -- Giles

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Such hacks won't help. GCC does not perform such optimizations.

avrfreaks does not support Opera. Profile inactive.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

SprinterSB wrote:
Such hacks won't help. GCC does not perform such optimizations.
Wait... mine or skeeve's? Mine I expect...

"Experience is what enables you to recognise a mistake the second time you make it."

"Good judgement comes from experience.  Experience comes from bad judgement."

"Wisdom is always wont to arrive late, and to be a little approximate on first possession."

"When you hear hoofbeats, think horses, not unicorns."

"Fast.  Cheap.  Good.  Pick two."

"We see a lot of arses on handlebars around here." - [J Ekdahl]

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Closest I've got it is using 0xC0. If I mention vram it does the 16bit maths.

	unsigned char eXL = PINB;     // IN something from PINB and PINC
	unsigned char eXH = PINC;     // so the compiler doesn't optimize

	 eXH &= 3;
	 eXH |= 0x0C;

	 unsigned char *vram2 = (unsigned char *)((eXH<<8)|(eXL));
	 vram2[0] = 0x22;

Now if it could just not load into R24 and then move it into R31.

	unsigned char eXL = PINB;     // IN something from PINB and PINC
    9900:	e3 b1       	in	r30, 0x03	; 3
	unsigned char eXH = PINC;     // so the compiler doesn't optimize
    9902:	86 b1       	in	r24, 0x06	; 6

	 eXH &= 3;
    9904:	83 70       	andi	r24, 0x03	; 3
	 eXH |= 0x0C;
    9906:	8c 60       	ori	r24, 0x0C	; 12

	 unsigned char *vram2 = (unsigned char *)((eXH<<8)|(eXL));
    9908:	f0 e0       	ldi	r31, 0x00	; 0
    990a:	f8 2b       	or	r31, r24
	 vram2[0] = 0x22;
    990c:	82 e2       	ldi	r24, 0x22	; 34
    990e:	80 83       	st	Z, r24
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Mind the implicit integer promotions. Does -mint8 help?

avrfreaks does not support Opera. Profile inactive.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

-mint8 does not stop the r24 -> r31 waste.

It does however seem to break just about everything in the uzebox kernel so prob would not have been an option even if it helped.

Now that I have 4.8.2 set up - it seems to do better in some other things than 4.7.x

I will go through my list of things I did in ASM now and try pick out the next simplest/smallest example I would like to learn how to make C do as well or get close.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I've got the candles, I've got the chalk pentagram.  Let the necromancy begin.

 

Howdy all, I've been gone over a year doing stuff.  I'm back to optimizing C code again now.  My next one is about multiplying an 8 bit unsigned number with an 8 bit signed number and then right shifting the result by 8 bits (its fixed point).

 

	xAdd = (Z * xAdd)>>8;

The code in C works.  The little aliens fly around the screen exactly as I expected my fixed point maths would make them.  It however makes this code

 

int8_t  xAdd;
uint8_t Z;
.
.
.
xAdd = (Z * xAdd)>>8;
    mov	    r18, r20
    ldi	    r19, 0x00	; 0
    adiw    r24, 0x01	; 1
    add	    r24, r24
    adc	    r25, r25
    add	    r30, r24
    adc	    r31, r25
    ld	    r24, Z
    eor	    r25, r25
    sbrc    r24, 7
    com	    r25
    mul	    r18, r24
    movw    r20, r0
    mul	    r18, r25
    add	    r21, r0
    mul     r19, r24
    add	    r21, r0
    eor	    r1, r1
    mov	    r20, r21
    add	    r21, r21
    sbc	    r21, r21

I was doing this in a .S file but just yesterday I learnt how to do inline ASM (to save some MOV/RCALL/RETs) so will show my version here in that form

 

asm(   "mulsu	%r0, %r1 \n\t"
       "mov	%r0, r1\n\t"
       "clr	r1"
       : "+a" (xAdd)
       : "a" (Z)
);

Which predictably looks like this in the LSS file

 

asm(	"mulsu	%r0, %r2 \n\t"

    mulsu   r18, r23
    mov	    r18, r1
    eor     r1, r1

	"clr	r1"
	: "+a" (yAdd)
	: "a" (Z)

So - what am I doing wrong.  How do I hint to the compiler to make my (signed 8:0) * (unsigned 0:8) be close to 4 clock cycles ?  I know I am a pretty average C coder, but I am trying to improve that so I have to resort to ASM less often.
 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Have you tried using any optimization level besides -O0? With this test program:

 

extern int8_t x;
extern uint8_t y;
int8_t mul(void)
{
    return (x * y) >> 8;
}

and compiled with -O1 I get this assembly output:

  10                    mul:
  11                    /* prologue: function */
  12                    /* frame size = 0 */
  13                    /* stack size = 0 */
  14                    .L__stack_usage = 0
  15 0000 3091 0000             lds r19,x
  16 0004 2091 0000             lds r18,y
  17 0008 3203                  mulsu r19,r18
  18 000a C001                  movw r24,r0
  19 000c 1124                  clr __zero_reg__
  20 000e 892F                  mov r24,r25
  21 0010 0895                  ret

Curiously, -O2, -O3, and -Os swap r18 and r19, but it ends up with the same results.

 

This is with avr-gcc version 4.8.2.

 

I say avr-gcc does a decent job with this task!

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

The code I pasted above was with 4.8.2 with -Os.

 

Yes it does do a semi-reasonable effort if it is in a subroutine and operating on globals. That is 20 clocks when you include the RCALL, RET and the two ST to get the local/automatics into a global RAM address. So thats 4 clocks faster than just

 

result = (scale * size)>>8;

 

So so far I have

 

24 clocks = C code operating on variables already in register
20 clocks = C code calling a subroutine working on globals (include call/ret/shuffling)
15 clocks = ASM in a .S file being called as a subroutine (include call/ret/shuffling)
4 clocks = inline ASM using the registers the variables are already in

 

Is there not a way to get the compiler closer?  When I do a subroutine as Christop showed I get the same result as he did.  When I do it inline it blows out.

 

Joey Morin and others gave me some good pointers in the previous examples I had. In this one apart from it being a return() value straight away, I can not work out how to get the compiler to not promote the whole thing to 16 bits.

 

I guess the hint to look at is why does

 

return(X * Y) >> 8;

 

spot that the result is 8 bits in r1 but

 

result = (X * Y) >> 8;

 

tries to work on 16 bit values ?
 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

If it helps here is the code I am working on for this.  The MulSU function is called by around 10 different routines with different variables - not just here.

 

void moveXYCommon(ObjectDescStruct *objData){

	uint8_t Z;
	uint8_t lane;

	Z = pgm_read_byte(&zTable[objData->z]);					// Perspective transform the objects Z value

	objData->s = pgm_read_byte(&scaleTable[objData->z]);	// Get the scale of the object at given Z value

	lane = objData->laneHi;

	int8_t x = laneData[lane].xStart;						// Get the X/Y start position of the current lane
	int8_t y = laneData[lane].yStart;

	int8_t xAdd = laneData[lane].xGrad;						// Get the gradient of the lane
	int8_t yAdd = laneData[lane].yGrad;

	xAdd = (Z * xAdd)>>8;									// Work out how far the object is from the start of the lane
	yAdd = (Z * yAdd)>>8;

	//xAdd = MulSU(Z, xAdd);
	//yAdd = MulSU(Z, yAdd);

	//MulSU2(Z, xAdd, yAdd);

	objData->x = x + xAdd;									// Add the delta X/Y to the lane start position
	objData->y = y + yAdd;
}

 

The commented out lines are the equivelant ASM routine from a .S file and the below #define

 

#define MulSU1(Z, a) \
		asm(	"mulsu	%r0, %r1 \n\t"	\
				"mov	%r0, r1\n\t"	\
				"clr	r1"				\
				: "+a" (a)				\
				: "a" (Z)				\
		);

#define MulSU2(Z, a, b) \
		asm(	"mulsu	%r0, %r2 \n\t"	\
				"mov	%r0, r1\n\t"	\
				"mulsu	%r1, %r2 \n\t"	\
				"mov	%r1, r1\n\t"	\
				"clr	r1"				\
				: "+a" (a)				\
				, "+a" (b)				\
				: "a" (Z)				\
		);

 

As I said - I only just worked out how to do inline ASM yesterday, but it seems to be useful.

 

Maybe I have done something dumb in one of the lines above the multiplication?  I've tried a few different things to see if I can help it.  Splitting the (Z * xAdd) and the >>8 up only made it worse.
 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Before using time on a other way, this question, does the compiler make better code if both are unsigned ?(Then the mul can be from any register, so perhaps it does it different).

 

And a side question to the experts are there a way to inline an ASM function?  

Pages