GCC - sub optimal register usage

Go To Last Post
28 posts / 0 new
Author
Message
#1
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Hi all,

I may be doing something dumb here as I am new to C on the AVR. I am using the latest AS6 and the GCC that comes standard.

I have reduced the code to shown below and also have the ASM output. (the OUTing to portb is to prevent it all being optimized away)

Why the useless extra MOV to and from R24 (address 51 and 55)? I googled for a few hours and I only come up with a post from 2005 asking if it is a bug or URLs pointing to non-gnu dot org that just time out on me.

#include 

register uint8_t i asm("r2");
register uint8_t j asm("r3");

int main(void)
{
i = 0;
j = 100;

while(1){
	while(i < j){
		i++;
		PORTB = 0x01;
		PORTB = 0x02;
	}
}
}
	i = 0;
00000049  CLR R2		Clear Register 
	j = 100;
0000004A  MOV R0,R31		Copy register 
0000004B  LDI R31,0x64		Load immediate 
0000004C  MOV R3,R31		Copy register 
0000004D  MOV R31,R0		Copy register 
			PORTB = 0x01;
0000004E  LDI R18,0x01		Load immediate 
			PORTB = 0x02;
0000004F  LDI R25,0x02		Load immediate 
00000050  RJMP PC+0x0005		Relative jump 
			i++;
00000051  MOV R2,R24		Copy register 
00000052  INC R2		Increment 
			PORTB = 0x01;
00000053  OUT 0x18,R18		Out to I/O location 
			PORTB = 0x02;
00000054  OUT 0x18,R25		Out to I/O location 
		while(i < j)
00000055  MOV R24,R2		Copy register 
00000056  CP R2,R3		Compare 
00000057  BRCS PC-0x06		Branch if carry set 
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

What is the optimization setting you use?

JW

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Does similar behavior on O0..O3 and Os.

O0 does some ludicrous things around LPM instructions. But they all seem to needlessly MOV from a bound-register to R24 and back again.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

It produces significantly better code without dedicating r2 and r3 to the purpose; I would suspect that the optimizer isn't handling the asm-assigned registers as well as it might. (moving quantities into r24 for the purpose of doing calculations, discovering that it doesn't actually need to have done so, and failing to remove the moves. Or something like that.)

I assume you have a better reason for binding the registers than shown in your example, though :-(

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Quote:
I assume you have a better reason for binding the registers than shown in your example, though

It was me being hopeful to speed up some line drawing code and keeping X0, Y0, X1 and Y1 in registers.

I was doing profiling and the non-bound-register version of my "scene-draw" routine with a test scene was taking 5.6uS. When I bound the X0,Y0,X1,Y1 in registers it actually slowed down to 5.7uS so I went looking at the ASM.

I am a long term ASM coder and this is my first serious foray into C on AVR.

I don't need huge performance increases to meet my "target" but it looks like I will have to do the "scene-draw" and down in ASM. (I already have done it all in stand alone ASM so know how fast if COULD be)

I realise that binding up all the registers makes it hard work for the compiler. I was however only using R2..R7 which according to the Avr Lib C FAQ should not be an issue.

Even the way the LPM instruction is being used to read ProgMem is a bit braindead and I could write ASM to be at least 2x as quick.

I don't know if I am not describing the problem well to the compiler or if the compiler is just not as smart as I expected.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Compiling on AVR-GCC 4.7.2 using -Os yields:

.global main
        .type   main, @function
main:
/* prologue: function */
/* frame size = 0 */
/* stack size = 0 */
.L__stack_usage = 0
        mov r2,__zero_reg__
        ldi r25,lo8(1)
        ldi r18,lo8(2)
        rjmp .L2
.L3:
        mov r2,r24
        inc r2
        out 0x5,r25
        out 0x5,r18
.L2:
        mov r24,r2
        cpi r24,lo8(100)
        brlo .L3
.L5:
        rjmp .L5
        .size   main, .-main
        .ident  "GCC: (GNU) 4.7.2"

Turning variables i and j into normal (auto) variables gives:

.global main
        .type   main, @function
main:
/* prologue: function */
/* frame size = 0 */
/* stack size = 0 */
.L__stack_usage = 0
        ldi r24,lo8(101)
        ldi r25,lo8(1)
        ldi r18,lo8(2)
        rjmp .L2
.L3:
        out 0x5,r25
        out 0x5,r18
.L2:
        subi r24,lo8(-(-1))
        brne .L3
.L4:
        rjmp .L4
        .size   main, .-main
        .ident  "GCC: (GNU) 4.7.2"

Jörg Wunsch

Please don't send me PMs, use email if you want to approach me personally.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Thank you yes Jörg.

I saw that when I didn't specify the "register" on the variable it did not do the extra useless MOV instructions.

I would like to know if there is a way to keep my four variables in registers and not have the stupidity happen.

Or is GCC on the AVR just broken in that respect.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Just a side note to say that a limitation of AVR is that opcodes such as LDI can only operate on R16..R31 so using R2/R3 then var=100 was always going to involve a round about route to get the 100 into 2/3.

My experience of 4.7.2 so far has been that its code generation model is brilliant and it will almost always produce optimal code anyway. Have you tried just writing C and seeing what it produces. Are there really occasions when it didn't just register cache your x0,y0,x1,y1?

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Quote:
I would like to know if there is a way to keep my four variables in registers and not have the stupidity happen.
Write your code in assembly.
Quote:
Or is GCC on the AVR just broken in that respect.
No, you broke it by trying to out smart the compiler. Let the compiler do the work.

Regards,
Steve A.

The Board helps those that help themselves.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

> I would like to know if there is a way to keep my four variables
> in registers and not have the stupidity happen.

See the second GCC 4.7.2 example above: actually, it's just *one*
variable anymore ("i", aka. r25). Variable "j" is a constant, and
encoded as such. The compiler uses some more registers (because its
got enough of them available) to keep some handy constants around.

You might also notice the compiler shortcuts the outer while loop, as
it knows the inner loop will never be executed again after the first
pass (because "i" cannot be less than 100 then).

Jörg Wunsch

Please don't send me PMs, use email if you want to approach me personally.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

clawson,

Thank you for chiming in. I have read 100s of your posts and they are always worthwhile.

Quote:
using R2/R3 then var=100 was always going to involve a round about route to get the 100 into 2/3

Thanks yes. I know LDI only works on R16+. This was just a contrived example to show the weird use of

00000051 MOV R2, R24
...
00000055 MOV R24, R2

The real program is using LPM to get the data into the registers. Was just trying to make it easier for the experts I am asking for help by reducing the code to the smallest possible snippit and still produce the erroneous result.

Quote:
Are there really occasions when it didn't just register cache

GCC did seem to be needlessly shuffling registers at times even before this. The calling functions where across several different files with several different target functions. This may have had something to do with that.

koshchi,

Write your code in assembly.
Quote:

I think I shall have to use my ASM version. I can beat the compiler at -O3 by about 30% here.

Quote:
you broke it by trying to out smart the compiler

Seeing as avr-libc#FAQ tells me I can do it. The "un-registered" code never touched R2..R7 before hand and the fact that MOV R2,R24...MOV R24,R2 is a totally useless construct I am going to call "broken".

AFAIK I'm not tying up registers it needs elsewhere. I am not forcing it to use registers that don't work with the needed opcodes.

In fact if I take the C generated code with the bound registers and the silly MOV-R24. Then I remove the stupid MOV instructions it DOES beat the compiler with -O3 by a small margin. So there must be some value in the "register" compiler directive.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

And how is it going to react when you put in real code instead of a contrived example?

Regards,
Steve A.

The Board helps those that help themselves.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Quote:
And how is it going to react when you put in real code instead of a contrived example?

Well obviously it reacted badly to the real code in the first place. That is why I made the contrived example to show the poor form.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Well the interwebs finally worked well enough for me to see bug #33050.

It does not look the same. So unless I am missing some keyword or compiler switch I guess this is a new failed optimization case.

BTW I am still very open to the idea I am doing something wrong here. The way I am telling the compiler use the registers, some keyword I am missing.

I don't accept that trying to suggest keeping a few globals in registers is the wrong thing to do. I know a lot more about the program behaviour than the compiler ever can. I know what routines get called 1000s of times and where a saving of 1 clock might count.

So in the end I am going to do the whole thing from "render-scene" down in ASM. I will be able to then get a fair bit more performance with different entry/exit points on line/line_to and all those other things that used to be important before ati and nvidea.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Quote:
I am still very open to the idea I am doing something wrong here.

Speaking as a technically interested Nth party, I'd be more interested in looking at the code your were originally trying to optimize (something with LPM), rather than debugging the compiler optimization of a relatively obscure feature of questionable merit.

AFAIK, the "register asm" feature is intended to do things like reserve some registers for use by "naked" interrupt service routines and such. Failure to mesh well with normal code isn't very surprising, nor interesting. OTOH, compilers are pretty much supposed to be beyond the point where user-level hints to put variables in registers are useful; the compiler do its own "really good" job of register allocation, and examples of when it doesn't ARE interesting...

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

westfw,

It does a pretty bloody good job all by itself without me hinting at it. Thumbs up to all the people that have worked hard on it.

I tried the "register" thing to see if that would help it get closer in performance to my pure ASM code. It was when it got worse after that I went looking at the compiled code and found the issue.

Maybe I am a heretic that should be chased out of C land for doubting the compiler. I do however fail to see how

055 MOV R24,  R2   ; Save a copy of R2
056 CP   R2,  R3   ; Compare R2 and R3 which does not change R2
057 BRCS PC-0x06   ; Branch back to 000051 if Carry Set
051 MOV  R2, R24   ; Copy R2 back even though R2 was never touched

is not broken making code 2x longer than needed.

Although my example is not very interesting it is the same seemingly faulty output the real code produces.

The end point of the real code is 2 Bresenham line routines and one plot_pixel. One line has 2 points and one is "Line_To" and therefore needs to have a global for the first point.

It was the ST and LD that I thought "register" would help get rid of. It DID get rid of them but then replaced them with pointless MOVs.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

> Although my example is not very interesting it is the same seemingly
> faulty output the real code produces.

Then stop beating on an obsolete compiler version. See above, GCC 4.7.2
(the currently most recent release) doesn't produce the code you don't
stop whining about here. No compiler developer can change old releases
anymore.

Show us your actual code, and compile it with a recent compiler.

Jörg Wunsch

Please don't send me PMs, use email if you want to approach me personally.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Sorry I didn't realise the version I downloaded from Atmel a few weeks ago was so out of date.

I now have 4.7.2 and it still exhibits the same behaviour.

#include 

register uint8_t gx0 asm("r2");
register uint8_t gy0 asm("r3");

void DumbMove()
{
	while(gx0 < gy0)
	{
		gx0++;
		PORTC = 0x01;
		PORTC = 0x00;
	}
}

int main(void)
{
    while(1)
    {
		gx0 = PINB;
		gy0 = PIND;
		DumbMove();
    }
}

It uses IN to a high register then MOVs it.

It MOVs to a high reg to CP and then MOVs back.

If you really want to see the real code then I will email it to you. I don't wish to reduce the signal to noise ratio here even further.

Everywhere else on the internet seems to want to see the minimum code to reproduce a bug rather than waste time with 30 pages of code when a dozen lines will suffice.

I assume you realise why I want globals with Line/LineTo. Especially with the nature of the application such that LineTo may not even know where the line is from depending on external stimulus.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

> It uses IN to a high register then MOVs it.

How else do you think the data would be able to go into your variables
gx0 and gy0? As Cliff already pointed out, only registers 16 and above
are possible targets for an IN statement. So in order to go into your
gx0 and gy0 variables, there is no other option but loading the data
into a temporary register via IN, and then move it there.

Of course, it won't be better if you turn your gx0 and gy0 back into
normal global variables, it's just replacing the MOV by an STS then.
Again, there's no way to avoid this.

The only way to avoid the separate load and move would be to have
gx0 and gy0 as local variables.

I'm sure your real code is completely different though, but you don't
want to show us your actual problem. It's pointless to discuss it on
some artificial piece of code that doesn't match the actual one.

Jörg Wunsch

Please don't send me PMs, use email if you want to approach me personally.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

dl8dtl,

OK thank you for that. I at least now know I can _ignore_ your advice :)

IN can work on R0..31
MOV can also

it is

LDI that can only work on R16..31

You seem unable to read english as I have never complained about an LDI using a temporary hi-reg.

Would someone with a clue like to chime in?

Not for my sake. I have already decided to go with ASM. However if someone with expertise can say it is an optimization problem I will try report a bug to the AVR-GCC people. (who I am sure actually ask you to make the minimum code that can reproduce the problem)

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Jörg

Sorry but that's LDI with the 16-31 restriction, IN works on all 32 registers.

The "tightest" version I can get for OPs code using the 4.7 that is in 6.1Beta is this:

void DumbMove(uint8_t gx0, uint8_t gy0)
{
   while(gx0 < gy0)
   {
      gx0++;
      PORTC = 0x01;
      PORTC = 0x00;
   }
}

int main(void)
{
    while(1)
    {
      DumbMove(PINB, PIND);
    }
}

which is probably what a C programmer would have written in the first place rather than trying to keep gx0/gy0 global in any way. The code produced is:

DumbMove:
//==> {
/* prologue: function */
/* frame size = 0 */
/* stack size = 0 */
.L__stack_usage = 0
//==>       PORTC = 0x01;
	ldi r25,lo8(1)	 ;  tmp52,
//==>    while(gx0 < gy0)
	rjmp .L2	 ; 
.L3:
//==>       gx0++;
	subi r24,lo8(-(1))	 ;  gx0,
//==>       PORTC = 0x01;
	out 0x15,r25	 ;  MEM[(volatile uint8_t *)53B], tmp52
//==>       PORTC = 0x00;
	out 0x15,__zero_reg__	 ;  MEM[(volatile uint8_t *)53B],
.L2:
//==>    while(gx0 < gy0)
	cp r24,r22	 ;  gx0, gy0
	brlo .L3	 ; ,
/* epilogue start */
//==> }
	ret
	.size	DumbMove, .-DumbMove

	.section	.text.startup.main,"ax",@progbits
.global	main
	.type	main, @function
main:
//==> {
/* prologue: function */
/* frame size = 0 */
/* stack size = 0 */
.L__stack_usage = 0
.L5:
//==>       DumbMove(PINB, PIND);
	in r22,0x10	 ;  D.1365, MEM[(volatile uint8_t *)48B]
	in r24,0x16	 ;  D.1367, MEM[(volatile uint8_t *)54B]
	call DumbMove	 ; 
	rjmp .L5	 ; 
	.size	main, .-main

Cliff

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Cliff,

Thanks. Yes the compiler does an excellent job without me using the global/register restriction. In fact it showed me up in one case where I didn't pick up that the hi/lo byte of a 16bit constant both evaluated to 0x30 so it did not need a 2nd LDI.

However as stated above due to the nature of the project I kinda need a global for x0 and y0 in some line drawing routines.

I did try make a struct to contain x0,y0,x1,y1 to pass in and this helped keep all the different .C and .H files on the same page. That tactic fell over when "LineTo" had to be called from a different path. From that point on I needed x0 and y0 as globals.

With x0,y0 as globals without the "register" keyword it still does a good job of handling the data. It does keep them in RAM though and Load/Store them which is too slow considering the frequency the routines are called.

The overall big picture is O*n. So getting O as quick as possible is what I want. The more "n" I can get into that O*n the prettier the effects are on the screen.

Last Edited: Thu. Mar 21, 2013 - 11:18 AM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Oh - I should add again. Before I even downloaded AS6 and started to play with C I had completed the whole scene drawing routine in ASM.

I do want to release this to the public so wanted at least the "user" part in C to make it easy for people. The "firmware" part looks like it is now going to be ASM as I don't think C is going to be quick enough.

I would never try compete with GCC doing something complex on a superscaler CPU with 10+ stage pipelines, out-of-order-execution, branch prediction, SIMD and optional-condition-codes. On the AVR though I still think a human with ASM is going to win.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

> Sorry but that's LDI with the 16-31 restriction, IN works on all 32 registers.

Yeah, OK, I should have re-read that part in the docs. ;-)

Maybe Johann could tell us why the compiler inserts a temporary
register here. But then, binding variables to registers surely
belongs to the less exposed (and thus less optimized) paths of the
compiler. I wouldn't be surprised if it initially matches to
operation to a pattern like "load IO port to variable", then allocates
the temporary register (for the memory store operation), and only
afterwards notices the variable lives in a register rather than
memory.

Jörg Wunsch

Please don't send me PMs, use email if you want to approach me personally.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

While an interesting discussion, I don't see the critical nature of optimal compiler performance.

Each compiler is going to have its own model of code generation. Sometimes it fits a low-level problem; sometimes it might not. Another compiler with a different model of code generation and register usage might fare better--in this particular case. Or worse.

Anyway as this is a critical bitblp or similar fragment, and performance is critical, and it will be done a zillion times--then hand optimize that piece.

Easier said than done, perhaps, as OP is new to AVR. But given the overall routine in C (to get stack frames and the like), and given the compiler output, I'm sure all the responding gurus could help out the AVR/GCC newbie to either do the critical fragment/loop in an ASM fragment, or if part of a small routine then to do the whole routine as an ASM source file.

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I agree that a missed optimization ( while important to take note of and add to the pipeline for consideration ) is not nearly as critical as a code generation bug. It does, however, appear to be a missed optimization. As such, it makes sense to check if it has been flagged in bugzilla and, if not make a report.

Once done, it seems reasonable ( given that am asm version is already written ) to go the assembly route. Keeping in mind the avr-gcc ABI, there is no reason that the function cannot integrate like any other C function as far as the user is concerned.

Martin Jay McKee

As with most things in engineering, the answer is an unabashed, "It depends."

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I am hardly new to either the AVR or C, just new to C on the AVR.

It is my familiarity with the AVR that made the sub-optimal code stand out.

Anyway, thank you both for agreeing that it is a bug/missed-optimization.

I had done a bit of checking prior to my first post and none of the bugzilla tickets I saw match this case.

I have read all the mixing ASM and C things I could find and think I should be OK there. If not I know where to come and ask for help.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Update for anyone interested.

GCC 4.7.2 is much better once inside the tight loops on the real code.

It actually only does the wasted MOV once at entry rather than on every loop decision.

My test scene times

5.6uS GCC 3.x.x with auto vars and RAM global
5.7uS GCC 3.x.x with register vars

5.4uS GCC 4.7.2 with auto vars and RAM global
4.9uS GCC 4.7.2 with register vars

for reference my ASM version is 3.9uS