AVR-gcc de-optimizing my code

Go To Last Post
33 posts / 0 new
Author
Message
#1
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I've found that avr-gcc automatically generates loop counters even when I don't want it to.

 

Check out this example code that counts how many pins in PORT D are set (this is just an example to illustrate the problem). I'm using Arduino IDE.

 


int main() {
	byte mask=1, count=0;
	do
		if (PIND & mask) count++;
	while (mask <<=1);
	return count;
}

This loop executes 8 times, because the mask variable will eventually overflow its 8 bit size and become zero, stopping the loop. Now check out the assembly output:

 

00000080 <main>:

int main() {
  80:   28 e0           ldi     r18, 0x08       ; 8
  82:   30 e0           ldi     r19, 0x00       ; 0
        byte mask=1, count=0;
  84:   80 e0           ldi     r24, 0x00       ; 0
  86:   91 e0           ldi     r25, 0x01       ; 1
        do {
                if (PIND & mask) count++;
  88:   49 b1           in      r20, 0x09       ; 9
  8a:   49 23           and     r20, r25
  8c:   09 f0           breq    .+2             ; 0x90 <main+0x10>
  8e:   8f 5f           subi    r24, 0xFF       ; 255

int main() {
        byte mask=1, count=0;
        do {
  90:   99 0f           add     r25, r25
  92:   21 50           subi    r18, 0x01       ; 1
  94:   31 09           sbc     r19, r1
  96:   c1 f7           brne    .-16            ; 0x88 <main+0x8>
                if (PIND & mask) count++;
        } while (mask <<=1);
        return count;
}
  98:   90 e0           ldi     r25, 0x00       ; 0
  9a:   08 95           ret

The compiler noticed that the loop executes 8 times, so it decided to create a counter of its own accord, worse, it creates a 16 bit counter to store a value of 8 using r18 and r19 no and then decrements it inside the loop (addr 92 and 94). The compiler at least could have been smart enough to see that 8 could fit in 8 bits, after all the ATmega is a 8 bit processor and 16 bit operations have extra costs.

 

If I disable this "optimization" feature of gcc using the option "--param max-iterations-to-track=1", the code I was expecting is generated:

 

00000080 <main>:

int main() {
        byte mask=1, count=0;
  80:   80 e0           ldi     r24, 0x00       ; 0
  82:   91 e0           ldi     r25, 0x01       ; 1
        do {
                if (PIND & mask) count++;
  84:   29 b1           in      r18, 0x09       ; 9
  86:   29 23           and     r18, r25
  88:   09 f0           breq    .+2             ; 0x8c <main+0xc>
  8a:   8f 5f           subi    r24, 0xFF       ; 255

int main() {
        byte mask=1, count=0;
        do {
  8c:   99 0f           add     r25, r25
  8e:   d1 f7           brne    .-12            ; 0x84 <main+0x4>
                if (PIND & mask) count++;
        } while (mask <<=1);
        return count;
}
  90:   90 e0           ldi     r25, 0x00       ; 0
  92:   08 95           ret

So the code becomes faster and smaller disabling this optimization, I hope there aren't many like this inside gcc. Should I warn avr-gcc developers of this issue? I mean, it's not really a bug, just non-optimal compiling.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Your posting in the wrong place. Either raise a bug in the bugzilla on Savannah or post to the avr-gcc developer list. Few,  if any,  read here. 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Ok.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

First which version of the compiler do you use , and what is the optimizing level.(-O3 will for sure not make a loop! but have other problems).

 

And don't expect a compiler to make optimal code, and some of the newer versions of the compiler make worse AVR code, I guess it's because the frontend has become more intelligent in the view of ARM and other "bigger" CPU's, and the backend for an AVR get's into problems.  

 

That the extra counter is 16bit and not 8 is probably because an int is 16 bit.

 

If you want "optimal" code for all situations you have to use ASM.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

 

avrfreaks does not support Opera. Profile inactive.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

 

 

 

avrfreaks does not support Opera. Profile inactive.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

My guess is that the loop counter code is CPU-independent,

hence the artificial counter is an int.

On CPUs the GNU guys care about,

int operations are single instructions on single registers.

 

That said, introducing an artificial counter at all would be a bad idea on a PC.

mask has to be processed regardless.

The resulting flag might as well be used.

Iluvatar is the better part of Valar.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I wonder if there isn't a specific optimisation which could be tickled via a command line switch or two, to tweak this behaviour.

"Experience is what enables you to recognise a mistake the second time you make it."

"Good judgement comes from experience.  Experience comes from bad judgement."

"Wisdom is always wont to arrive late, and to be a little approximate on first possession."

"When you hear hoofbeats, think horses, not unicorns."

"Fast.  Cheap.  Good.  Pick two."

"We see a lot of arses on handlebars around here." - [J Ekdahl]

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Joey, he said there is so I don't entirely understand why he thinks there's a problem. If you can get the compiler to generate what you want then surely that's OK? 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0
Try -fno-tree-loop-optimize
   
   
   

 

avrfreaks does not support Opera. Profile inactive.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Joey, he said there is so I don't entirely understand why he thinks there's a problem. If you can get the compiler to generate what you want then surely that's OK? 

Yes, but max-iterations-to-track=1 is a bit of a sledgehammer to crack a nut in this case.

 

Try -fno-tree-loop-optimize

More what I imagined.  Maybe a bit more targeted?  Also, old ground:

https://www.avrfreaks.net/forum/weird-optimized-code-due-ftree-loop-optimize

 

In either case, I suggest using the optimize function attribute to constrain the scope of the 'de'-optimisation.

 

EDIT: corrected quoted text

"Experience is what enables you to recognise a mistake the second time you make it."

"Good judgement comes from experience.  Experience comes from bad judgement."

"Wisdom is always wont to arrive late, and to be a little approximate on first possession."

"When you hear hoofbeats, think horses, not unicorns."

"Fast.  Cheap.  Good.  Pick two."

"We see a lot of arses on handlebars around here." - [J Ekdahl]

 

Last Edited: Sat. Jul 16, 2016 - 09:03 PM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Thanks guys, yep, -fno-tree-loop-optimize also works. Btw, optimization level is -Os and gcc version is 4.8.1

 

Also, I didn't know I could set optimization options on a per function basis, so thank you also for this info yes

 

So I'll say my problem is solved with a code like this:

 


int main() __attribute__ ((optimize("no-tree-loop-optimize")));

int main() {
	byte mask=1, count=0;
	do
		if (PIND & mask) count++;
	while (mask <<=1);
	return count;
}

This generates the "proper" code, directly using the flags from the shift (actually, add) instruction for the branch smiley

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

But remember that your code isn't optimal in the first place!

 

1 : You read the port 8 times

2:  If you shift the temp (holding PIND) you don't need the moving mask.

 

Remember that you can make inline ASM if you really need the speed, but in general I don't think you will need the speed.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Sure, I understand. This is not my actual code, just a simplified example to illustrate the problem.

 

To do like you said, shifting the temp and adding the carry flag to a counter,  can not be done in C, right? Because you have no direct access to the carry flag from C. Or is there some trick I don't know?

Edit: Anyway, if I don't use the mask, I will need a loop counter that has to be decremented, so it would take about the same time. But the port read should be moved outside the loop, of course.

 

I'm really not very good with C, I'm an asm programmer at heart and even that is just a hobby. It's because I usually program is asm that I was distressed when I found gcc was de-optimizing code cheeky I always hear about how great optimizing compilers are...

Last Edited: Sat. Jul 16, 2016 - 10:20 PM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

The shortest C code to perform this task is below: smiley

 

#include <avr/io.h>

int main()
{
    return __builtin_popcount(PIND);
}

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

The shortest non-portable GCC-extension-specific C code ;-)

"Experience is what enables you to recognise a mistake the second time you make it."

"Good judgement comes from experience.  Experience comes from bad judgement."

"Wisdom is always wont to arrive late, and to be a little approximate on first possession."

"When you hear hoofbeats, think horses, not unicorns."

"Fast.  Cheap.  Good.  Pick two."

"We see a lot of arses on handlebars around here." - [J Ekdahl]

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

There is no way to use C(arry) from C.

but if temp is false you are done.(but perhaps not so easy to utilize in C).
 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I've checked the assembly output of that function, it also takes an int argument, so it wastes half its time adding zeroes when the argument is 8 bit sized.

 

And talking about non-portability :-) fully unrolled loop is asm:

 

 int main() {
	int result;
	asm (
		"clr	r24			\n"
		"in	r25, %1			\n"
		"lsr	r25			\n"
		"adc	r24,__zero_reg__	\n"
		"lsr	r25			\n"
		"adc	r24,__zero_reg__	\n"
		"lsr	r25			\n"
		"adc	r24,__zero_reg__	\n"
		"lsr	r25			\n"
		"adc	r24,__zero_reg__	\n"
		"lsr	r25			\n"
		"adc	r24,__zero_reg__	\n"
		"lsr	r25			\n"
		"adc	r24,__zero_reg__	\n"
		"lsr	r25			\n"
		"adc	r24,__zero_reg__	\n"
		"lsr	r25			\n"
		"adc	r24,__zero_reg__	"
   		:"=r" (result)
		:"i" (_SFR_IO_ADDR(PIND)));
	return result;
}

I must stress I'm a gcc noob, so I'm not even sure this code is ok, seems so from the asm output. This inline assembly syntax really is awful, though, but I learned a lot just by trying to write it yes

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

If you can suggest that , you haven't seen the code it generates!

 

ok a minute to late :)

Last Edited: Sun. Jul 17, 2016 - 10:54 AM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I have, seems ok. If there is some mistake please point it out to me. Also, what you suggested is:

 

 int main()
{
	int result;
	asm (
		"clr	r24			\n"
		"in	r25, %1			\n"

		"lsr	r25			\n"
		"1:				\n"
		"adc	r24,__zero_reg__	\n"
		"lsr	r25			\n"
		"brne	1b			"

   		:"=r" (result)
		:"i" (_SFR_IO_ADDR(PIND)));
	return result;
}

 

edit: sorry, you were replying to the other guy blush

Last Edited: Sun. Jul 17, 2016 - 10:02 AM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

yes something like that but I guess that you need an extra adc after the brne!

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

sparrow2 wrote:
If you can suggest that , you haven't seen the code it generates!

 

Well I meant literally the C code one-liner because C is what we write most of; plus I was being tongue-in-cheek.

 

As for portability - I just tried it on X86 it works there also. Microsoft however have their own  __popcnt16  here: https://msdn.microsoft.com/en-us...

 

I have now inspected the code and despite the integer promotion it's actually very good.

Here's the AVR disassembly for 8-bits: I thought there was a bug here because I could see only 7 bits being counted but whoever wrote it knows what they're doing - the final adc r24, r0 counts two bits at once.

 

00000092 <__popcountqi2>:
  92:	08 2e       	mov	r0, r24
  94:	81 70       	andi	r24, 0x01	; 1
  96:	06 94       	lsr	r0
  98:	06 94       	lsr	r0
  9a:	81 1d       	adc	r24, r1
  9c:	06 94       	lsr	r0
  9e:	81 1d       	adc	r24, r1
  a0:	06 94       	lsr	r0
  a2:	81 1d       	adc	r24, r1
  a4:	06 94       	lsr	r0
  a6:	81 1d       	adc	r24, r1
  a8:	06 94       	lsr	r0
  aa:	81 1d       	adc	r24, r1
  ac:	06 94       	lsr	r0
  ae:	80 1d       	adc	r24, r0
  b0:	08 95       	ret

 

I would guess that using __builtin_popcount(PIND); even with the integer promotion is actually similar speed to the El Tangas original code.

 

Last Edited: Sun. Jul 17, 2016 - 11:36 AM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Yeah, that adding the 2 final bits at once is very clever. But it's defeated by the overhead of calls and rets inside the function, if all that could be inlined it would be much better.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

But it's a built-in? There won't be any CALL/RET overhead. 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

If it's so time-critical that you can't even afford a call, then why are you using a loop in the first place?

avrfreaks does not support Opera. Profile inactive.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

> But it's a built-in? There won't be any CALL/RET overhead.

 

If the value is not known at compile time, then avr-gcc uses a transparent libgcc call.

 

 

avrfreaks does not support Opera. Profile inactive.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

It's not really time critical, and not really an optimization problem. I just wanted to show that sometimes avr-gcc generates automatically useless loop counters that just waste time and space for no good reason at all.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0
	int result;
	asm (
		"clr	r24			\n"
		"in	r25, %1			\n"
		"lsr	r25			\n"
		"adc	r24,__zero_reg__	\n"
		"lsr	r25			\n"
		"adc	r24,__zero_reg__	\n"
		"lsr	r25			\n"
		"adc	r24,__zero_reg__	\n"
		"lsr	r25			\n"
		"adc	r24,__zero_reg__	\n"
		"lsr	r25			\n"
		"adc	r24,__zero_reg__	\n"
		"lsr	r25			\n"
		"adc	r24,__zero_reg__	\n"
		"lsr	r25			\n"
		"adc	r24,__zero_reg__	\n"
		"lsr	r25			\n"
		"adc	r24,__zero_reg__	"
   		:"=r" (result)
		:"i" (_SFR_IO_ADDR(PIND)));

 

You are hard-coding registers r24 and r25, without placing them in the clobber list.  You are relying on the compiler selecting those registers for result, but that is not a given.  Better to let the compiler select the registers for you, and use those selections.

 

 

Also, as an int, result will be forced to occupy two registers.

 

$ cat popcount.c
#include <avr/io.h>

int main(void) {
  uint8_t result;
  __asm__ __volatile__ (
                        "mov    %[res], __zero_reg__      \n\t"
                        "lsr    %[arg]                    \n\t"
                        "adc    %[res], __zero_reg__      \n\t"
                        "lsr    %[arg]                    \n\t"
                        "adc    %[res], __zero_reg__      \n\t"
                        "lsr    %[arg]                    \n\t"
                        "adc    %[res], __zero_reg__      \n\t"
                        "lsr    %[arg]                    \n\t"
                        "adc    %[res], __zero_reg__      \n\t"
                        "lsr    %[arg]                    \n\t"
                        "adc    %[res], __zero_reg__      \n\t"
                        "lsr    %[arg]                    \n\t"
                        "adc    %[res], __zero_reg__      \n\t"
                        "lsr    %[arg]                    \n\t"
                        "adc    %[res], %[arg]            \n\t"
                      :
                        [res] "=&r" (result)
                      :
                        [arg]   "r" (PIND)
                       );
  return result;
}
$ avr-gcc -Wall -O1 -g -c -mmcu=atmega328p popcount.c -o popcount.elf
$ avr-objdump -S popcount.elf

popcount.elf:     file format elf32-avr

Disassembly of section .text:

00000000 <main>:
                        "lsr    %[arg]                    \n\t"
                        "adc    %[res], %[arg]            \n\t"
                      :
                        [res] "=&r" (result)
                      :
                        [arg] "r" (PIND)
   0:   99 b1           in      r25, 0x09       ; 9
#include <avr/io.h>

int main(void) {
  uint8_t result;
  __asm__ __volatile__ (
   2:   81 2d           mov     r24, r1
   4:   96 95           lsr     r25
   6:   81 1d           adc     r24, r1
   8:   96 95           lsr     r25
   a:   81 1d           adc     r24, r1
   c:   96 95           lsr     r25
   e:   81 1d           adc     r24, r1
  10:   96 95           lsr     r25
  12:   81 1d           adc     r24, r1
  14:   96 95           lsr     r25
  16:   81 1d           adc     r24, r1
  18:   96 95           lsr     r25
  1a:   81 1d           adc     r24, r1
  1c:   96 95           lsr     r25
  1e:   89 1f           adc     r24, r25
                        [res] "=&r" (result)
                      :
                        [arg]   "r" (PIND)
                       );
  return result;
}
  20:   90 e0           ldi     r25, 0x00       ; 0
  22:   08 95           ret

Note that you could use __tmp_reg__ and forgo the need for the second register, but at the cost of a mov.

 

 

 

This inline assembly syntax really is awful

It's a little awkward, but extremely powerful, and not really that difficult to learn.  It is also very well documented.

"Experience is what enables you to recognise a mistake the second time you make it."

"Good judgement comes from experience.  Experience comes from bad judgement."

"Wisdom is always wont to arrive late, and to be a little approximate on first possession."

"When you hear hoofbeats, think horses, not unicorns."

"Fast.  Cheap.  Good.  Pick two."

"We see a lot of arses on handlebars around here." - [J Ekdahl]

 

Last Edited: Sun. Jul 17, 2016 - 02:34 PM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

But other times it's smart to do it!

Don't expect that that compiler do it the way you write it, as long it will do the same thing it is allowed to rearrange etc.

 

using temp and -O3 it can make this code:

    if (temp & mask) count++;

0000006B  MOV R24,R25        
0000006C  ANDI R24,0x01        
0000006D  SBRC R25,1        
0000006E  SUBI R24,0xFF        
0000006F  SBRC R25,2        
00000070  SUBI R24,0xFF        
00000071  SBRC R25,3        
00000072  SUBI R24,0xFF        
00000073  SBRC R25,4        
00000074  SUBI R24,0xFF        
00000075  SBRC R25,5        
00000076  SUBI R24,0xFF        
00000077  SBRC R25,6        
00000078  SUBI R24,0xFF        
00000079  SBRC R25,7        
0000007A  SUBI R24,0xFF        

 

(can't use the code editor for some reason (one big line))

And that is a natural way on a AVR.

-O2 don't make a loop but -Os does! 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

So the compiler really is quite smart, I need to choose the optimization options carefully, and also study this inline assembly stuff. I have learned much more from this thread than I was expecting, so tnx you all yes

 

However, I still found a new problem cheeky

Using this code (I know it's sub-optimal, it's just for illustration), in the end the compiler still clears r25 for the return value, even though it is inevitably already zero after 8 shifts. Is there a way to explain the compiler r25 does not need to be cleared again?

 

 int main()
{
	uint8_t result;
	asm (
		"clr	%[res]					\n"
		"lsr	%[arg]					\n"
		"adc	%[res],__zero_reg__		\n"
		"lsr	%[arg]					\n"
		"adc	%[res],__zero_reg__		\n"
		"lsr	%[arg]					\n"
		"adc	%[res],__zero_reg__		\n"
		"lsr	%[arg]					\n"
		"adc	%[res],__zero_reg__		\n"
		"lsr	%[arg]					\n"
		"adc	%[res],__zero_reg__		\n"
		"lsr	%[arg]					\n"
		"adc	%[res],__zero_reg__		\n"
		"lsr	%[arg]					\n"
		"adc	%[res],__zero_reg__		\n"
		"lsr	%[arg]					\n"
		"adc	%[res],__zero_reg__	"
   		:[res] "=&r" (result)
		:[arg] "r" (PIND)
		);
	return result;
}

generates:

00000080 <main>:
                "adc    %[res],__zero_reg__             \n"
                "lsr    %[arg]                          \n"
                "adc    %[res],__zero_reg__     "
                :[res] "=&r" (result)
                :[arg] "r" (PIND)
                );
  80:   99 b1           in      r25, 0x09       ; 9
  82:   88 27           eor     r24, r24
  84:   96 95           lsr     r25
  86:   81 1d           adc     r24, r1
  88:   96 95           lsr     r25
  8a:   81 1d           adc     r24, r1
  8c:   96 95           lsr     r25
  8e:   81 1d           adc     r24, r1
  90:   96 95           lsr     r25
  92:   81 1d           adc     r24, r1
  94:   96 95           lsr     r25
  96:   81 1d           adc     r24, r1
  98:   96 95           lsr     r25
  9a:   81 1d           adc     r24, r1
  9c:   96 95           lsr     r25
  9e:   81 1d           adc     r24, r1
  a0:   96 95           lsr     r25
  a2:   81 1d           adc     r24, r1
        return result;
}
  a4:   90 e0           ldi     r25, 0x00       ; 0
  a6:   08 95           ret

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

El Tangas wrote:
However, I still found a new problem cheeky

Using this code (I know it's sub-optimal, it's just for illustration), in the end the compiler still clears r25 for the return value, even though it is inevitably already zero after 8 shifts. Is there a way to explain the compiler r25 does not need to be cleared again?

If it's that big a deal, use assembly.

In the instant case, you have a microcontroller program returning from main.

Not good.

 

Note that "use assembly" can mean editing output from the compiler.

The larger the output, the less likely you are to care about the cycle or the word.

Iluvatar is the better part of Valar.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

First the reason to make sure r25 is 0 is that the function return an int.

 

And the compile don't seems to keep track on the value of R25

(actually I think that it's because for the count part R24 and R25 are two 8 bit variables, but for the return it's an int and the compiler trash the 8 bit in R25 to make the 16 bit return value, and yes in this situation it's not optimal )

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Is there a way to explain the compiler r25 does not need to be cleared again?

It is you who explicitly asked the compiler for a 16-bit return value by declaring your function main() with int.  Were you to declare the function with return type uint8_t, it wouldn't clear r25 because it wouldn't be part of the returned value.

 

As is (16-bit), it must clear the high byte because it has no notion of what you did in the asm statement.  As far as it is concerned, it is a 'black box'.  It can't make any assumptions about side effects of your code contained therein, beyond what you've communicated to it via the inline assembler mechanisms provided.  Namely, output operand list, input operand list, and clobber list.  Apart from that, it performs no analysis of your asm code.

"Experience is what enables you to recognise a mistake the second time you make it."

"Good judgement comes from experience.  Experience comes from bad judgement."

"Wisdom is always wont to arrive late, and to be a little approximate on first possession."

"When you hear hoofbeats, think horses, not unicorns."

"Fast.  Cheap.  Good.  Pick two."

"We see a lot of arses on handlebars around here." - [J Ekdahl]

 

Last Edited: Mon. Jul 18, 2016 - 12:26 AM