avr-gcc creating loop counters out of nothing, for no reason?

Go To Last Post
24 posts / 0 new
Author
Message
#1
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Over in another forum, ralphd posted about speeding up some SW SPI routines. ( http://nerdralph.blogspot.ca/201... )

 

But to me, the interesting thing is that avr-gcc is doing weird things; creating a (16bit !) loop counter for a loop that doesn't have or need it.

Any ideas why it's doing this?

 

#define SPIPORT PORTB
#define clkpinmask 4
#define mosipinmask 8

void spiWrite(uint8_t data)
{
 uint8_t bit;
 for(bit = 0x80; bit; bit >>= 1) {
  SPIPORT &= ~clkpinmask;
  if(data & bit) SPIPORT |= mosipinmask;
  else SPIPORT &= ~mosipinmask;
  SPIPORT |= clkpinmask;
 }
}

00000000 <spiWrite>:
   0:   28 e0           ldi     r18, 0x08       ; 8   Created loop counter in r19:r18
   2:   30 e0           ldi     r19, 0x00       ; 0
   4:   90 e8           ldi     r25, 0x80       ; 128
   6:   2d 98           cbi     0x05, 5 ; 5
   8:   49 2f           mov     r20, r25
   a:   48 23           and     r20, r24
   c:   11 f0           breq    .+4             ; 0x12
   e:   2c 9a           sbi     0x05, 4 ; 5
  10:   01 c0           rjmp    .+2             ; 0x14
  12:   2c 98           cbi     0x05, 4 ; 5
  14:   2d 9a           sbi     0x05, 5 ; 5
  16:   96 95           lsr     r25
  18:   21 50           subi    r18, 0x01       ; 1   decrement the created loop counter (16bits!)
  1a:   31 09           sbc     r19, r1
  1c:   21 15           cp      r18, r1         ; compare loop counter with zero.
  1e:   31 05           cpc     r19, r1
  20:   91 f7           brne    .-28            ; 0x6
  22:   08 95           ret

 

 

Last Edited: Mon. Mar 30, 2015 - 06:15 AM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I don't know why it does it that way. But if you add an 8 bit counter you'll get an 8 bit counter (that's still not needed, but at least it's not 16 bits).

    uint8_t bit, i;
    for(i = 0, bit = 0x80; i < 8; i++, bit >>= 1) {
 1d0:   98 e0           ldi     r25, 0x08       ; 8
 1d2:   20 e8           ldi     r18, 0x80       ; 128
 1d4:   2a 98           cbi     0x05, 2 ; 5
 1d6:   32 2f           mov     r19, r18
 1d8:   38 23           and     r19, r24
 1da:   11 f0           breq    .+4             ; 0x1e0 <spiWrite+0x10>
 1dc:   2b 9a           sbi     0x05, 3 ; 5
 1de:   01 c0           rjmp    .+2             ; 0x1e2 <spiWrite+0x12>
 1e0:   2b 98           cbi     0x05, 3 ; 5
 1e2:   2a 9a           sbi     0x05, 2 ; 5
 1e4:   26 95           lsr     r18
 1e6:   91 50           subi    r25, 0x01       ; 1
 1e8:   a9 f7           brne    .-22            ; 0x1d4 <spiWrite+0x4>
 1ea:   08 95           ret

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

'for' is generally a "counted" loop - what happens if you make it a 'while' instead...?

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

if you want to make sure write in ASM, perhaps you find a way where it don't make a counter, but perhaps the compile do again in a newer version, you change code or variables (change the register use).

But if you need speed why not just unroll. (my guess is that -O3 will do that)  

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

There are thread(s) where one of the 'Freaks demonstrated a faster SPI than the approach shown.  I can never remember the 'Freak (skeeve?) and always have a hard time finding the thread(s).  The approach is to "pre-process" the SPI data with an XOR mask, and then (especially if unrolled) each bit is constant-time of a few cycles, and SBI to the PIN used if needed to change the state of the output pin.

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

theusch wrote:

There are thread(s) where one of the 'Freaks demonstrated a faster SPI than the approach shown.  I can never remember the 'Freak (skeeve?) and always have a hard time finding the thread(s).  The approach is to "pre-process" the SPI data with an XOR mask, and then (especially if unrolled) each bit is constant-time of a few cycles, and SBI to the PIN used if needed to change the state of the output pin.

Use a skip followed by an out.  To ensure constant-time, the potentially skipped instruction needs to be single-cycle.

International Theophysical Year seems to have been forgotten..
Anyone remember the song Jukebox Band?

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

what happens if you make it a 'while' instead...?

 the same thing, actually.  Nearly identical code  (and yes, that's even stranger...)

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

even a do while make an counter

and even with goto! 

so the compiler think it's smart that it can find out that the loop count are fixed :(

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Ok the best output code I could find was to use a global variable that always are 0x80 (and again next compiler version are perhaps so clever that it know that it can be done with a loop counter :( )

 

  be:	28 2f       	mov	r18, r24
  c0:	90 91 00 01 	lds	r25, 0x0100
  c4:	99 23       	and	r25, r25
  c6:	29 f4       	brne	.+10     	; 0xd2 <test+0x14>
  c8:	0c c0       	rjmp	.+24     	; 0xe2 <test+0x24>
  ca:	c3 9a       	sbi	0x18, 3	; 24
  cc:	c2 9a       	sbi	0x18, 2	; 24
  ce:	96 95       	lsr	r25
  d0:	41 f0       	breq	.+16     	; 0xe2 <test+0x24>
  d2:	c2 98       	cbi	0x18, 2	; 24
  d4:	89 2f       	mov	r24, r25
  d6:	82 23       	and	r24, r18
  d8:	c1 f7       	brne	.-16     	; 0xca <test+0xc>
  da:	c3 98       	cbi	0x18, 3	; 24
  dc:	c2 9a       	sbi	0x18, 2	; 24
  de:	96 95       	lsr	r25
  e0:	c1 f7       	brne	.-16     	; 0xd2 <test+0x14>
  e2:	08 95       	ret

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

The questions that begs for answers are:

Is this a problem?

How big a problem is it?

Why?

"He used to carry his guitar in a gunny sack, or sit beneath the tree by the railroad track. Oh the engineers would see him sitting in the shade, Strumming with the rhythm that the drivers made. People passing by, they would stop and say, "Oh, my, what that little country boy could play!" [Chuck Berry]

 

"Some questions have no answers."[C Baird] "There comes a point where the spoon-feeding has to stop and the independent thinking has to start." [C Lawson] "There are always ways to disagree, without being disagreeable."[E Weddington] "Words represent concepts. Use the wrong words, communicate the wrong concept." [J Morin] "Persistence only goes so far if you set yourself up for failure." [Kartman]

Last Edited: Tue. Mar 31, 2015 - 03:37 PM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

For me it's a surprise how bad the compiled code sometimes are. And the bad code show up where you least expect it.

So there are still a long way before a compiler make general code close to ASM, in speed/size. 

 

But perhaps there are a gcc expert that can tell why, and perhaps there are a flag to avoid the "error".

 

 

Add:

putting mask in a register give some OK code :) (without a loop counter)

Last Edited: Wed. Apr 1, 2015 - 08:30 AM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

But perhaps there are a gcc expert

Such a person does not really exist. GCC is written in two halves - a front end and a back end - you tend to get experts in one or the other. Issues like this often occur when one does not work well with the other.

 

When GCC compiles it compiles to a language and processor independent intermediate language called GIMPLE (this is what the front end does) then the backend is target specific and generates CPU specific code from the GIMPLE.

 

I forget what the compile time switch is but there's one you can set to have the compiler spew that intermediate code. Your best hope of understanding what's going on here is probably to find that then study that code and also take a look at the code generator for AVR and see how it would be interpreting it.

 

This:

 

https://gcc.gnu.org/onlinedocs/g...

 

reminds me it is -fdump-tree-gimple.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

ok thanks I will look at it when I get home.

I remember when I was playing with the LCC compiler (the one imagecraft use/used), it has a backend called "symbolic" and that was a good way to see how the compiler was "thinking"

 

A small side question , are the optimizer a frontend or backend job ? 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

You probably want to read this:

 

https://gcc.gnu.org/onlinedocs/g...

 

In fact back up to the very top and read the entire thing if really interested.

 

(actually a quick read of that reminds me I forgot all about GENERIC)

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I consider it a problem because I don't like to give fuel to the "compilers produce much worse code than an assembler programmer would write" attitude.

In this case, the code produced is particularly poor, for inexplicable reasons (there is "C level optimization" that attempts to avoid the loop counter, and merge the shift/loop-end decision, and the compiler essentially UNDOES that optimization.)

That the reasons appear inexplicable is particularly worrying, since I don't know where else these decisions will affect my code.  :-(

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Late to the party, but...

 

I've seen other similarly silly code:

  WDT_REG = (1<<WDCE) | (1<<WDE);
 12a:	e0 e6       	ldi	r30, 0x60	; 96
 12c:	f0 e0       	ldi	r31, 0x00	; 0
 12e:	88 e1       	ldi	r24, 0x18	; 24
 130:	80 83       	st	Z, r24
  WDT_REG = (1<<WDP3) | (1<<WDP0);
 132:	81 e2       	ldi	r24, 0x21	; 33
 134:	80 83       	st	Z, r24

... although likely not related to your example.

 

This is two cycles slower than it would be without involving Z, and is no smaller.

 

It gets particularly annoying when it happens in an ISR (and it does), which has the effect of costing an additional 8 cycles to push and pop r30/31.

 

When I first saw it, I thought 'Ah!  The compiler must have done an analysis of nearby code and seen that there are references to nearby memory, so it will use st Z+K rN instead of sts ADDR, rN in order to produce smaller (if not faster) code.

 

Nope:

  WDT_REG = (1<<WDCE) | (1<<WDE);
 12a:	e0 e6       	ldi	r30, 0x60	; 96
 12c:	f0 e0       	ldi	r31, 0x00	; 0
 12e:	88 e1       	ldi	r24, 0x18	; 24
 130:	80 83       	st	Z, r24
  WDT_REG = (1<<WDP3) | (1<<WDP0);
 132:	81 e2       	ldi	r24, 0x21	; 33
 134:	80 83       	st	Z, r24

  CLKPR = (1<<CLKPCE);
 136:	e1 e6       	ldi	r30, 0x61	; 97
 138:	f0 e0       	ldi	r31, 0x00	; 0
 13a:	80 e8       	ldi	r24, 0x80	; 128
 13c:	80 83       	st	Z, r24
  CLKPR = 0;
 13e:	10 82       	st	Z, r1

Note how Z is explicitly loaded with the address of WDT_REG, then again with that of CLKPR, when the compiler could at least have done something like:

  WDT_REG = (1<<WDCE) | (1<<WDE);
 12a:	e0 e6       	ldi	r30, 0x60	; 96
 12c:	f0 e0       	ldi	r31, 0x00	; 0
 12e:	88 e1       	ldi	r24, 0x18	; 24
 130:	80 83       	st	Z, r24
  WDT_REG = (1<<WDP3) | (1<<WDP0);
 132:	81 e2       	ldi	r24, 0x21	; 33
 134:	80 83       	st	Z, r24

  CLKPR = (1<<CLKPCE);
     	80 e8       	ldi	r24, 0x80	; 128
     	            	st	Z+1 r24
  CLKPR = 0;
     	            	st	Z+1 r1

... but it fails completely.

"Experience is what enables you to recognise a mistake the second time you make it."

"Good judgement comes from experience.  Experience comes from bad judgement."

"When you hear hoofbeats, think horses, not unicorns."

"Fast.  Cheap.  Good.  Pick two."

"Read a lot.  Write a lot."

"We see a lot of arses on handlebars around here." - [J Ekdahl]

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

@joeymorin: It looks like code I would suspect of -O0 .

I think it's strange even for -O0 .

I note it does LDI R31, 0 twice.

Loading the Z register could save flash if it's used three or more times.

 

Edit: tipo fixed

International Theophysical Year seems to have been forgotten..
Anyone remember the song Jukebox Band?

Last Edited: Sat. Apr 25, 2015 - 06:14 PM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

skeeve wrote:

@joeymorin: I looks like code I would suspect of -O0 .

I think it's strange even for -O0 .

It does look that way, but that's what I get from all optimisation levels (0, 1, 2, 3, and s).

 

When I first saw this (a while ago now) I had experimented a with various compiler options but to no avail.

 

By the way, this is with avr-gcc version 4.8.1 from the latest version of the toolchain (3.4.5).

 

The thing is I was pretty certain this was a regression, but I hadn't taken the time to pin down which version of the toolchain introduced it.  And just checking now I find that even my distro's stock AVR-GCC (4.3.4 on Ubuntu Lucid) generates the same awful code, so I must have been dreaming...

"Experience is what enables you to recognise a mistake the second time you make it."

"Good judgement comes from experience.  Experience comes from bad judgement."

"When you hear hoofbeats, think horses, not unicorns."

"Fast.  Cheap.  Good.  Pick two."

"Read a lot.  Write a lot."

"We see a lot of arses on handlebars around here." - [J Ekdahl]

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I wonder if for some reason the code has bypassed optimization.

Is there any way to ask the compiler?

International Theophysical Year seems to have been forgotten..
Anyone remember the song Jukebox Band?

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

You can get the compiler to dump its internal representations. 

 

But if this is a regression I'd try doing a binary search on intervening versions to find out where it actually broke then study that specific patch to see if the change was a side effect of something else or deliberate for some reason. I'll bet it's found to be a generic change that improved x86 or ARM code that had a detrimental side effect on AVR. 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

But if that is the case, there most be some cost benefit rules for the AVR that needs to be ajusted.
 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Just a note:

on a xmega a ST take 1 clk so your code are same size and speed in a xmega, and more general (if it just used it).

So perhaps this is a xmega change that has hit mega's as well. 

 

(and I don't think that it count extra push and pop this is a HW limitation that a thread needs to store shared parts.)

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

sparrow2 wrote:
on a xmega a ST take 1 clk so your code are same size and speed in a xmega, and more general (if it just used it).

So perhaps this is a xmega change that has hit mega's as well.

Perhaps.  Good point though.  However it would still be a missed optimisation to re-load Z instead of actually taking advantage of Z+K notation.

"Experience is what enables you to recognise a mistake the second time you make it."

"Good judgement comes from experience.  Experience comes from bad judgement."

"When you hear hoofbeats, think horses, not unicorns."

"Fast.  Cheap.  Good.  Pick two."

"Read a lot.  Write a lot."

"We see a lot of arses on handlebars around here." - [J Ekdahl]

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

If it is still interesting for any one, it looks like gcc option '-fno-tree-loop-ivcanon' disables creation of loop counter. You can check the description of '-ftree-loop-ivcanon' option in gcc manual (e.g. https://gcc.gnu.org/onlinedocs/gcc-4.8.0/gcc/Optimize-Options.html or, at least, any later version).