Inappropriate "optimization" of loop to division

Go To Last Post
17 posts / 0 new
Author
Message
#1
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I have AVR code:

static unsigned char timer0_fract;

void foo (void) 
{
	unsigned char fractmp = timer0_fract;

	fractmp += 3;
	while (fractmp >= (unsigned char)125) {
	    fractmp -= (unsigned char) 125;
	}
	timer0_fract = fractmp;
}

void bar (void)
{
	unsigned char fractmp = timer0_fract;

	fractmp += 3;
	if (fractmp >= (unsigned char)125) {
	    fractmp -= (unsigned char) 125;
	}
	timer0_fract = fractmp;
}

avr-gcc "cleverly" recognizes the while loop as performing division, and ends up calling a divide function instead of doing an actual loop. This is undesirable; I happen to KNOW that the loop will execute a relatively small number of times, and in the important case, only once. I do not want the overhead of calling divide functions, I do not want the space consumed by divide functions. How can I defeat the compilers cleverness? (You'll notice that if it was slightly MORE clever, it could put a maximum of about 2 times through the while loop, based on the constants used. Sigh.) (The compiler makes the same optimization for "do {} while (cond)" and "for (;cond;) {}" style loops.)

This apparently happened in between 4.0.2 and 4.3.0 versions of the compiler. I really don't like being second-guessed to this extent...

(here is the code produced, from objdump of the .o file)

Disassembly of section .text:

00000000 :
static unsigned char timer0_fract;

void foo (void) 
{
   0:   20 91 00 00     lds     r18, 0x0000
   4:   2d 5f           subi    r18, 0xFD       ; 253

        fractmp += 3;
        while (fractmp >= (unsigned char)125) {
            fractmp -= (unsigned char) 125;
        }
        timer0_fract = fractmp;
   6:   82 2f           mov     r24, r18
   8:   6d e7           ldi     r22, 0x7D       ; 125
   a:   0e 94 00 00     call    0       ; [some divide function]
   e:   93 e8           ldi     r25, 0x83       ; 131
  10:   89 9f           mul     r24, r25
  12:   80 2d           mov     r24, r0
  14:   11 24           eor     r1, r1
  16:   28 0f           add     r18, r24
  18:   20 93 00 00     sts     0x0000, r18
}
  1c:   08 95           ret

0000001e :

void bar (void)
{
  1e:   90 91 00 00     lds     r25, 0x0000
        unsigned char fractmp = timer0_fract;

        fractmp += 3;
  22:   89 2f           mov     r24, r25
  24:   8d 5f           subi    r24, 0xFD       ; 253
        if (fractmp >= (unsigned char)125) {
  26:   8d 37           cpi     r24, 0x7D       ; 125
  28:   00 f0           brcs    .+0             ; 0x2a 
            fractmp -= (unsigned char) 125;
  2a:   8d 57           subi    r24, 0x7D       ; 125
        }
        timer0_fract = fractmp;
  2c:   80 93 00 00     sts     0x0000, r24
}
  30:   08 95           ret

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Making fractmp volatile produces this:

static unsigned char timer0_fract; 

void foo (void) 
{ 
  9a:	df 93       	push	r29
  9c:	cf 93       	push	r28
  9e:	0f 92       	push	r0
  a0:	cd b7       	in	r28, 0x3d	; 61
  a2:	de b7       	in	r29, 0x3e	; 62
   volatile unsigned char fractmp = timer0_fract; 
  a4:	80 91 60 00 	lds	r24, 0x0060
  a8:	89 83       	std	Y+1, r24	; 0x01

   fractmp += 3; 
  aa:	89 81       	ldd	r24, Y+1	; 0x01
  ac:	8d 5f       	subi	r24, 0xFD	; 253
  ae:	02 c0       	rjmp	.+4      	; 0xb4 
   while (fractmp >= (unsigned char)125) { 
       fractmp -= (unsigned char) 125; 
  b0:	89 81       	ldd	r24, Y+1	; 0x01
  b2:	8d 57       	subi	r24, 0x7D	; 125
  b4:	89 83       	std	Y+1, r24	; 0x01
void foo (void) 
{ 
   volatile unsigned char fractmp = timer0_fract; 

   fractmp += 3; 
   while (fractmp >= (unsigned char)125) { 
  b6:	89 81       	ldd	r24, Y+1	; 0x01
  b8:	8d 37       	cpi	r24, 0x7D	; 125
  ba:	d0 f7       	brcc	.-12     	; 0xb0 
       fractmp -= (unsigned char) 125; 
   } 
   timer0_fract = fractmp; 
  bc:	89 81       	ldd	r24, Y+1	; 0x01
  be:	80 93 60 00 	sts	0x0060, r24
}
  c2:	0f 90       	pop	r0
  c4:	cf 91       	pop	r28
  c6:	df 91       	pop	r29
  c8:	08 95       	ret

Maybe not quite what you were after but there's no CALL to the __udivmodqi4 function ;-)

[actually I thought the compiler would say "how can an automatic be volatile? - don't be stupid" but apparently it doesn't]

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

This is wild. 4.2.2 does not have this "feature" - one more reason to avoid update to the newest version at all cost.

I thought I could cheat it with a goto:

   fractmp = timer0_fract;
   fractmp += 3;
label:
   if (fractmp >= (unsigned char)125) {
     fractmp -= (unsigned char) 125;
     goto label;
   }
   timer0_fract = fractmp;

but it was "smarter" than that... :-(

JW

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Quote:
[actually I thought the compiler would say "how can an automatic be volatile? - don't be stupid" but apparently it doesn't]
Even an automatic can be changed by an interrupt (through a pointer).

volatile uint8_t flag = 0;
SetGlobalWritePointerForInterrupt(&flag);
while (!flag);  //wait for event

Perhaps it is not really useful, but possible. ;-)

Stefan Ernst

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Quote:

(through a pointer).

Oh yes, of course, now I understand!

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Quote:
4.2.2 does not have this "feature"
I still see the same behavior in 4.3.2 :-(

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Quote:

I still see the same behavior in 4.3.2

This is with or without the volatile?

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

without the volatile.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Then I guess I'm a little confused. From what is posted, it would seem to me that the optimizer got better over the years, and replaced what it detected as repeated subtraction as better being done as division.

I'll also surmise that you chose this brand of compiler 'cause it has a great classical code generator that seeks out and destroys less-than-optimal coding practices that the lowly programmer gives to it.

So, in certain cases you know better. From your discussion you purposefully created that sequence 'cause you know the conditions. But you probably (not noted above that I can see) told the compiler to be aggressive with optimization.

As you are well aware that you want particular output, it seems that making these "volatile" is a simple workaround to still allow aggressive compiler optimization, but tells it to leave its hands off this sequence.

Maybe that is too onerus in a whole (tight) program that is riddled with this and similar? I don't know what I'd advise then. For this mere mortal there might be a couple tricky bits here-and-there and a workaround is sufficient.

Lee

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Pass the variable though inline assembler:

asm("" : "+r"(fractmp));
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

TimothyEBaldwin wrote:
Pass the variable though inline assembler:

asm("" : "+r"(fractmp));


... inside the loop. Wonderful.

This is the second time in a week's time I hear about a such trick as a cheaper replacement for volatile. Should this go to some "tips & tricks" part of this Forum or some of the manuals?

JW

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

theusch wrote:
Then I guess I'm a little confused. From what is posted, it would seem to me that the optimizer got better over the years, and replaced what it detected as repeated subtraction as better being done as division.
A modulus operation might be faster on a machine with hardware for it,
but on an AVR, with an unsigned char for one operand and 125 for the other,
the "optimization" makes no sense.

The loop could be replaced with two ifs.
That would probably do what the OP wants.

Moderation in all things. -- ancient proverb

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I would point out that neither gcc 4.2.1 or Intel's icc for x86 feels like it needs to make this sort of optimization, even though it has a much faster divide operation that the AVR...

This is code aimed at an ISR, so "as tight as possible" would be nice. The "volatile" solution is clearly "correct", but is slightly more painful than necessary.

The example was compiled with -Os
I don't THINK that the version with the divide is actually shorter than an actual loop would be. (The "if" version is significantly shorter.)

So, this is someone (?) being clever "I recognized a divide!" It's not shorter, it's not what the programmer intended, it's a construct where the programmer might well have had a good reason for using the loop, it's not even faster for a significant number of cases, and it's "obviously" not faster for the specific case I claim bogus! Grr.

asm("" : "+r"(fractmp)); 

Huh? While my inclination is to believe that it would be clearer to write the whole function in asm than to use gcc's mysterious asm syntax, I'm not seeing how this sample bit would fit in...

Quote:
The loop could be replaced with two ifs.
Yes, in this particular case. Unfortunately, this code actually occurs at the bottom of some preprocessor magic that will sometimes result in fractmp being a ulong, and the while loop possibly looping more than twice (though still consuming fewer cycles than an actual divide!)

The full code is a timer0 overflow ISR that calculates a millisecond timestamp, BTW. Sometimes (depending on CPU clockrate) it's able to do faster magic by knowing that each tick is some number of milliseconds plus some number of microseconds where the latter can fit in a byte (if scaled.) For other (less common) clockrates it's not so lucky and does long math for both parts, essentially.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Patches to the compiler are welcome.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Quote:

I don't THINK that the version with the divide is actually shorter than an actual loop would be.

Has someone actually used the cycle counter in the simulator and verified that? I have to admit I didn't look at exactly what __udivmodqi4 () consists of.

Cliff

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

avr-gcc wrote:

000000ca <__udivmodqi4>:
  ca:	99 1b       	sub	r25, r25
  cc:	79 e0       	ldi	r23, 0x09	; 9
  ce:	04 c0       	rjmp	.+8      	; 0xd8 <__udivmodqi4_ep>

000000d0 <__udivmodqi4_loop>:
  d0:	99 1f       	adc	r25, r25
  d2:	96 17       	cp	r25, r22
  d4:	08 f0       	brcs	.+2      	; 0xd8 <__udivmodqi4_ep>
  d6:	96 1b       	sub	r25, r22

000000d8 <__udivmodqi4_ep>:
  d8:	88 1f       	adc	r24, r24
  da:	7a 95       	dec	r23
  dc:	c9 f7       	brne	.-14     	; 0xd0 <__udivmodqi4_loop>
  de:	80 95       	com	r24
  e0:	08 95       	ret

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

westfw wrote:
Quote:
The loop could be replaced with two ifs.
Yes, in this particular case. Unfortunately, this code actually occurs at the bottom of some preprocessor magic that will sometimes result in fractmp being a ulong, and the while loop possibly looping more than twice (though still consuming fewer cycles than an actual divide!)

Perhaps the preprocessor magic could include more ifs, e.g.:
if(sizeof(dividend)==1) {
    if(divisor==1) dividend=0;
    if(divisor*64<=dividend) dividend-=divisor*64;
    if(divisor*32<=dividend) dividend-=divisor*32;
    if(divisor*16<=dividend) dividend-=divisor*16;
    if(divisor* 8<=dividend) dividend-=divisor* 8;
    if(divisor* 4<=dividend) dividend-=divisor* 4;
    if(divisor* 2<=dividend) dividend-=divisor* 2;
    if(divisor* 1<=dividend) dividend-=divisor* 1;
} else if(sizeof(dividend)==2) {
    ....
}

In the case at hand, all but two of the ifs should be optimized away.
There is, of course, the possibility that this too will be recognized as a divide.

Quote:
The full code is a timer0 overflow ISR that calculates a millisecond timestamp, BTW. Sometimes (depending on CPU clockrate) it's able to do faster magic by knowing that each tick is some number of milliseconds plus some number of microseconds where the latter can fit in a byte (if scaled.) For other (less common) clockrates it's not so lucky and does long math for both parts, essentially.

Moderation in all things. -- ancient proverb