Strange disoptimization on AVR-GCC 4.3.0

Go To Last Post
9 posts / 0 new
Author
Message
#1
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I tried this code:

#include 

#define TIME    0x0100

ISR( INT0_vect )
{
  int16_t x = TCNT1;

  x -= TIME / 2;
  if( x < 0 )
    x += TIME;

  TCNT1 = x;
}

And get this strange listing:

ISR( INT0_vect )
{
  5e:   1f 92           push    r1
  60:   0f 92           push    r0
  62:   0f b6           in      r0, 0x3f        ; 63
  64:   0f 92           push    r0
  66:   11 24           eor     r1, r1
  68:   2f 93           push    r18
  6a:   3f 93           push    r19
  6c:   8f 93           push    r24
  6e:   9f 93           push    r25
  int16_t x = TCNT1;
  70:   2c b5           in      r18, 0x2c       ; 44
  72:   3d b5           in      r19, 0x2d       ; 45

  x -= TIME / 2;
  74:   c9 01           movw    r24, r18
  76:   80 58           subi    r24, 0x80       ; 128
  78:   90 40           sbci    r25, 0x00       ; 0
  if( x < 0 )
  7a:   97 ff           sbrs    r25, 7
  7c:   03 c0           rjmp    .+6             ; 0x84 <__vector_1+0x26>
    x += TIME;
  7e:   c9 01           movw    r24, r18
  80:   80 58           subi    r24, 0x80       ; 128
  82:   9f 4f           sbci    r25, 0xFF       ; 255

  TCNT1 = x;
  84:   9d bd           out     0x2d, r25       ; 45
  86:   8c bd           out     0x2c, r24       ; 44
}
  88:   9f 91           pop     r25
  8a:   8f 91           pop     r24
  8c:   3f 91           pop     r19
  8e:   2f 91           pop     r18
  90:   0f 90           pop     r0
  92:   0f be           out     0x3f, r0        ; 63
  94:   0f 90           pop     r0
  96:   1f 90           pop     r1
  98:   18 95           reti

I understand not the useless fiddling with two more registers (R18, R19).

Has anybody an idea, why this disoptimization occur?
Or how it can be switched off?

It cost about 20% of the interrupt execution time.

Peter

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

danni wrote:
Has anybody an idea, why this disoptimization occur?

It appears that the optimiser has eliminated the second arithmetical expression evaluation, but used two more registers in the process.

[EDIT: Sorry, brain block. This is incorrect. What follows is fine though, I think.]

The following:

ISR( INT0_vect )
{
  int16_t x = TCNT1;

  if( x >= TIME/2 )
    x -= TIME/2;
  else
    x += TIME/2;

  TCNT1 = x;
} 

compiles to slightly fewer instructions (I haven't counted cycles or code space), and doesn't seem to use the extra registers.

+00000083:   921F        PUSH      R1             Push register on stack
+00000084:   920F        PUSH      R0             Push register on stack
+00000085:   B60F        IN        R0,0x3F        In from I/O location
+00000086:   920F        PUSH      R0             Push register on stack
+00000087:   2411        CLR       R1             Clear Register
+00000088:   938F        PUSH      R24            Push register on stack
+00000089:   939F        PUSH      R25            Push register on stack
29:         int16_t x = TCNT1;
+0000008A:   91800084    LDS       R24,0x0084     Load direct from data space
+0000008C:   91900085    LDS       R25,0x0085     Load direct from data space
31:         if( x >= TIME / 2 )
+0000008E:   3880        CPI       R24,0x80       Compare with immediate
+0000008F:   0591        CPC       R25,R1         Compare with carry
+00000090:   F01C        BRLT      PC+0x04        Branch if less than, signed
32:           x -= TIME/2;
+00000091:   5880        SUBI      R24,0x80       Subtract immediate
+00000092:   4090        SBCI      R25,0x00       Subtract immediate with carry
+00000093:   C002        RJMP      PC+0x0003      Relative jump
34:           x += TIME/2;
+00000094:   5880        SUBI      R24,0x80       Subtract immediate
+00000095:   4F9F        SBCI      R25,0xFF       Subtract immediate with carry
36:         TCNT1 = x;
+00000096:   93900085    STS       0x0085,R25     Store direct to data space
+00000098:   93800084    STS       0x0084,R24     Store direct to data space
37:       }
+0000009A:   919F        POP       R25            Pop register from stack
+0000009B:   918F        POP       R24            Pop register from stack
+0000009C:   900F        POP       R0             Pop register from stack
+0000009D:   BE0F        OUT       0x3F,R0        Out to I/O location
+0000009E:   900F        POP       R0             Pop register from stack
+0000009F:   901F        POP       R1             Pop register from stack
+000000A0:   9518        RETI 

You can still do better with hand-crafted ASM, I think, by combining the comparison with the subtraction, but I haven't taken time to check the details. You could certainly save SREG via R24, and you don't need R1, so you can eliminate two pairs of PUSH/POP instructions (for R0 and R1) and the CLR R1.

Christopher Hicks
==

Last Edited: Wed. Feb 11, 2009 - 10:07 AM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

If the code is simple and you worry about a few clock cycles, I would advise you to implement the ISR in assembler. That would also save the push/pop of r0/r1 and the eor.

Stefan Ernst

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I found a solution, but I have absolutely no clue, how it works. :?:

I insert an empty assembler statement:

#include 

#define TIME    0x0100

ISR( INT0_vect )
{
  int16_t x = TCNT1;

  x -= TIME / 2;

  __asm__ volatile (""::);      // the magic optimizer !!!

  if( x < 0 )
    x += TIME;

  TCNT1 = x;
}

The listing:

ISR( INT0_vect )
{
  6c:   1f 92           push    r1
  6e:   0f 92           push    r0
  70:   0f b6           in      r0, 0x3f        ; 63
  72:   0f 92           push    r0
  74:   11 24           eor     r1, r1
  76:   8f 93           push    r24
  78:   9f 93           push    r25
  int16_t x = TCNT1;
  7a:   80 91 84 00     lds     r24, 0x0084
  7e:   90 91 85 00     lds     r25, 0x0085

  x -= TIME / 2;
  82:   80 58           subi    r24, 0x80       ; 128
  84:   90 40           sbci    r25, 0x00       ; 0

  __asm__ volatile (""::);      // the magic optimizer !!!

  if( x < 0 )
  86:   97 ff           sbrs    r25, 7
  88:   02 c0           rjmp    .+4             ; 0x8e <__vector_1+0x22>
    x += TIME;
  8a:   80 50           subi    r24, 0x00       ; 0
  8c:   9f 4f           sbci    r25, 0xFF       ; 255

  TCNT1 = x;
  8e:   90 93 85 00     sts     0x0085, r25
  92:   80 93 84 00     sts     0x0084, r24
}
  96:   9f 91           pop     r25
  98:   8f 91           pop     r24
  9a:   0f 90           pop     r0
  9c:   0f be           out     0x3f, r0        ; 63
  9e:   0f 90           pop     r0
  a0:   1f 90           pop     r1
  a2:   18 95           reti

Peter

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Yoy must compile to different chip's
sts 0x0085 != in r18, 0x2c

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I had a similar strange optimization with a code like:

char x = a;
if (....) x = b;

this was in most cases translatet into something like

if (....) x = b;
else x = a;
,which is slower an longer.

only -O3 did the job near perfect, but made things longer somewhere else. Besides storing everything to RAM, -O1 was better than -O2 or -Os.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

sparrow2 wrote:
Yoy must compile to different chip's
sts 0x0085 != in r18, 0x2c

Sorry, the second example was compiled for the ATmega88 instead ATmega8.

But this changes nothing on the problem and the solution.

I found this solution also working on other places, where R18 was loaded with R24 without any reason.

Peter

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

yes I was playing with your "fake" ASM code
and it make a kind of "flush" in the optimizer.
In your code it looks like the optimizer know that
your read of TCNT1 only get changed with constants
so it try to keep the value.

 x -= TIME / 2; 
  if( x < 0 ) 

if I use -O1 the optimizer know that the flags
from the first line can be used directly and it
just make a BRGE.
If I add the extra ASM line it it will add a TST
so something get flushed (don't count on flags !!)

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

danni wrote:
I found a solution, but I have absolutely no clue, how it works. :?:

I insert an empty assembler statement:

That is quite odd. I can only assume that it is indeed flushing the state of the optimiser in some way. Nevertheless, if the length of this ISR is of critical concern, then hand-crafted assembler can save you more cycles (eliminating the push/pop or r1 and r2 as noted by me and "sternst" above), and (more importantly) relieve you of concern that a future compiler version might screw things up again.

CH
==