Advice needed on AVR GCC code optimization


I do not want to start a new 'compiler war', I just need some advice from experienced users of the GCC tools (WinAVR):

I have been a defender of the IAR compiler for a long time. The generated code, at least, is always very small AND fast. Support is ok, bugs are fixed only in the 'next release' 3-4 months later, annual maintenance fees are higher than for most of the other compilers .... we have heard all this before.
After all the good advice found in this forum regarding the GNU tools, I just downloaded the new AVR Studio and WinAVR. Installing is done with just a few mouseclicks, the compiler is very nicely integrated into AVR-Studio. After minutes I could compile and debug a first project. Very impressive!!!

But now I have some questions regarding the code generation. If I compile some of my old routines and compare the runtime to the IAR-generated code, the GCC code is much slower. I just tried some very special routines (two different square root routines, which both take about 60-80% longer) which have to be fast for me.

Example:

unsigned int IntSqrt(unsigned long w1)
{
  unsigned long i1;
  unsigned int k0,k1;

  k0 = 512;
  k1 = 0;
  while (k0>0)
  {
    i1 = k1 + k0;
    i1 = i1 * i1;
    if(i1<=w1)
      k1 = k1+k0;
    k0 = k0 >> 1;
  }
  return(k1);
}

unsigned int Sqrt_FL(unsigned long value)
{  unsigned int  PosCnt = 0x8000;
   unsigned int  RetVal = 0;
   unsigned long Tmp = 0;
   unsigned char Flag = 0;

   while(PosCnt)
   {
     Tmp = (Tmp&0x0000FFFF)|((unsigned long)PosCnt<<16);
     Tmp = (Tmp>>1)|((unsigned long)RetVal<<16);
     if(Flag || (Tmp<=value))
     {
        value  -= Tmp;
        RetVal |= PosCnt;
     }
     Flag   = !(!(value&0x80000000L));
     value  = value<<1;
     PosCnt = PosCnt>>1;
   }
   return RetVal;
}

void main(void)
{
  volatile unsigned long value;
  volatile unsigned int result;

  value = 123456789;

  while(1)
  {
     result = IntSqrt(value);
     result = Sqrt_FL(value);
     value++;
  }
}

runtime IAR (GCC):
IntSqrt: 402 (750) cycles
Sqrt_FL: 607 (900) cycles

This is quite a big difference for what seem to me to be quite straightforward routines.
I have tried the different optimization levels -O0 to -O3 and -Os. Is this what I have to expect from the gcc compiler, or am I doing something wrong?
Another detail:
I have to declare result as volatile, because at least IAR would otherwise remove the complete calculation. If I remove the volatile declaration of value, which serves no purpose, then the GCC code for Sqrt_FL takes ~70 cycles longer!?!?
Any advice would be greatly appreciated, because I am seriously looking into changing from IAR to gcc (on AVR, MSP430 and ARM).
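
For readers wondering why the volatile declarations matter in a benchmark like this, here is a minimal sketch of the idea. The names work, bench_sink and bench_once are made up for illustration, and work is only a stand-in for the routine being timed. Reading the input through a volatile object stops the compiler from folding the call into a constant, and storing the result into a volatile object stops it from discarding the call altogether.

```c
#include <stdint.h>

/* Hypothetical stand-in for the routine under test (not from the post). */
static uint16_t work(uint32_t x)
{
    return (uint16_t)((x * x) >> 16);
}

/* Volatile sink: every store to it must really be performed. */
static volatile uint16_t bench_sink;

uint16_t bench_once(uint32_t input)
{
    /* Volatile copy of the input: the value must be loaded at run time,
       so the compiler cannot evaluate work() at compile time. */
    volatile uint32_t in = input;

    bench_sink = work(in);
    return bench_sink;
}
```

The volatile accesses themselves cost a few cycles, so they are best kept outside the routine being measured, as in the main() above.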

Regards,
Jörg.


Hi Jörg

With the current WinAVR and an ATmega32 (-O1) I got:

IntSqrt: 746 cycles
Sqrt_FL: 944 cycles

=> Almost the same as your GCC numbers, so it seems the IAR compiler generates much faster code for this example!

I found that the unsigned long multiplication (i1 = i1 * i1) takes 50 cycles, which is pretty slow.

Regards, Peter


Has anybody tried this one with gcc 4.0.2?

Regards:

Uwe


Peter,
apart from the unsigned long multiplication, which is much slower, please have a look at the following code:
IAR:

    195               Tmp = (Tmp&0x0000FFFF)|((unsigned long)PosCnt<<16);
    196               Tmp = (Tmp>>1)|((unsigned long)RetVal<<16);
   \                     ??Sqrt_FL_0:
   \   00000016   01DF               MOVW    R27:R26, R31:R30
   \   00000018   01CA               MOVW    R25:R24, R21:R20
   \   0000001A   95B6               LSR     R27
   \   0000001C   95A7               ROR     R26
   \   0000001E   9597               ROR     R25
   \   00000020   9587               ROR     R24
   \   00000022   01B0               MOVW    R23:R22, R1:R0
   \   00000024   01AC               MOVW    R21:R20, R25:R24
   \   00000026   2B6A               OR      R22, R26
   \   00000028   2B7B               OR      R23, R27
    197               if(Flag || (Tmp<=value))

GCC:

28:            Tmp = (Tmp&0x0000FFFF)|((unsigned long)PosCnt<<16);
+00000070:   7040        ANDI    R20,0x00         Logical AND with immediate
+00000071:   7050        ANDI    R21,0x00         Logical AND with immediate
+00000072:   01CB        MOVW    R24,R22          Copy register pair
+00000073:   27AA        CLR     R26              Clear Register
+00000074:   27BB        CLR     R27              Clear Register
+00000075:   01DC        MOVW    R26,R24          Copy register pair
+00000076:   2799        CLR     R25              Clear Register
+00000077:   2788        CLR     R24              Clear Register
+00000078:   2B28        OR      R18,R24          Logical OR
+00000079:   2B39        OR      R19,R25          Logical OR
+0000007A:   2B4A        OR      R20,R26          Logical OR
+0000007B:   2B5B        OR      R21,R27          Logical OR
29:            Tmp = (Tmp>>1)|((unsigned long)RetVal<<16);
+0000007C:   9556        LSR     R21              Logical shift right
+0000007D:   9547        ROR     R20              Rotate right through carry
+0000007E:   9537        ROR     R19              Rotate right through carry
+0000007F:   9527        ROR     R18              Rotate right through carry
+00000080:   01C8        MOVW    R24,R16          Copy register pair
+00000081:   27AA        CLR     R26              Clear Register
+00000082:   27BB        CLR     R27              Clear Register
+00000083:   01DC        MOVW    R26,R24          Copy register pair
+00000084:   2799        CLR     R25              Clear Register
+00000085:   2788        CLR     R24              Clear Register
+00000086:   2B28        OR      R18,R24          Logical OR
+00000087:   2B39        OR      R19,R25          Logical OR
+00000088:   2B4A        OR      R20,R26          Logical OR
+00000089:   2B5B        OR      R21,R27          Logical OR
30:            if(Flag || (Tmp<=value))

My point is, if the gcc generates much slower code for these simple expressions, then I can expect the gcc code to be slower in nearly all situations. Is this assumption correct?
Maybe I can live with slower code; that is not the question at the moment. I just want to be sure about what I can expect when I change to gcc.

Regards,
Jörg.


> My point is, if the gcc generates much slower code for these simple
> expressions, then I can expect the gcc code to be slower in nearly
> all situations. Is this assumption correct?

No, this kind of generalization is never correct. The matter is far
too complex for that kind of assumption. Picking just a simple piece
of code as a ``benchmark'' simply does not work.

In the project I'm currently working on, we get between 10 and 20 %
better code with IAR, compared to GCC 3.4.4. The project currently
runs at about 30...40 KB of code size (optimized), so it can be
considered a non-trivial one I think.

Sure, there are weak points with GCC for the AVR (most of them due to
the fact the compiler has been designed for much larger processors),
but the guy who ported the butterfly application to GCC (Martin
Thomas) reported that he could even get tighter code for the butterfly
than the original IAR was (which kind of surprised me).

Occasionally, it can also be found that certain ways to write the code
work better than others. For example, GCC usually operates on `int'
types for all kinds of integer expressions, as this is what the
standard says. Well, it needs to behave as if the expression *were*
an int, but since for all the major GCC targets there's no difference
between these, almost no attempt is made to optimize the calculation
down to 8 bits from that.
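
A small sketch of what integer promotion means in practice (the function names are made up for illustration). Both operands are only 8 bits wide, yet the sum is evaluated as an int, so the carry out of bit 7 survives unless the result is explicitly narrowed again.

```c
#include <stdint.h>

/* a + b is evaluated as int, so the carry out of 8 bits is kept. */
int add_promoted(uint8_t a, uint8_t b)
{
    return a + b;
}

/* Converting the result back to uint8_t keeps only the low 8 bits. */
uint8_t add_narrow(uint8_t a, uint8_t b)
{
    return (uint8_t)(a + b);
}
```

With a = 200 and b = 100, add_promoted() yields 300 while add_narrow() yields 44. Whether a particular avr-gcc version then really emits 8-bit code for the narrowed variant is not guaranteed and should be checked in the generated listing.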

Jörg Wunsch

Please don't send me PMs, use email if you want to approach me personally.


Thanks Jörg,
this is exactly the kind of information I hoped to get. Would be nice to port one of our bigger projects to gcc syntax to get an even better idea, but this would take at least some hours due to different syntax, etc.

Best regards,
Jörg.


dl8dtl wrote:
For example, GCC usually operates on `int'
types for all kinds of integer expressions, as this is what the
standard says. Well, it needs to behave as if the expression *were*
an int, but as for all the major GCC targets there's no difference
between these, almost no attempt is made to optimize the calculation
down to 8 bits from that.

Apparently, there used to be. I am still using avr-gcc 3.0.3 for an ATtiny-based product because its successors take roughly 15% more flash to create and manipulate a lot of unused, zero high bytes. I would really like to use more current compilers, but that's a little hard when they need 15% more than the part has. :(

- John


Do any of the C compilers support bit variables like the Keil C compiler for the 8051 family does? (I know it's not standard C!).


Fried wrote:
Do any of the C compilers support bit variables like the Keil C compiler for the 8051 family does? (I know it's not standard C!).

Surely in all C compilers you can form a union with a struct containing a uchar and 1 bit fields and cast this onto a byte to access the bits individually?

(though it's true the ordering of the bits in a struct may not be portable to other compilers)
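
A minimal sketch of that union approach, assuming nothing beyond standard C. The type name flag_byte_t and the member names b0..b7 are invented for illustration. Which physical bit b0 ends up in is implementation-defined, so the code only relies on it being one single bit of the first byte.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical flag byte: one byte overlaid with eight 1-bit fields. */
typedef union {
    uint8_t byte;
    struct {
        unsigned int b0 : 1;
        unsigned int b1 : 1;
        unsigned int b2 : 1;
        unsigned int b3 : 1;
        unsigned int b4 : 1;
        unsigned int b5 : 1;
        unsigned int b6 : 1;
        unsigned int b7 : 1;
    } bits;
} flag_byte_t;

/* Sets the first bit-field and returns the byte as seen through the union. */
uint8_t set_first_flag(void)
{
    flag_byte_t f;

    memset(&f, 0, sizeof f);  /* clear the whole union, padding included */
    f.bits.b0 = 1;            /* write through the bit-field view        */
    return f.byte;            /* read back through the byte view         */
}
```

On compilers that place the first field in the least significant bit the function returns 1, but that ordering is exactly the non-portable part mentioned above.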

Cliff

Last Edited: Fri. Nov 18, 2005 - 02:07 PM

Codevision has a bit data type.

Randy


You may help the compiler by using a union. This is pretty optimized.

unsigned int Sqrt_FL(unsigned long value)
{ 
  unsigned int PosCnt = 0x8000;
  unsigned int RetVal = 0;
  union {
   unsigned long l;
   unsigned int w[2];
   unsigned char b[4];
  } Tmp;
  unsigned char Flag = 0;

  Tmp.l = 0L;
  while(PosCnt)
  {
   //Tmp = (Tmp&0x0000FFFF)|((unsigned long)PosCnt<<16);
   Tmp.w[1] = PosCnt;
   //Tmp = (Tmp>>1)|((unsigned long)RetVal<<16);
   Tmp.l >>= 1;
   Tmp.w[1] |= RetVal;
   if(Flag || (Tmp.l<=value))
   {
    value -= Tmp.l;
    RetVal |= PosCnt;
   }
   Flag = ((value&0x80000000L) != 0);
   value = value<<1;
   PosCnt = PosCnt>>1;
  }
  return RetVal;
}
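
To check that the union form really computes the same thing as the two commented-out lines, the step can be isolated as below. This is only a sketch: the helper names step_shift_mask and step_union are made up, fixed-width types replace unsigned int and unsigned long so it also builds on a PC, and the union variant assumes a little-endian target (as the AVR is), where w[1] is the upper 16 bits.

```c
#include <stdint.h>

/* Shift-and-mask form, as in the original Sqrt_FL loop body. */
uint32_t step_shift_mask(uint32_t tmp, uint16_t pos_cnt, uint16_t ret_val)
{
    tmp = (tmp & 0x0000FFFFUL) | ((uint32_t)pos_cnt << 16);
    tmp = (tmp >> 1) | ((uint32_t)ret_val << 16);
    return tmp;
}

/* Union form: w[1] is the high word only on a little-endian target. */
uint32_t step_union(uint32_t tmp, uint16_t pos_cnt, uint16_t ret_val)
{
    union {
        uint32_t l;
        uint16_t w[2];
    } t;

    t.l = tmp;
    t.w[1] = pos_cnt;   /* replace the upper 16 bits       */
    t.l >>= 1;          /* one 32-bit shift                */
    t.w[1] |= ret_val;  /* OR only into the upper 16 bits  */
    return t.l;
}
```

On a big-endian machine the two functions would disagree, which is the price of this kind of hand optimization.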

... the only thing you cannot unscramble is eggs...


Fried wrote:
Do any of the C compilers support bit variables like the Keil C compiler for the 8051 family does? (I know it's not standard C!).

clawson wrote:
Surely in all C compilers you can form a union with a struct containing a uchar and 1 bit fields and cast this onto a byte to access the bits individually?

Yes, you can even do this in Keil.
But you don't really have a single-bit variable; you just have a name for a bit-field within a larger variable.
This is usually implemented with (at least) byte-wide shift and/or mask operations (even Keil) - so it gains nothing. In some cases, it is actually better to write it yourself with standard 'C' bitwise operators! :(

What Keil (and other 8051 compilers) gives you is a 'C' language extension that gives you access to the single-bit instructions of the 8051.
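
For reference, "writing it yourself" with the standard operators could look like the sketch below. The helper names bit_set, bit_clear and bit_test are hypothetical, and n is assumed to be in the range 0..7.

```c
#include <stdint.h>

/* Return flags with bit n set. */
uint8_t bit_set(uint8_t flags, uint8_t n)
{
    return (uint8_t)(flags | (1u << n));
}

/* Return flags with bit n cleared. */
uint8_t bit_clear(uint8_t flags, uint8_t n)
{
    return (uint8_t)(flags & ~(1u << n));
}

/* Return 1 if bit n of flags is set, otherwise 0. */
uint8_t bit_test(uint8_t flags, uint8_t n)
{
    return (uint8_t)((flags >> n) & 1u);
}
```

Unlike the bit-field version, the bit position here is explicit, so the layout does not depend on the compiler.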

Top Tips:

  1. How to properly post source code - see: https://www.avrfreaks.net/comment... - also how to properly include images/pictures
  2. "Garbage" characters on a serial terminal are (almost?) invariably due to wrong baud rate - see: https://learn.sparkfun.com/tutorials/serial-communication
  3. Wrong baud rate is usually due to not running at the speed you thought; check by blinking a LED to see if you get the speed you expected
  4. Difference between a crystal, and a crystal oscillator: https://www.avrfreaks.net/comment...
  5. When your question is resolved, mark the solution: https://www.avrfreaks.net/comment...
  6. Beginner's "Getting Started" tips: https://www.avrfreaks.net/comment...

dl8dtl wrote:
GCC usually operates on `int' types for all kinds of integer expressions, as this is what the standard says ... almost no attempt is made to optimize the calculation down to 8 bits from that.

Since Keil's C51 has already been mentioned, I'll just note for info that it does provide an option to disable this "integer promotion".
It can make a significant difference!

Does IAR have a similar option?


> This is usually implemented with (at least) byte-wide shift and/or
> mask operations (even Keil) - so it gains nothing. In some cases, it
> is actually better to write it yourself with standard 'C' bitwise
> operators!

For AVR-GCC (to get back on-topic), there's no "bit"-organized memory
area in the processor anyway, but there are quite efficient bit access
methods to registers (and IO registers). GCC uses them whenever
possible in bit operations. So really, both the bitfield approach as
well as standard C bit manipulation operations all yield the same
efficiency in access, and compared to the rather slow (standard)
MCS51, they are probably even way faster anyway.

If Keil differs between standard bit operators and bitfields in the
efficiency of the generated code, I'd rather take that as an
indication of design problems inside that compiler. The way the C
programmer chose to express a certain goal should not make
a difference to the compiler's code generation, at least for obvious
cases like this one.

Jörg Wunsch


So guys, can we draw a conclusion?
avr-gcc: slower, or more flash consumption?

I find this information very interesting.


Pinkpanter wrote:
So guys, can we draw a conclusion?
avr-gcc: slower, or more flash consumption?

I find this information very interesting.

avr-gcc is free and has great support :wink:

/Bingo


> avr-gcc: slower, or more flash consumption?

Compared to what, please? For which problem?

Yes, apples are better than oranges.

Anyway, I found it really interesting that for the example I mentioned
above (where we are using both, IAR and GCC to compile a project), in
a really time-critical path of the project, the GCC code performed
quite a bit better than IAR's. Even when turning IAR's optimization
level from -z9 (best optimization for size, that's what we've been
using initially) to -s9 (best optimization for speed), while the
generated code got even smaller, it performed about the same as GCC
with -Os. Telling GCC to aggressively optimize for speed (-O3) made
the entire code about 10 % larger but also yet another 10 % faster.

Again, that's a very particular problem, and I don't claim any other
problem would get nearly similar results at all. However, it's a
real-world application I'm working on, not just any silly benchmark of
any kind.

Summary of that particular problem:

Compiler:          IAR EW 4.10A          AVR-GCC 3.4.4
optimization:      -z9        -s9        -Os        -O3

Code size of
entire project:    30 K       30 K       36 K       40 K

Response time of
a particularly
important code
path:              240 µs     200 µs     200 µs     170 µs

Jörg Wunsch


Jörg,
could you be more precise about what sort of C-code 'performs' better when compiled by GCC? Would be very informative.
I have tested quite a lot of different C-compilers on the AVR and up to now have _always_ found that the IAR-apple was 'better' than all the other oranges, or pears or ... (in terms of speed, sometimes also code size).
Ok, I hardly ever use any library code such as floating point, printf, etc. There may be other compilers with better libraries around.
I always use maximum speed optimization, because this mostly produces very compact code, too. This is at least true for small projects with little possibility for code reuse.

to wBoellmann:

Quote:

You may help the compiler by using a union. This is pretty optimized.

Yes, I know, but I really do not want to help the compiler. The compiler should help me instead :D . (I wish it would correct misspelled variable names, add forgotten semicolons and brackets....)

Jörg.


Well, ok, IAR is optimized for the AVR architecture and performs well.
Good for everyone who can easily spend a lot of money.

BUT:
avr-gcc is getting better and better.
I am really looking forward to the integration of the binutils patches from Björn Hasse into the avr-gcc distribution.

What I find important is to collect all code examples where gcc performs badly, to have a repository that people who want to optimize avr-gcc can look at (a good subject for a thesis at university).

If many people cooperate, the performance of gcc can be improved efficiently.

Something that would be of even higher importance to me is the integration of MISRA compliance tests into gcc (or as an independent add-on).

Regards:

Uwe


> could you be more precise about what sort of C-code 'performs'
> better when compiled by GCC? Would be very informative.

Unfortunately not. First, this code is (currently) proprietary,
second, it's a large state machine with several actions and quite long
and windy code paths, so it's impossible to ``just extract'' the part
in question as a showcase.

> ... up to now have _always_ found that the IAR-apple was 'better'
> than all the other oranges ...

Me too, which is why I was all the more surprised that this was not
the case for that project. We didn't do an in-depth analysis; we write
the code and compile it with GCC all the time, yet we take care to keep
it IAR compatible. At a particular point, we noticed there was a timing
problem, so we started optimizing all parts contributing to that. So
we eventually ended up improving the timing from an original ~ 800 µs
response time down to the ~ 200 µs we were supposed to reach. Only
then did we also verify IAR again. As the timing of the IAR-generated
code was still within our target window (though worse than GCC), we
didn't try to make compiler-dependent optimizations. (The
optimizations were real optimizations, not just fiddling slightly with
the code. The biggest saving was made by avoiding an overcautious
memset(), only 1 line of C code resulting in 5 lines of assembly code,
but that took 185 µs. We eventually had to review all code on the
critical path, to save 20 µs here and 15 µs there.)

Jörg Wunsch


I never use IAR so I'm not really sure, but from the C source posted, I see that the type "unsigned int" is used for 16-bit integer. I don't know if this is true for IAR, but it is not for GCC. According to ANSI C standard, an "int" is signed 32-bit integer. GCC is designed as a standard C compiler, so an "unsigned int" is the same as "unsigned long" in GCC. If IAR is not standard, it may treat "unsigned int" as "unsigned short". If this is the case, it may explain the big performance difference between IAR and GCC in the result of the benchmark....


> According to ANSI C standard, an "int" is signed 32-bit integer.

Your knowledge of the C standard appears to be not much more advanced
than your name is...

IOW, that statement is *completely* wrong.

Jörg Wunsch


As far as I know, "short" is *guaranteed* to be 16 bits, but int can be any size the compiler wants.

- Dean :twisted:

Make Atmel Studio better with my free extensions. Open source and feedback welcome!


AFAIK, Dean, that's not quite true either.

But I would suggest that questions regarding the C standard be put to the gurus over on the comp.lang.c newsgroup, where there are many more language lawyers hanging around.

On the AVR:
char = 8 bits
short = 16 bits
int = 16 bits
long = 32 bits
long long = 64 bits

But ideally, one should use the types that are found in <stdint.h>, because they provide standard integer types, just like the name says! This makes it much easier to specify and determine exactly what size and signedness is being declared.
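
As an illustration, here is the IntSqrt routine from the first post rewritten with explicit-width types. This is a sketch only: the name int_sqrt_fixed is made up, and the algorithm is unchanged. With these types the widths are the same on the AVR and on a PC, so the routine can be tested on either.

```c
#include <stdint.h>

/* Same bit-by-bit search as IntSqrt, but with explicit widths. */
uint16_t int_sqrt_fixed(uint32_t w1)
{
    uint16_t k0 = 512;  /* trial bit, starts at 2^9 */
    uint16_t k1 = 0;    /* result built up so far   */

    while (k0 > 0) {
        uint32_t trial = (uint32_t)k1 + k0;  /* candidate root, at most 1023 */

        if (trial * trial <= w1) {
            k1 = (uint16_t)(k1 + k0);        /* keep this bit */
        }
        k0 >>= 1;
    }
    return k1;
}
```

Note that, exactly as in the original routine, the starting bit of 512 means the result can never exceed 1023, so for the benchmark input 123456789 the function returns 1023 rather than the true root.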

@Uwe
Yes, I'm also looking forward to the work done by Björn. To that end, the next WinAVR release will include GCC 4.0.2, and yes, I *do* have it built for MinGW (i.e. a native Windows executable). I'm currently working on getting the other projects built, such as avrdude and Insight, plus I need to verify that all the patches for all the new devices are properly rolled in. I, too, would be interested in having a MISRA checker integrated into GCC. However, of course, it needs a volunteer.

And for everyone else, FYI, I can personally vouch for the example that Joerg Wunsch is describing. A lot does depend on the type of code that is being compiled. That's why one should be cautious when presented with comparison test data between two compilers. One should always be able to know *exactly* how the code was compiled and what settings were used, and have access to the source code that was used in the comparison, so that the tests can be reproduced independently.

Eric Weddington