Problem with float operations



Hi, I'm testing GCC float operations against IAR's.

Though I select the ATmega644, which has an enhanced core, the float arithmetic operations don't use MUL and MULS. This makes multiplication and addition about five times slower than with the IAR compiler, and division about two times slower.

Can someone tell me whether there is something wrong or incomplete in my test project, or whether GCC doesn't provide optimized float routines at all?

thanks

Attachment(s): 


Try using Mfile to generate a Makefile for it to ensure that all the best options are being used.

Cliff


Thank you, Cliff.
I think my Makefile is correct; the only option that tells the compiler to use the enhanced-core instructions seems to be -mmcu, as stated in the GCC 4.1.1 documentation.
This seems to work properly: if I select an MCU other than the ATmega644 and use SFRs present only in the ATmega644, I get error messages, whereas if I select the right MCU it compiles and runs without errors.
With -mmcu you actually select the MCU type, not the instruction set (avr1..avr5); the instruction set should follow automatically from the MCU type.
So I have to conclude that there are no optimized float routines that use the enhanced-core instructions...
...unless someone tells me I'm wrong (and why).
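
For reference, the relevant lines of my makefile look roughly like this (only a sketch, with WinAVR-style variable names assumed, not my full file):

# MCU selection -- the only place where the core/instruction set is chosen;
# GCC derives the avr5 instruction set (which includes MUL) from this.
MCU = atmega644

# Optimization level and compile flags
OPT = s
CFLAGS = -mmcu=$(MCU) -O$(OPT) -Wall

# -mmcu must also be passed at link time
LDFLAGS = -mmcu=$(MCU)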

Franco


GCC doesn't know about those sorts of operations. You will need to call an assembler routine if you want to use them in your program.

Leon

Leon Heller G1HSM


OK Leon.
I too was thinking this.
So it's time to write some assembler code.
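
Something along these lines, for example (just a sketch of an 8x8-bit multiply wrapper using avr-gcc inline assembly; mul8x8 is a name I'm making up here):

#include <stdint.h>

/* Sketch: 8x8 -> 16-bit unsigned multiply using the enhanced-core MUL
   instruction. MUL leaves its result in R1:R0, so the zero register R1
   has to be cleared again before returning to compiler-generated code. */
static inline uint16_t mul8x8(uint8_t a, uint8_t b)
{
    uint16_t result;
    __asm__ __volatile__(
        "mul %1, %2"  "\n\t"   /* R1:R0 = a * b               */
        "mov %A0, r0" "\n\t"   /* low byte of the product     */
        "mov %B0, r1" "\n\t"   /* high byte of the product    */
        "clr r1"               /* restore GCC's zero register */
        : "=&r" (result)
        : "r" (a), "r" (b)
        : "r0"
    );
    return result;
}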

Franco


I haven't taken a look at your makefile, but are you linking to the math library? (Use -lm in the link flags.) And are you including <math.h>?


Dear EW,
I have added -lm and included math.h.
I think those only affect the math functions, though, not plain float arithmetic.
Please run the attached AVR Studio project and you'll see the same slow timings as before.

Franco

Attachment(s): 


All my trouble was due to a couple of nasty mistakes:

1)
I didn't use
GENDEPFLAGS = -MMD -MP -MF .dep/$(@F).d
to generate dependencies (see the sketch after this list)

2)
Some filenames had capital letters instead of lowercase, so the compilation didn't run correctly.
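
For anyone curious, here is roughly how the stock WinAVR makefile wires those dependency flags in (a sketch; the variable names are assumed from that template):

# Dependency generation (WinAVR template style)
GENDEPFLAGS = -MMD -MP -MF .dep/$(@F).d

# The dependency flags ride along with every compile...
ALL_CFLAGS = -mmcu=$(MCU) -I. $(CFLAGS) $(GENDEPFLAGS)

# ...and the generated fragments are pulled back in near the end of the file
-include $(shell mkdir .dep 2>/dev/null) $(wildcard .dep/*)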

The bottom line: GNU floating point is about 40% faster than IAR's. A very good score.

bye,
Franco


fbm3ga wrote:
GNU floating point is about 40% faster than IAR's.

On one particular test.

Benchmarks? What are they good for? Absolutely nothing.


I wanted to see if GNU could use the enhanced-core instructions, and indeed it does use them when correctly invoked.

After comparing IAR and GNU with my little benchmark, I tried a more exhaustive test: porting an industrial controller based on the ATmega644 from IAR to GNU.
This application performs many different complicated calculations and, again, I got about +40% performance on average.

I'm happy with this.
Franco


Quote:

I got an average +40% performance.

I found this quite hard to believe, especially in light of past FP quick tests like this with various AVR C compilers. And IAR is the Big Dog in ultimate performance, right?

Indeed, a peek at Franco's source shows:

volatile double x,y,z,s,t;

I'll wager a cold one that IAR actually did 64-bit "double" operations, and GCC did 32-bit "float" operations.

Lee

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.


Hi Lee, you are welcome!
I didn't expect my statement would trigger any replies.
My simple benchmarks had the sole purpose of verifying whether GCC could use the enhanced-core instructions and how well it performed on elementary floating-point operations.
IAR actually doesn't support 64-bit math, and both IAR and GCC consider "double = float".
My tests tried to cover a wide range of mantissas to get an average cycle count. Looking at the disassembly window in AVR Studio, you can see that IAR doesn't use the MUL instruction. In any case, GCC has better overall register and jump management, which leads to my positive rating of GCC.

The attachment contains two AVR Studio projects with the same test, compiled with IAR and GCC. If you have AVR Studio you can check the results for yourself. This is only to back up what I wrote, without any claim that GCC is better than IAR.

Franco

Attachment(s): 


fbm3ga wrote:
IAR actually doesn't support 64-bit math and both IAR and GCC consider "double=float".

Except that is not entirely true, is it? (In fact, this is one of the things that supposedly makes IAR worth $3,000!) From the IAR compiler reference manual (page 10 in my eval installation):
Quote:
SIZE OF DOUBLE FLOATING-POINT TYPE
Floating-point values are represented by 32- and 64-bit numbers in standard IEEE754 format. By enabling the compiler option --64bit_doubles, you can choose whether data declared as double should be represented with 32 bits or 64 bits. The data type float is always represented using 32 bits.

Presumably, if you aren't specifying --64bit_doubles (and it's not a default in one of the menus, perhaps?), then IAR is just using 32 bits; but, like Lee, I simply can't believe that 40% figure. If anything (and again this is why it costs $3,000), IAR is usually going to produce slightly more optimal code. If you really see a 40% hit with IAR, then I'd take advantage of some of the $3,000 worth of support you paid for and ask them to investigate.
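
If in doubt about which size you actually got, a quick compile-time check is something like this (just a sketch; the array size goes negative, and compilation fails, unless double really is 4 bytes on the target):

/* Fails to compile if double is not 32 bits on the target. */
typedef char double_is_32_bits[(sizeof(double) == 4) ? 1 : -1];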

Cliff


Sorry, my reply was wrong: I deliberately didn't specify --64bit_doubles because I don't need 64-bit doubles.
We have had many other problems with IAR, big and small, and have asked them to investigate.

Franco


Quote:

I consciously didn't specify --64bit_doubles because I don't need them.

Instead of responding like that, why not simply declare the variables as "float" and re-check? Then you are assured of comparing apples to apples.
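
Something like this, for example (only a sketch; the variable names follow the snippet quoted earlier in the thread):

volatile float x, y, z, s, t;   /* "float", not "double" */

int main(void)
{
    x = 1.2345f;
    y = 6.7890f;

    z = x * y;    /* the primitives under test */
    s = x + y;
    t = x / y;

    for (;;)
        ;         /* stop here and read the cycle counter in the simulator */
}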

Lee

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.


This is exactly what I did in the two projects that you can see in my last attachment.
At this point (after inspecting the disassembled code from the library cl3s-ec.r90) I have to conclude that, in this version of the compiler, IAR uses enhanced-core instructions but does NOT use the hardware multiplier, carrying out multiplications with shift-and-add instructions only.
I also tried 64-bit multiplications (IAR only); they are about three times slower than the 32-bit ones.

Franco


This seems almost unbelievable when IAR is known as the "gold standard" to which all other compilers aspire. Are you sure the project was being built for the right device, so that the compiler would know that the hardware multiply instructions were available for use?


I agree with you. In the compiler options I select the enhanced core and processor version -v3, which should define a core with a hardware multiplier (and thus enable its use).
I then made a second test selecting the ATmega128, and again there was no hardware multiplier usage.

This performance loss is not that important for my purpose so I won't investigate further.

Franco


Quote:

inspecting the disassembled code from library cl3s-ec.r90

I may be completely off on this one, but IIRC IAR ships with two C libraries, a "legacy" CLIB and a newer one called DLIB. The filename suggests that the old one was used.

What I do remember more clearly from my IAR tryout is that there are separate versions of these libraries for multiplierless chips with the suffix -nomul.


Thank you for this hint.
The "Lib" folder contains 116 libraries for the various "flavor" combinations (memory model, processor type, enhanced core, ELPM, 64-bit doubles) so, for the small memory model (up to 64k code) there are:
CL3S-64.R90
CL3S-EC.R90
CL3S-EC-.R90
CL3S-SF.R90
CL3S-SF-.R90

but with the processor model -v3 you can use only the first two libs. As you suggested, I also found DL3S*.* and felt obliged to give them a chance. Unfortunately the timings (number of clock cycles) are exactly the same, so I would argue that the HW multiplier is not on the guest list. Please note that this compiler dates back to 2001, so maybe newer versions will also satisfy the floating-point enthusiasts.

Franco


I checked another machine with a reasonably new IAR version (4.something) installed. Almost all of the libs in both clib and dlib folders had either a _mul or a _nomul suffix.


Quote:

Please note that this compiler dates back to 2001...

Hmmm--IIRC there were few if any AVR models >>with<< a hardware multiplier in 2001. AT90S4433 and AT90S8535 and smaller certainly did not have it. ATmega103? I don't think so, but I never used the beastie.

So, summarizing the thread in the light of this new information: GCC was 3x to 5x slower than IAR. Then GCC was re-configured and magically became 40% faster. Then the reason was "IAR doesn't use the MUL instruction". Now we find out that when the compiler was written there WAS no MUL instruction in AVRs. You should consider a career move doing benchmarks.

Lee

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.


theusch wrote:
So, summarizing the thread in the light of this new information: GCC was 3x to 5x slower than IAR. Then GCC was re-configured and magically became 40% faster.

Lee


I think I can offer a few possible explanations for that one.

By default, avr-gcc will use GCC's built-in generic, written-in-C, multi-platform, incomplete floating point library. In the context of an AVR, it is a useless anachronism, but it is an intrinsic part of using this multi-platform compiler.

An alternative floating point library is included as part of the separate yet intimately linked package called avr-libc. You have to specifically instruct avr-gcc's linker to ignore GCC's built-in floating point routines and use the routines from avr-libc instead. The default makefile provided as part of WinAVR already contains that instruction; the makefile generated by AVR Studio mysteriously does not.
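
(For the record, the relevant linker lines in the stock WinAVR makefile look roughly like this; a sketch from memory, with the variable names taken from that template:)

# libm.a from avr-libc provides the optimized __mulsf3/__addsf3/__divsf3
# and friends; putting it on the link line makes the linker prefer those
# over GCC's built-in routines.
MATH_LIB = -lm

LDFLAGS = -Wl,-Map=$(TARGET).map,--cref
LDFLAGS += $(MATH_LIB)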

(Everybody who uses avr-gcc is already using avr-libc whenever they pull in the part-specific I/O header files, or call functions like printf(), etc for which GCC has no intrinsic implementation, but they often don't realize that there's a distinction between the compiler and the library.)

Apparently, one of the misconfigurations in the makefile noted by the OP (most likely this one):

Quote:
2)
Some filenames had capital letters instead of lowercase so the compilation didn't run correctly.

was preventing the make utility from actually doing its job, so the revised version of the OP's benchmark (the one which actually used avr-libc's math library) wasn't actually getting built at first.

That would have left the previous hex files (the ones from the previous test, in which GCC's default floating point implementation was being used unintentionally) unchanged, making the OP think that linking in the math library wasn't helping at all.

(Let that serve as a lesson to everybody: never ignore any warnings or errors spouted out to you in the course of compiling and linking your code, no matter how benign they may appear.)

Previous comparisons that you (Lee) have cited were performed well over a year ago.

I thought I heard some rumblings on this forum about a recent re-write of avr-libc's floating point library earlier this year. A scan of avr-libc's changelog turns up this:

Quote:
2007-01-14 Dmitry Xmelkov

* bootstrap: Version 2.60 for autoconf is added.

New version of math library:
* libm/fplib/{dtostre.S,dtostrf.S,fp_cosinus.S,fp_flashconst.S,
fp_inverse.S,fp_m_inf.S,fp_merge.S,fp_p_inf.S,fp_powerseries.S,
fp_split.S,fplib.inc,isinfnan.S,readme.dtostre,readme.fplib,
readme.strtod,strtod.S}: Removed.
* libm/fplib/{asmdef.h,copysign.S,fdim.S,fixsfdi.S,floatdisf.S,
floatunsdisf.S,fma.S,fmax.S,fmin.S,fp32def.h,fp_arccos.S,fp_inf.S,
fp_mintl.S,fp_mpack.S,fp_negdi.S,fp_norm2.S,fp_powser.S, fp_powsodd.S,
fp_pscA.S,fp_pscB.S,fp_rempio2.S,fp_round.S,fp_sinus.S,fp_split3.S,
fp_trunc.S,hypot.S,inverse.S,isfinite.S,isinf.S,isnan.S,ntz.h,
signbit.S,trunc.S}: New files.
* libm/fplib/{Files.am,acos.S,addsf3.S,addsf3x.S,asin.S,atan.S,
atan2.S,ceil.S,cos.S,cosh.S,divsf3.S,divsf3x.S,exp.S,fixsfsi.S,
floatsisf.S,floor.S,fmod.S,fp_cmp.S,fp_nan.S,fp_zero.S,frexp.S,
ldexp.S,log.S,log10.S,modf.S,mulsf3.S,mulsf3x.S,negsf2.S,pow.S,
sin.S,sinh.S,sqrt.S,square.S,tan.S,tanh.S}: Replaced.
* libc/stdlib/{atof.S,dtoa_conv.h,dtoa_prf.c,dtostre.c,dtostrf.c,
ftoa_engine.S,ftoa_engine.h,strtod.c}: New files.
* libc/stdlib/Files.am: A set of new files added.
* include/math.h: A set of new functions added.
* include/stdlib.h (dtostrf): doc is corrected

So, indeed, the floating point library that most avr-gcc users are linking against has recently been revamped, and it is entirely possible that things have sped up as a result.

However, I find it very hard to believe that it ended up being 40% faster overall than IAR's current (performance optimized DLIB) implementation.

I think you hit the nail on the head with the assertion that the OP is using a version of IAR which is too old to allow for a fair comparison. (I'm fairly certain that none of the AVRs available in 2001 supported the MUL instruction. Certainly the ATmega103 did not.)

- Luke


Quote:

Then GCC was re-configured and magically became 40% faster.

Lee

I think I can offer a few possible explanations for that one. ...

Bad wording on my part. Drop the magic; then something like "Then GCC was properly configured and became 40% faster."

Quote:

Previous comparisons that you (Lee) have cited were performed well over a year ago.

I don't see that I cited anything (dug up links to past threads, etc.) in this thread. And I'm not stirring up a Compiler War. I'd just be VERY surprised if the float +-*/ primitives, such as those used in the OP's test program, varied in performance by more than single-digit percentages for the mainstream/mature AVR C compilers. Yeah, an extensive test suite might end up with BrandX being 7% faster than the average, BrandY 4% larger, and BrandZ might end up the "best". But 40% is VERY hard to swallow, if properly targeted and configured. Perhaps one brand might have a whiz-bang clever asin() or whatever routine, but I wouldn't expect the primitive arithmetic routines to be that much different in speed and/or size.

Lee

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.


My IAR compiler 2.26 lets you select devices such as the ATmega8 and ATmega16, which have a HW multiplier, so it seems Lee is wrong when he says there WAS no MUL in those days.

A quick search gave me this result
http://www.maxim-ic.com/appnotes...
It is rather interesting; referring to the 4.10 version, it says:

Quote:
The ATmega8 was selected for study because the current IAR compiler could generate code using the hardware multiplier for this microcontroller. The IAR compiler could not do so for the other AVR devices like the ATmega64 or ATmega128.

Then I recompiled my test with the ATmega8 option, and the results were unchanged (no MUL).
Maybe IAR wrote only a general multi-platform floating-point library and later added optimizations for some devices.

Franco


If it lets you select Mega16 it is newer than 2001. Mega163, perhaps sometime in 2001. While IAR will obviously get chip samples early, common folk couldn't hold a Mega8 in their hands until 2002.

Anyway, when IAR chose to release that to customers, and in which version, is history. The point is that you are "benchmarking" a very old generation of IAR against a modern version of GCC. The cycle counts in your comments are quite impressive for GCC: 160 cycles per FP multiply, for example.

Lee

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.


Well, I just downloaded the latest IAR EW Kickstart (because the one I had previously did not "know" the 644 and 644P), took your code from the first post, configured a project for the mega644, and left all other settings at default (except that I set the linker to generate a secondary output file in Intel extended hex format). Then I loaded the .hex file it built into AVR Studio and looked at the addresses that the .map file identifies as being:

  LIBRARY MODULE, NAME : ?S_EC_MUL_L02

  SEGMENTS IN THE MODULE
  ======================
CODE
  Relative segment, address: CODE 00000712 - 00000721 (0x10 bytes), align: 1
  Segment part 0.
           ENTRY                   ADDRESS         REF BY
           =====                   =======         ======
           ?S_EC_MUL_L02           00000712        main (main)

then I find this (remember that 712 above is a byte address and the following are word addresses):

+00000389:   9F14        MUL     R17,R20          
+0000038A:   2D10        MOV     R17,R0           
+0000038B:   9F05        MUL     R16,R21          
+0000038C:   0D10        ADD     R17,R0           
+0000038D:   9F04        MUL     R16,R20          
+0000038E:   2D00        MOV     R16,R0           
+0000038F:   0D11        ADD     R17,R1           
+00000390:   9508        RET

I have to say that I'm pretty astonished that someone on the cusp of 2007/2008 is using a tool dating from 2001; how you could possibly hope to make a meaningful comparison between that and a 2007 tool, I simply haven't the first idea.

Bottom line is that the current IAR uses hardware MUL instructions.

As the Kickstart versions will generate up to 4K of code, and your example program fits, I am at a total loss to understand why you don't download the 2007 version and make your comparisons using that.

But at least you managed to waste a few hours of my fellow posters' time here and keep us entertained for a while!