Time taken for a mathematical operation


Hi!!
I have a question about the time taken to execute a math operation such as sine.
Suppose I use code like this:
//the code
y=sin(x);
//where y and x are of type float

How many clock cycles will it take if I set the clock to 1 MHz?
I am using AVR GCC to program the micro.
I am a newbie and will be very glad of your help!

Use the simulator.

Leon Heller G1HSM

leon_heller wrote:
Use the simulator.

He has to write the code before he can use the simulator.

"I may make you feel but I can't make you think" - Jethro Tull - Thick As A Brick

"void transmigratus(void) {transmigratus();} // recursio infinitus" - larryvc

"It's much more practical to rely on the processing powers of the real debugger, i.e. the one between the keyboard and chair." - JW wek3

"When you arise in the morning think of what a privilege it is to be alive: to breathe, to think, to enjoy, to love." -  Marcus Aurelius

simulator in the sense-AVR studio?

Of course!

Leon Heller G1HSM

I know that search is a bit bad in this forum, but somewhere there are some numbers on how fast the different C compilers are for this kind of thing.
I think that sin(x) takes about 1000-2000 clk (for a mega AVR, and way slower on a tiny).
Be aware that the speed depends on the value of x.

Thanks for the help!!

The fact is that the sin() algorithm is not a trivial one. Lookup tables are not used, so the sine must be calculated
somehow. One solution is to evaluate the Taylor series for sin(x),
i.e. sin(x) = x - (x^3)/3! + (x^5)/5! - (x^7)/7! and so forth until the required accuracy is reached. Other trig functions can be calculated from the trigonometric identities. Sorry, no magic, it takes time!
To find out how long it takes, just use the C maths libraries and do the simulation using AVRStudio as suggested.
If you are only interested in a few sines or less accuracy, you could use a lookup table followed by some linear interpolation.

Charles Darwin, Lord Kelvin & Murphy are always lurking about!
Lee -.-
Riddle me this...How did the serpent move around before the fall?

In this:

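```
#include <avr/io.h>
#include <math.h>

float f;

int main(void) {
    /* Reading PINB stops the compiler folding sin() at compile time.
       Note that PINB / 180 is an integer division. */
    f = sin(PINB / 180 * 3.1415926);
}
```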

The call to sin() alone (not the parameter setup) takes 1012 cycles.

I had to use a complicated parameter because even if I use:

```   f = sin(45.0 / 180 * 3.1415926);
```

the GCC optimiser is so smart it calculates the entire thing at compile time and just generates code to put the fixed result into 'f'.

What math lib and compiler settings are you using?
With -Os and the standard lib (I never use float) the sine takes about 24000 clk!
(AS4 simulator2 with a mega128 project.)

Somehow I am always thinking about code optimization during development. There is always a trade-off: if you need speed you lose space (hence the look-up table), and vice versa. If the look-up table is bigger than the trigonometry algorithm, then don't use a look-up table; and if you still need more processing speed, you can choose a faster controller.

KISS - Keep It Simple Stupid!

If your application has strict timing requirement, beware that some implementations of math routines have non-deterministic timing; there might be inputs where the execution time is much, much longer than normal.

An article on DIY trig functions:
http://www.ganssle.com/articles/atrig.htm

The choice here is whether to use float at all; sin() is small if float mul is already used, but the program clawson shows is about 1600 bytes!
Here are some numbers:
sin(0) 33 clk
sin(0.1) 3974clk
sin(1) 4135clk
sin(2) 4176clk
sin(3) 3962clk
sin(100) 4605clk

edit:
Now I know why the numbers are high: sin() (that is, the float mul) doesn't use the HW multiplier, so these are the numbers for a tiny AVR!
I will find out why!

Quote:
the GCC optimiser is so smart it calculates the entire thing at compile time

Is it? How does it know the sin() is a trigonometric function, and knows the code for it if the sin() code (definitely float*float) was written for AVR in assembler?

It seems to me AVR GCC needs to have two codes for sin() to do that. The first one is an AVR implementation (mul, mulsu and such) and the second one is a host implementation to calculate sin(pi/34) and substitute a constant value into the AVR's hex.

Or perhaps it has only one implementation and the compiler itself loops through AVR asm code calculating/simulating it?

By the way, I know it is off-forum, but it is on-topic.

No RSTDISBL, no fun!

Quote:
It seems to me AVR GCC needs to have two codes for sin() to do that. The first one is an AVR implementation (mul, mulsu and such) and the second one is a host implementation to calculate sin(pi/34) and substitute a constant value into the AVR's hex.

Yes. Why not? It does that with other numeric computations, so why not with trig?

Quote:
How does it know the sin() is a trigonometric function

By looking at its name?

As of January 15, 2018, Site fix-up work has begun! Now do your part and report any bugs or deficiencies here

No guarantees, but if we don't report problems they won't get much of  a chance to be fixed! Details/discussions at link given just above.

"Some questions have no answers."[C Baird] "There comes a point where the spoon-feeding has to stop and the independent thinking has to start." [C Lawson] "There are always ways to disagree, without being disagreeable."[E Weddington] "Words represent concepts. Use the wrong words, communicate the wrong concept." [J Morin] "Persistence only goes so far if you set yourself up for failure." [Kartman]

vageesh wrote:
how many clock cycles will it take if i set the clock as 1 MHz?
... the same number as for all clock frequencies ...

Ross McKenzie ValuSoft Melbourne Australia

Quote:

and the standard lib

BIG MISTAKE! You must use libm.a with avr-gcc if you use <math.h>!
Quote:
but the program clawson show is about 1600 byte

Not when libm.a is used:

```
Size after:
AVR Memory Usage
----------------
Device: atmega168

Program:    1202 bytes (7.3% Full)
```

Quote:
It seems to me AVR GCC needs to have two codes for sin() to do that. The first one is an AVR implementation (mul, mulsu and such) and the second one is a host implementation to calculate sin(pi/34) and substitute a constant value into the AVR's hex.

It clearly does:

```
00000090 <main>:
#include <math.h>

float f;

int main(void) {
f = sin(45.0 / 180 * 3.1415926);
90:	83 ef       	ldi	r24, 0xF3	; 243
92:	94 e0       	ldi	r25, 0x04	; 4
94:	a5 e3       	ldi	r26, 0x35	; 53
96:	bf e3       	ldi	r27, 0x3F	; 63
98:	80 93 00 01 	sts	0x0100, r24
9c:	90 93 01 01 	sts	0x0101, r25
a0:	a0 93 02 01 	sts	0x0102, r26
a4:	b0 93 03 01 	sts	0x0103, r27
```

Obviously the sin() in this case is done by the 8086 program using 8086 library code - that leads to a numeric constant then the AVR code generator just generates code to load this constant.

Last Edited: Sat. Jul 2, 2011 - 02:58 PM

Several decades ago I wondered how long a routine took. My solution was to call the routine 10 times in a loop. I figure that divides the loop overhead by 10. Then you run the loop 10,000 or 100,000 times and you can time it with your watch. I assume you are a Young Guy and you might not have ever seen one of those obsolete gizmos, but engineers over 50 used to all have one on their wrist at all times during the day to see the time at a glance.
OK, I looked up my AVR results... at 18.432MHz, I get 60,000 fp mults or adds a sec and 2000 sins a sec. Divide that by your clk rate. I'll post the prog if anyone wants to see it.

Imagecraft compiler user

Quote:

and 2000 sins a sec

That is 9,216 cycles. GCC did the call to sin() in the following in 1,629:

```
#include <avr/io.h>
#include <math.h>

float f;

int main(void) {
    PORTB = 39;
    f = sin(PORTB / 180.0 * 3.1415926);
}
```

(again I had to use PORTB to prevent the aggressive optimiser)

(Did I mention the other day I had some reservations about ICC?)

Except I don't believe any extrapolated results from a simulator. If you want to run a couple 1000 sins and actually measure the time, I'll believe that. I didn't know you had to reserve a copy of the Imagecraft compiler. I always got my copy right when I ordered it. Maybe because I'm such a Good Customer?

Imagecraft compiler user

Quote:

Except I don't believe any extrapolated results from a simulator. If you want to run a couple 1000 sins

I'd trust the simulator in AVR Studio 4 in this respect. I've never seen it mis-calculate clock cycles.

I agree that we need to run many calls to sin, mainly because of what Jayjay argued. Some trig implementations have varying execution times depending on the argument passed. It just might be that 39 degrees is a case fitting avr-libc extremely well.

If we want to measure the time for the trig only, and not get other stuff (i.e. the test-bench involved) then the simulator is an excellent instrument. Running a test e.g. 10K times on a real chip and measure by looking at a LED and your wristwatch makes it very hard to discern between the time for the 10K loop and the trig call proper. And the discussion will

How about whole degrees, every degree, from 0 to 180? Span of time-measuring is from point of call to point of return? Most meaningful unit to report results in is clock cycles? Data is a table with two columns: Degree value and clock cycles?

I'll smack it into Excel or OO Calc and make a diagram.


Bob,

This took 13s on a mega16 in an STK500 clocked using the STK500 at 3.6864MHz:

```
#include <avr/io.h>
#include <math.h>

float f;

int main(void) {
    uint16_t i;
    DDRC = 0xFF;
    for (i=0; i<20000; i++) {
        PORTB = i;
        f = sin(PORTB / 180.0 * 3.1415926);
    }
    PORTC = 0xFF;
    while(1);
}
```

(the PORTC lights went out at the end of the period). Normalised for 1MHz that would be 47.9s. At 18.432MHz it would have been 2.6s

Note that I'm involving PORTB to force the sin() to be called. It also means that the /180 and *3.1415926 are being done each time too (or rather that will be one division by 57.3 each time), before the sin(). I used the iterator as the input value to get over the "39 happens to be a good value" effect (if present).

EDIT: thinking about it I don't need PORTB or even the conversion from degrees to radians. This works:

```
#include <avr/io.h>
#include <math.h>

float f;

int main(void) {
    uint16_t i;
    DDRC = 0xFF;
    for (i=0; i<20000; i++) {
        f = sin(i);
    }
    PORTC = 0xFF;
    while(1);
}
```

That completes in 6.1s. So 22.48s at 1MHz or 1.22s at your 18.432MHz
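
The arithmetic behind these normalised figures can be sketched as below. The helper names are hypothetical. The per-call figure still includes the loop overhead and the integer-to-float conversion of `i`, so treat it as an upper bound on sin() itself.

```c
#include <stdint.h>

/* Average CPU cycles per call, from a stopwatch time for `calls` iterations. */
double cycles_per_call(double seconds, double f_cpu_hz, uint32_t calls)
{
    return seconds * f_cpu_hz / (double)calls;
}

/* Rescale a run time measured at one clock rate to another clock rate.
   The cycle count is unchanged; only the time per cycle differs. */
double seconds_at(double seconds_measured, double f_measured_hz, double f_target_hz)
{
    return seconds_measured * f_measured_hz / f_target_hz;
}
```

With the 6.1s run at 3.6864MHz over 20000 calls, `cycles_per_call` gives roughly 1124 cycles, and `seconds_at` reproduces the 22.48s figure for 1MHz.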

I put the call 10 times in the loop to divide out the loop overhead. Your result was 819 per sec. Did I quote 2000 per sec? Ready to reserve your own copy? Nyuk Nyuk.

Imagecraft compiler user

Last Edited: Sat. Jul 2, 2011 - 03:46 PM

Quote:

I put the call 10 times in the loop to divide out the loop overhead.

Pointless: it's about 10 cycles amongst 1000 or more.

EDIT but OK I tried:

```   for (i=0; i<20000; i++) {
f = sin(i);
f = sin(i);
f = sin(i);
f = sin(i);
f = sin(i);
f = sin(i);
f = sin(i);
f = sin(i);
f = sin(i);
f = sin(i);
}
```

Sadly the overall time was the same. I'm afraid it's a constant battle against the GCC optimiser: it obviously recognises there's no point doing it 10 times as the result is only used once. Perhaps if I make 'f' volatile - but that's going to introduce more overhead than the loop iteration...

Here's the version that times the loop overhead, then subtracts it out after adding the sins. I use tabs=2 and it lines up nice.

Attachment(s):

Imagecraft compiler user

I tried this:

```   for (i=0; i<20000; i++) {
asm(""::);
}
```

(the central statement generates no code but leads to the loop being generated).

Sadly the led was on for so short a time I couldn't measure it - a fraction of a second. So I tried:

```   uint32_t i;
for (i=0; i<2000000; i++) {
asm(""::);
}
```

That took 6 seconds. So the loop overhead in the earlier 20,000-iteration test was 1/100th of this, that is 0.06s. In fact the loop above uses the less efficient uint32_t, so the earlier uint16_t loop is quicker still. Like I said, trivial.

Quote:

Except I don't believe any extrapolated results from a simulator.

Bob, we've gone round and round on this. With real numbers. Comparing compilers. ImageCraft did not come out well. When the same topic was revisted some months later, your results magically got 10x faster. When challenged on this, there was no reply.

Now, this was some years back, and there is a note from ImageCraft on a FP re-do. But read all of both threads and then, Bob, tell us which steroids you have been feeding your compiler.

IIRC "fpbench" was re-visited again, with GCC numbers as well. Searching... Can't seem to find the thread I was looking for, but this one has links to a lot of prior discussions:
https://www.avrfreaks.net/index.p...

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.

Meta-meta: I find it interesting that you determine the loop overhead for an empty loop and then assume that the compiler will produce the same loop code (or at least the same overhead) when you have stuff in the loop.

Did you verify this (e.g. by looking at the generated code)?


Yeah But LT... doesn't it make sense to run the tests every decade or so to keep the compiler writers honest? The Real Interesting thing I remember about the argument was that a bunch of y=mx+b calcs using longs was 6x faster than the same calcs using floats on the AVR, but only 2x faster on my android (ARM so it has 32 bit regs but no hw fp)

Imagecraft compiler user

list file for above c file I think....

Attachment(s):

Imagecraft compiler user

So with so much time taken for computation, is the following idea ok?:
Create an array holding values of the sine function over a single cycle at a certain resolution, then use those values with some interpolation.
For me memory is not an issue.

Read the posts. It depends on how fast you need it.

If your design is a solar tracker and you need one sin(x) every 10 minutes, then your LUT is not ok. Just use sin(x) and do not worry about the time.

But if you tried your algorithm in a simulator and from it you get AVR must be clocked at 321MHz, then you can consider a LUT (or better an ARM with FPU).

No RSTDISBL, no fun!

Well I've taken Bob's rats nest of a code scattered with literal constants, use of compiler specific functionality and assumptions about processor speed and tried to make it work with GCC but failed. Hopefully someone else has more patience than I. The point I was actually going to make was that my compiler was going to throw away most of the pointless code in the loops so I would get near 0 timing on many of the loops. But I'm afraid I can't make it that far.

Does anyone have a sensible benchmark that is cross-platform we could all try?

Quote:

Yeah But LT... doesn't it make sense to run the tests every decade or so to keep the compiler writers honest?

Certainly. But I still want to know why your code mysteriously got 2x faster. You never chose to address that. You choose not to address it now. Which number that you are spewing, based on your watch rather than counting cycles, are you touting now?
Quote:

OK, I looked up my AVR results... at 18.432MHz, I get 60,000 fp mults or adds a sec and 2000 sins a sec. Divide that by your clk rate. I'll post the prog if anyone wants to see it.

29-July-2004 https://www.avrfreaks.net/index.p... You claimed 985 sin()/second. The thread was nebulous about the AVR clock speed, but the code indicates 14.7MHz. That would make about 1234 at 18.4MHz.

17-October-2006 https://www.avrfreaks.net/index.p... You claimed 2023 sin()/second. Twice that of the previous run, but an unspecified clock speed.

So the steroids have made your test complete in half the time. (If you really do care

Quote:

to run the tests every decade or so to keep the compiler writers honest

then why haven't you given current numbers rather than going back to the numbers from the 2006 run?

Note that if you re-run and post the test results again, I'm guessing the numbers are still going to be a fraction of those for CV, unless ImageCraft did indeed rework FP primitives and library. I suspect that is why you don't want to take this up. But in that case, the dare/challenge is confusing.

Yes, Bob, rule-of-thumb guidelines as well as "rough mental math" to check sanity is very useful. But don't come and spew supposedly hard numbers without something to back it up and then post inconsistent results.

Lee


The code is written for a mega32 isn't it - how can you run that at 18.4MHz anyway?

(Personally I'm happy running benchmarks at 3.6864MHz just because it is baud friendly - at the end of the day they have to be normalised to 1MHz anyway, or perhaps better is simply to calculate CPU cycles - you can't put a double speed crystal on the AVR and suddenly claim my compiler is twice as efficient, you know!)

Perhaps you needed to use a high F_CPU to reduce the timer interrupt overhead?

Do we all even agree that the following from Bob's code (ignore the whacky timing mechanisms) are the valid things to test anyway?

```  while(c != 'q'){
n=1000;
t1=gettics();
for(i=0; i1 meg)\n",n);
t1=gettics();
for(i=0; i < n; i++){
memcpy(buf2,buf1,sizeof(buf1)); //took 5 sec for 10 megs... about 2 megs/sec
}
t2=gettics();
dt=t2-t1; //5usec tics
secs=dt*secpertic;
flops=n/(8*secs); //K/sec
//  	dumpstack();
printf("%lu tics  %#8.6f secs %#7.1f K/sec\n",dt,secs,flops);
```

If so then forget seconds - we need a solution that will work across compilers to count cycles in each.

(For me the obvious solution is simulator 2 in AS4 - YMMV)

Itchy trigger finger Cliff. The upcoming spitfire flight is getting to you!


Quote:

If so then forget seconds - we need a solution that will work across compilers to count cycles in each.

(For me the obvious solution is simulator 2 in AS4 - YMMV)

Quote:

Itchy trigger finger Cliff. The upcoming spitfire flight is getting to you!

As that is (probably) what I used in the past go-round, I'll oblige. Now, Cliff--does Bob's Dinosaur-Of_Choice Mega32 have Sim2 support?

So, we need to agree on a target AVR model. I don't THINK it will make any difference Mega128-Mega32-Mega8-Mega88-Mega324-Mega640, but let's pick one.

And since Cliff and I want to abstract from AVR clock rate, let's abstract to operations per second per Megahertz. That means with simulator ticks choosing a 1MHz clock means the simulator gives us our numbers almost directly.

So, Cliff, pick an AVR model supported by Sim2. IIRC the latest 'Studio I have loaded is 4.18 but I'd be happy to upgrade if there is a hue and cry.

Now, Bob doesn't care to acknowledge references to what he has posted/claimed in the past, so I'll re-post my adapted FPBENCH that I used before. It is similar to your approach, Cliff, except I have removed all the printf() stuff to make more friendly for an arbitrary toolchain setup. I used the simulator ticks from breakpoint to breakpoint. (You'll probably need to make "c" volatile so it will not disappear.)

```//file fpbench.c
//test avr flops
//Mar 4 2003 Bob G (bobgardner@aol.com) compile w iccavr 6.27
//Jan 22 04 Bob G compile with iccavr 6.30
//Feb 4 04 Bob G add timer
//July 29 04 Bob G add y=mx+b

// Ripped up by Lee Theusch  29 July 2004 CV version, AVRStudio
// Assume 14.745600MHz crystal

/*
                Ms.     Net     Ops/sec
FP Mult         61.78   25.09   39857
FP Mult1        60.74   24.05   41580
FP Mult0        39.81   3.12    320513
FP Div          70.49   33.8    29586
FP sin()        318.88  282.19  3544
FP log()        299.26  262.57  3809
LONG y=mx*b     11.94   -24.75  -40404  83752
FP y=mx+b       70.22   33.53   29824
BLOCK Move      559.42  522.73  1913
*/

#include
#include
#include

#define INTR_OFF() asm("CLI")
#define INTR_ON()  asm("SEI")

//-----globals----------
unsigned int tofs;
unsigned long t1,t2;
unsigned long dt,ohdt,net;
int n,i,j;
float x,y,z,m,b;
long ix,iy,iz,im,ib;
float sec,flops;
char buf1[1024],buf2[1024];

//----------------
void main(void){
//fpbench main program
char c;

c=0;
n=1000;

c = 1; // a spot to set a breakpoint
for(i=0; i```

Results are in the linked thread. I will repeat, but it is a holiday weekend here for us Yanks and my remote-control appears to be hung up.

Lee


clawson wrote:
Does anyone have a sensible benchmark that is cross-platform we could all try?
CoreMark
Recent Atmel additions to its results database are a UC3A3 and AP7000.
IIRC, XMEGA may have some cycle count improvements;
if so, someone running CoreMark on a XMEGA A-series would put it on CoreMark's map.
But, CoreMark does not directly answer the OP's questions.

"Dare to be naïve." - Buckminster Fuller

Quote:

I will repeat,...

Bob sez it is a "good thing" to repeat his FPBENCH every decade.

Good thing too--CV has apparently improved FP multiplication (though the overhead loop lost some cycles). Below is the source file, and two sets of results for different compiler versions.

AVRstudio 4.18. SIM2. Mega128 target. (Mega644 has same results.)

These are hard result numbers from a repeatable test, so Bob will remain strangely silent until he pulls his next rule-of-thumb number out of the air, inconsistent with any he had given before.

Lee

```Loop Repetitions
1000                    CodeVision 1.25.9                       CodeVision 2.04.5

GROSS   NET     CYCLES/ OPS/            GROSS   NET     CYCLES/ OPS/
TEST            CYCLES  CYCLES  OP      MHz             CYCLES  CYCLES  OP      MHz
ADD             643475  108469  108     9219            671475  116469  116     8586
MULT           1367988  832982  833     1201            752512  197506  198     5063
MULT1           729859  194853  195     5132            749859  194853  195     5132
MULT0           581006   46000   46    21739            601006   46000   46    21739
DIV            1289682  754676  755     1325           1317682  762676  763     1311
SIN            3699126 3164120 3164      316           3783126 3228120 3228      310
LOG            4119284 3584278 3584      279           4099324 3544318 3544      282
SLOPE (long)    170022 -364984  170     5882            170022 -384984  170     5882
SLOPE (float)   849968  314962  315     3175            868966  313960  314     3185```
```#include
#include
#include

#define INTR_OFF() asm("CLI")
#define INTR_ON()  asm("SEI")

//-----globals----------
float x,y,z,m,b;
volatile char c;
long ix,iy,im,ib;
char buf1[1024],buf2[1024];

//----------------
void main(void){
//fpbench main program
int n,i,j;

c=0;
n=1000;

c = 1; // a spot to set a breakpoint
for(i=0; i```


Sheesh. It wasn't a conspiracy. I ran the test with a timer tic enabled, and Holy Smokes! If I run the timer tic real fast, my benchmark slows down. My intention was to measure the execution time of the test loop, subtract the overhead of the for loop and assignments, and printf the results. If I run it with fpbench >foo I scoop the results off to a file. The reason it looks like I kept cuttin and pastin extra cases on the end, was that I was cuttin and pastin cases on the end. I have run it on a 16mhz mega32 and on my hotwired MT128 where I changed the xtal to 18.432. I think the general results are: fp mult and add have cost 'X', divs are slower, sins and logs are slower. I was curious about whether 'short cycling' a mult by 0 or by 1 helped. I ran this exact same bench on all my pcs since the XT, and it wasnt till the pentium that fp mult by 0 and 1 got speeded up.

I post a program that calculates ops per sec, and you guys change it to ops per mhz? Cliff says my program is a rat's nest, full of literal constants and compiler dependencies. Man this is a tough crowd. If he had posted a file full of stuff like __PROGMEM__ and I gave a big rant about compiler dependent stuff, the shoe would be on the other foot. Do you object to running the timer to collect the timing info? Could I make you happier by adding an ops/mhz result to my ops per sec results?

OK, here's my results from my 16MHz mega32, using 100usec interrupt.

Attachment(s):

Imagecraft compiler user

Last Edited: Tue. Jul 5, 2011 - 07:37 PM

Quote:

The reason it looks like I kept cuttin and pastin extra cases on the end, was that I was cuttin and pastin cases on the end.

That's fine.

You still have never answered how your numbers magically got twice as fast from one post to the next.

You've responded with your "sheesh", but no numbers. As a Wise Sage opined:

Quote:
Yeah But LT... doesn't it make sense to run the tests every decade or so to keep the compiler writers honest?


I'm getting 40 flops per ms at 16mhz, that extrapolates to 46 flops per ms at 18.432. I have a theory that if I reduce the timer tic freq I can get faster results. But in the Real World, the timer will be ticking, so maybe that 40 flops per ms is the new revised Real and True BobG Rule of Thumb for fp overhead.

Imagecraft compiler user

bobgardner wrote:

so maybe that 40 flops per ms is the new revised Real and True BobG Rule of Thumb for fp overhead.

"Rule of Thumb"? ARM? :lol:

"I may make you feel but I can't make you think" - Jethro Tull - Thick As A Brick

"void transmigratus(void) {transmigratus();} // recursio infinitus" - larryvc

"It's much more practical to rely on the processing powers of the real debugger, i.e. the one between the keyboard and chair." - JW wek3

"When you arise in the morning think of what a privilege it is to be alive: to breathe, to think, to enjoy, to love." -  Marcus Aurelius

Try creative approaches specific to the 8-bit microcontroller. I suggest using tables and putting the tables into a serial EEPROM. Then use TWI to read the sine for an angle from the table in EEPROM. EEPROMs for this application cost only about $1 USD.

Try avoiding float and assume that the 32-bit result needs to be shifted to make it less than one, as a sine value is. Try using sines that are derived from angles based on a circle that has 256 degrees instead of 360. These equations might be able to be done by creative bit shifting, instead of float routine calls.

Also, don't use precision that is beyond the real-world needs of the specific application.
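
A sketch of the 256-units-per-circle idea in integer-only C follows. Everything here is an assumption made for illustration: the name `sin_q15`, the 17-entry quarter-wave table, and the Q15 scaling (32767 represents 1.0). The top two bits of the angle select the quadrant and the rest index the table, so only shifts, masks and one small multiply are needed.

```c
#include <stdint.h>

/* sin(k * 5.625 degrees) * 32767 for k = 0..16, i.e. one quarter wave. */
static const int16_t quarter_sine_q15[17] = {
        0,  3212,  6393,  9512, 12539, 15446, 18204, 20787, 23170,
    25329, 27245, 28898, 30273, 31356, 32137, 32609, 32767
};

/* Sine of an angle given in 1/256ths of a full circle; result in Q15. */
int16_t sin_q15(uint8_t angle)
{
    uint8_t quadrant = angle >> 6;          /* 0..3 */
    uint8_t pos = angle & 0x3F;             /* 0..63 within the quadrant */
    if (quadrant & 1)
        pos = (uint8_t)(64 - pos);          /* mirror in quadrants 1 and 3 */

    uint8_t idx = pos >> 2;                 /* table step: 4 angle units each */
    uint8_t frac = pos & 3;                 /* position between two entries */
    int16_t value = quarter_sine_q15[idx];
    if (frac != 0) {                        /* frac != 0 implies idx <= 15 */
        int16_t next = quarter_sine_q15[idx + 1];
        value += (int16_t)(((int32_t)(next - value) * frac) / 4);
    }
    return (quadrant & 2) ? (int16_t)-value : value;  /* lower half is negative */
}
```

On an AVR the table would normally be placed in flash rather than RAM; that detail is left out here to keep the sketch portable.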


Quote:

Try creative approaches specific to the 8-bit microcontroller. I suggest using tables and putting the tables into a serial EEPROM. Then use TWI to read the sine for an angle from the table in EEPROM. EEPROMs for this application cost only about $1 USD.

Gotta be slower. Now, might be a solution for the OP, but I'm in the middle of Bob-Bashing (tm).

Taking Bob's latest results, which are from ImageCraft version 7.23, and normalizing to OPS/MHz (by dividing the FLOPS number by 16 'cause the tests were run at 16MHz):

```
         CodeVision 2.04.5       ImageCraft ver. 7.23
TEST     OPS/MHz                 OPS/MHz

MULT     5063                    2765
MULT1    5132                    2500
MULT0   21739                    5208
DIV      1311                     993
SIN       310                      97
LOG       282                     108
```
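
For anyone wanting to reproduce the normalisation, a small sketch is given below. The function names are hypothetical. Operations per second per MHz is the measured rate divided by the clock in MHz, or equivalently one million divided by the cycles per operation.

```c
/* OPS/MHz from a cycle count per operation (e.g. from the simulator). */
double ops_per_mhz_from_cycles(double cycles_per_op)
{
    return 1e6 / cycles_per_op;
}

/* OPS/MHz from a measured rate on real hardware at a known clock. */
double ops_per_mhz_from_rate(double ops_per_sec, double f_cpu_mhz)
{
    return ops_per_sec / f_cpu_mhz;
}
```

For instance, 3228 cycles per sin() works out to about 310 OPS/MHz, and an illustrative 1600 operations per second measured at 16MHz would be 100 OPS/MHz.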

GCC, IAR, Rowley anyone?

I'd speculate IAR and Rowley will beat CV in most cases. By how much? And GCC will be somewhere between the CV and ImageCraft numbers.

Lee


Last Edited: Tue. Jul 5, 2011 - 09:32 PM

Last V7 release was 7.23. You didnt use my 'wacky timing mechanism' I assume. So my timings include the 100usec timer interrupt servicing.

Imagecraft compiler user

Quote:

So my timings include the 100usec timer interrupt servicing.

Not my problem. But anyway, if your "soft timer" has a resolution of 100us, then just setting up timer1 at /1024 gives you a 4+ second reach with a resolution of 64us. Using /256 gives 1+second reach and 16us resolution and covers all but your longest tests.

So why not just use free-running timer1, and capture the TCNT value at each end of a test run? Better resolution and no interference during the test.
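
The resolution and reach figures can be checked with a few lines of arithmetic, sketched here in portable C with made-up helper names and assuming a 16-bit counter. On the AVR itself `start` and `end` would be two reads of the timer's count register (TCNT1 for timer1); unsigned 16-bit subtraction then gives the elapsed ticks even if the counter wrapped once in between.

```c
#include <stdint.h>

/* Length of one timer tick in microseconds for a given clock and prescaler. */
double timer_tick_us(double f_cpu_hz, unsigned prescaler)
{
    return (double)prescaler * 1e6 / f_cpu_hz;
}

/* Longest interval in seconds a 16-bit counter covers before it wraps. */
double timer16_reach_s(double f_cpu_hz, unsigned prescaler)
{
    return 65536.0 * (double)prescaler / f_cpu_hz;
}

/* Ticks between two counter readings; correct across a single wrap. */
uint16_t elapsed_ticks(uint16_t start, uint16_t end)
{
    return (uint16_t)(end - start);
}
```

At 16MHz this gives 64us ticks and about 4.19s of reach with /1024, and 16us ticks and about 1.05s with /256.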

Version edited above.


Last Edited: Tue. Jul 5, 2011 - 09:48 PM

In case I have missed the plot, please can someone post some authoritative code (or a link).

I am happy to massage it into something portable. The important thing is that anyone can compile and run immediately with any AVR or compiler.

Having got a bog-standard, you can make valid comparisons.
Likewise, you can post compiler-specific tweaks and the resultant effects.

Meanwhile the current number bandying is fairly pointless.

I would have guessed that there was little difference between compiler libraries. e.g. +-30%.
I would also guess ImageCraft, CV, Rowley, GCC, IAR.
I doubt that anyone would notice any performance difference.

David.

Quote:

I am happy to massage it into something portable. The important thing is that anyone can compile and run immediately with any AVR or compiler.

Well, I posted what I used above. Cliff also did something very similar.

I don't know if you can get entirely portable, given that standard chip-include names are different. Hmmm--maybe chip-include isn't needed unless you self-time like Bob does but then you actually have to >>run<< on the same platform.

Quote:

Meanwhile the current number bandying is fairly pointless.

Quote:

I doubt that anyone would notice any performance difference.

David, I tend to rail against Bob's Grand Pronouncements. They come out as ex cathedra, but usually when you dig down the Emperor has no clothes.

From the earlier links, you can see that this has been going on since 2004 (on this exact topic). I gave my reasons for protesting when I posted the links.

No-one notices? Unless you measure. The numbers above show CV 2x or more in most tests. (As would be expected unless ImageCraft had a re-write in the past few years.)

Lee

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.

OK, Lee. What I did since the last time we talked this afternoon (How bout that Casey Anthony gettin away with murder in Orlando anyway?). I changed to using timer1 with ps=3 so its counting 1us clocks, but more importantly, its only interrupting every 65536 usecs, and my results are Real Close to yours. Guess that 100usec interrupt was really loading it down. And of course, there's nuthin up my sleeve. I'll ul the c source if someone calls me out. Its all the same except the timer and gettics.

```
fpbench July 5 11 Bob Gardner

9987 tics  0.009987 secs

15540 tics  5553 net  0.005553 secs 180082.9 flops 11518

begin 1000 fp mults
15413 tics  5426 net  0.005426 secs  184297.8 flops

begin 1000 fp mults by 1
15986 tics  5999 net  0.005999 secs  166694.5 flops

begin 1000 fp mults by 0
12847 tics  2860 net  0.002860 secs 349650.4 flops 21853

begin 1000 fp divs
25081 tics  15094 net  0.015094 secs 66251.5 flops

begin 1000 fp div by 1
25788 tics  15801 net  0.015801 secs 63287.1 flops

begin 1000 sin
164625 tics  154638 net  0.154638 secs  6466.7 sins/sec 404

begin 1000 log
148240 tics  138253 net  0.138253 secs  7233.1 logs/sec

begin 1000 sqrt
134488 tics  124501 net  0.124501 secs  8032.1 sqrts/sec

begin 1000 pow
352751 tics  342764 net  0.342764 secs  2917.5 pows/sec

begin 1000 y=mx+b using longs
3704 tics  0.003704 secs 269978.4 y=mx+b/sec

begin 1000 y=mx+b using floats
22595 tics  0.022595 secs 44257.6 y=mx+b/sec

begin 8000 1/8k block moves (1000K->1 meg)
132876 tics  0.132876 secs  7525.8 K/sec
done. any char to repeat...

```

Wednesday: these are bogus results because I was calculating the timer count wrong. Disregard.

Imagecraft compiler user

Last Edited: Wed. Jul 6, 2011 - 02:24 PM

Quote:

Well, I posted what I used above. Cliff also did something very similar.

David,

If you are going to do it, use what Lee posted as the baseline. It would be good if there was an across-the-compilers version - Lee's code is very close.

HOWEVER note that the GCC optimiser is going to throw away almost everything as it's full of pointless loops. I guess the solution is to make all the accessed variables 'volatile' but the question then arises whether this is a valid benchmark as you would not typically do that in a "real" app and you'd let the optimiser throw away anything that looked pointless. So it'll be showing GCC in a reduced light. The alternative is to not make them volatile, let the optimiser discard them and we can all sit back and say "isn't GCC amazing - it's incredibly fast!" ;-)

It doesn't recompile the math libs when you compile with optimization off, so just run it with no optimization. The loops are what evens out the data dependency in the calcs. I thought it was Real Clever.

Imagecraft compiler user

I guess I can't get my brain around how one could build a benchmark without optimization but I take your point about the optimized lib code.

Quote:

I guess the solution is to make all the accessed variables 'volatile' but the question then arises whether this is a valid benchmark as you would not typically do that in a "real" app and you'd let the optimiser throw away anything that looked pointless. So it'll be showing GCC in a reduced light. The alternative is to not make them volatile, let the optimiser discard them and we can all sit back and say "isn't GCC amazing - it's incredibly fast!" ;-)

I thought of that a bit. One could make a dummy assignment of the "result" of the last calculation in the loop to a dummy volatile variable after the loop is over. Now, if the GCC optimizer is way clever it would just make the last run through the loop ...but that would be WAY clever.
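A minimal sketch of that "dummy volatile after the loop" idea, in plain C. The names `bench_sink` and `bench_fp_adds` are hypothetical. As noted above, this only stops the optimizer from discarding the result; a sufficiently clever compiler is still free to compute just the final iteration.

```c
/* Written once after the loop so the compiler cannot treat the loop's
   final result as unused. */
static volatile float bench_sink;

/* Same shape as the fpbench add loop, but with ordinary (non-volatile)
   locals inside the loop. */
float bench_fp_adds(int n)
{
    float z = 0.0f;
    for (int i = 0; i < n; i++) {
        float x = (float)i;
        float y = (float)(n - i);
        z = x + y;
    }
    bench_sink = z;   /* the single dummy volatile assignment */
    return z;
}
```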

As benchmarks go, this exercise lacks--agreed. But as David said

Quote:
Meanwhile the current number bandying is fairly pointless.
:twisted: That said, people here would be interested in the relative cost of FP primitives and representative functions such as sin() and log() and ... .

Besides the practical value of knowing the costs in various toolchains, a round of Compiler Wars is a welcome summer diversion.

Oh, BTW, speaking of number-bandying...Bob said

Quote:
I changed to using timer1 with ps=3 so its counting 1us clocks, but more importantly, its only interrupting every 65536 usecs, and my results are Real Close to yours.

Now, from the earlier posts I got the impression that Bob was running on a Mega32 at 16MHz. (some earlier indications were a Mega128, but no matter in what follows...)

1us per timer tick, eh? /16 prescaler? 16-bit timers on Mega32 (or Mega128) don't have a /16 prescaler option. For a "ps=3" that is /64 so your final results would then be 1/4 the operations per second that the printout showed? I think the steroids are kicking in again. [David, that is exactly why I keep saying that the Emperor has no clothes...]
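The prescaler arithmetic can be checked with a few lines of C. This is only a sketch: the function names are made up, and the table is the usual Timer1 clock-select mapping on these classic megaAVR parts (selector 1..5 = /1, /8, /64, /256, /1024), so check the datasheet for any other chip.

```c
#include <stdint.h>

/* Map the Timer1 clock-select value (CS12:CS10) to its divisor.
   Returns 0 for "stopped" (0) and the external-clock settings (6, 7). */
uint16_t timer1_prescaler(uint8_t cs_bits)
{
    static const uint16_t divisor[8] = { 0, 1, 8, 64, 256, 1024, 0, 0 };
    return divisor[cs_bits & 7u];
}

/* Timer tick period in nanoseconds, or 0 if the timer is not clocked
   from the prescaler. */
uint32_t tick_period_ns(uint32_t f_cpu_hz, uint8_t cs_bits)
{
    uint16_t div = timer1_prescaler(cs_bits);
    if (div == 0)
        return 0;
    return (uint32_t)(((uint64_t)div * 1000000000ULL) / f_cpu_hz);
}
```

For a selector of 3 at 16MHz this gives a divisor of 64 and a 4000ns (4us) tick, which is where the factor of 4 comes from.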

Lee

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.

Quote:

Now, from the earlier posts I got the impression that Bob was running on a Mega32 at 16MHz.

I was led to believe it was 18.432MHz actually running a mega32 out of spec.

Quote:
OK, here's my results from my 16MHz mega32, using 100usec interrupt.

resultsJuly5.txt

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.

Yet the .c he posted and I did battle with seemed to be setting the UART and timers as if it were 18.432. What's more it was hard-coded numbers rather than something calculated from F_CPU. I do hope Bob remembers to update those numbers each time he changes clock speed (though I guess the UART one will "bite").

I appreciate the feedback. Having several sets of eyes look at the timer init keeps the numbers correct. My previous slow results were on a mega32, 16MHz xtal, and timer0 interrupting every 100usec on a compare. The newer faster results were the result of using timer 1 with a 3 in the ps selector and ovf int enabled. How fast does that one count? 250KHz 4usec? If so, I guess those last results were 4x too big. Sorry. F_CPU is a gcc thing. Doesn't help me. I pick 16MHz from a pulldown in the IDE and run the AppBuilder. That's what spits out the timer init numbers. I guess you have to trust it or check up on it?

Imagecraft compiler user

Quote:

F_CPU is a gcc thing.

No it isn't. Any AVR program on any C compiler can do:

`#define F_CPU 1843200UL`

if it chooses to. It doesn't have to be F_CPU - that particular name just happens to be the one the AVR-LibC authors chose way back when. Call it:

`#define MY_CLOCK_IS_RUNNING_AT 1843200UL`

if you prefer, but unless your build environment already provides such a macro then defining one is a "Good Idea(tm)" because when you change the crystal or the CKSEL fuses you change one number in one place for the entire project and all your timers and UARTs and anything else using numbers based on the core CPU speed fall into line.

Quote:

That's what spits out the timer init numbers

With a comment I hope! But it's not the best solution for the reason I just gave. Having one system wide defined macro from which everything is derived (at compile time) makes it less likely that you forget to fix up one of the calculations when you change the clock.
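As a sketch of what "everything derived from one macro" can look like: the 16MHz value and the macro names `UBRR_VALUE` and `TIMER_COMPARE` below are illustrative assumptions, not anything from Bob's AppBuilder output. The UBRR expression is the usual normal-speed asynchronous formula, F_CPU/(16*baud) - 1, with rounding to nearest.

```c
/* Assumed clock for this example; this is the only line to change
   when the crystal or CKSEL fuses change. */
#define F_CPU 16000000UL

/* USART baud-rate register value, normal-speed asynchronous mode,
   rounded to the nearest integer. */
#define UBRR_VALUE(baud)  ((F_CPU + 8UL * (baud)) / (16UL * (baud)) - 1UL)

/* Compare value for a periodic timer interrupt in CTC mode:
   the timer counts from 0 to this value, then interrupts. */
#define TIMER_COMPARE(prescale, rate_hz) \
    (F_CPU / ((prescale) * (rate_hz)) - 1UL)
```

For example, `UBRR_VALUE(9600UL)` evaluates to 103 and `TIMER_COMPARE(8UL, 10000UL)` (a 100us tick with a /8 prescaler) evaluates to 199, both worked out by the compiler at build time.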

Taking Bob's latest test run numbers and correcting for the 4us timer tick, the results end up entirely consistent with his earlier numbers with the fast timer tick. (Servicing the fast timer tick was <10% effect.)

```
        CodeVision   ImageCraft   -------- ImageCraft using timer --------
        2.04.5       7.23         4us      NET       CYCLES/   OPS/
TEST    OPS/MHz      OPS/MHz      TICKS    CYCLES    OP        MHz

ADD       8586         2693         5553    355392     355     2814
MULT      5063         2765         5426    347264     347     2880
MULT1     5132         2500         5999    383936     384     2605
MULT0    21739         5208         2860    183040     183     5463
DIV       1311          993        15094    966016     966     1035
SIN        310           97       154638   9896832    9897      101
LOG        282          108       138253   8848192    8848      113
```
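For anyone wanting to redo the normalisation, the arithmetic behind the right-hand columns is roughly as follows. The function names are made up for this sketch; the factor of 64 is the prescaler implied by the 4us tick at 16MHz.

```c
#include <stdint.h>

/* Timer ticks -> CPU cycles for a given prescaler. */
uint32_t ticks_to_cycles(uint32_t ticks, uint16_t prescaler)
{
    return ticks * prescaler;
}

/* Operations per second per MHz of CPU clock.
   This equals 1e6 divided by the cycles needed for one operation. */
double ops_per_mhz(uint32_t net_cycles, uint32_t iterations)
{
    return 1.0e6 * (double)iterations / (double)net_cycles;
}
```

The ADD row, for instance: 5553 ticks * 64 = 355392 cycles for 1000 operations, so about 355 cycles per operation and about 2814 OPS/MHz.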

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.

Quote:

Taking Bob's latest test run numbers

That wouldn't be the set that now contains the edit:
Quote:
Wednesday: these are bogus results because I was calculating the timer count wrong. Disregard.

would it?


Yes, it is. As I mentioned, I accounted for the 4us tick and assumed the tick count was correct. After letting the spreadsheet grind the results are entirely consistent with earlier normalized results. And indicate that rather than the steroids taking effect overnight to give a 4x boost, it was just PhotoShop.

Lee

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.

So CV is 3X faster. I'll be darned. I guess I get 45 flops per ms and you get 137 flops per ms at 16MHz. I only get 1 or 2 sines per ms, you get 3X that. I hope this info gives someone an idea of how to account for fp calc time. Thanks for checking my numbers.

Imagecraft compiler user

These are some results in a tabular form. I agree with Lee. It is only sensible when normalised to OPS/MHz.
Note that the GCC code is only really comparable with 'volatile' global variables. Even so, f-p operations are heavily dependent on the actual values being used.

```
Hello CodeVision v2.04.9a running @16.000MHz
cycles     secs     flops  OPS/MHz [iter] operation
581541 0.036346   27513.1   1719.6 [1000] overhead loops
698114 0.043632   22918.9   1432.4 [1000] fp adds
781203 0.048825   20481.2   1280.1 [1000] fp mults
786550 0.049159   20342.0   1271.4 [1000] fp mults by 1
637593 0.039850   25094.4   1568.4 [1000] fp mults by 0
1279253 0.079953   12507.3    781.7 [1000] fp divs
1289148 0.080572   12411.3    775.7 [1000] fp div by 1
4011333 0.250708    3988.7    249.3 [1000] sin
4352420 0.272026    3676.1    229.8 [1000] log
8834207 0.552138    1811.1    113.2 [1000] sqrt
8310156 0.519385    1925.4    120.3 [1000] pow
172245 0.010765   92890.9   5805.7 [1000] y=mx+b using longs
895764 0.055985   17861.8   1116.4 [1000] y=mx+b using floats
10329858 0.645616       1.5      0.1 [   1] memcpy 1MB

Hello GCC 20100110 running @16.000MHz
cycles     secs     flops  OPS/MHz [iter] operation
213522 0.013345   74933.7   4683.4 [1000] overhead loops
249239 0.015577   64195.4   4012.2 [1000] fp adds
277591 0.017349   57638.8   3602.4 [1000] fp mults
147925 0.009245  108162.9   6760.2 [1000] fp mults by 1
282588 0.017662   56619.5   3538.7 [1000] fp mults by 0
684928 0.042808   23360.1   1460.0 [1000] fp divs
147925 0.009245  108162.9   6760.2 [1000] fp div by 1
1983584 0.123974    8066.2    504.1 [1000] sin
2477275 0.154830    6458.7    403.7 [1000] log
688736 0.043046   23231.0   1451.9 [1000] sqrt
557606 0.034850   28694.1   1793.4 [1000] pow
173771 0.010861   92075.2   5754.7 [1000] y=mx+b using longs
405890 0.025368   39419.5   2463.7 [1000] y=mx+b using floats
8002770 0.500173       2.0      0.1 [   1] memcpy 1MB

Hello GCC 20100110 volatile running @16.000MHz
cycles     secs     flops  OPS/MHz [iter] operation
203926 0.012745   78459.8   4903.7 [1000] overhead loops
245643 0.015353   65135.2   4070.9 [1000] fp adds
339592 0.021225   47115.4   2944.7 [1000] fp mults
203926 0.012745   78459.8   4903.7 [1000] fp mults by 1
338589 0.021162   47254.9   2953.4 [1000] fp mults by 0
681332 0.042583   23483.4   1467.7 [1000] fp divs
269523 0.016845   59364.1   3710.3 [1000] fp div by 1
2041591 0.127599    7837.0    489.8 [1000] sin
2469685 0.154355    6478.6    404.9 [1000] log
746743 0.046671   21426.4   1339.1 [1000] sqrt
617610 0.038601   25906.3   1619.1 [1000] pow
218818 0.013676   73120.1   4570.0 [1000] y=mx+b using longs
465907 0.029119   34341.6   2146.4 [1000] y=mx+b using floats
8068306 0.504269       2.0      0.1 [   1] memcpy 1MB
```

This is my 'adjusted' version of Bob's code. I was running on a 16MHz mega168.

```
#include <stdio.h>               // printf()
#include <math.h>                // sin(), log(), sqrt(), pow()
// this header handles system includes and F_CPU, ISR() macros
#include "../scripts/portab_kbv.h"
#if defined(__CODEVISIONAVR__)
#define TIMER1_OVF_vect	TIM1_OVF
#define COMPILER "CodeVision v2.04.9a"
#elif defined(__IMAGECRAFT__)
#define TIMER1_OVF_vect	iv_TIM1_OVF
#define COMPILER "ImageCraft v7"
#elif defined(__GNUC__)
#define COMPILER "GCC 20100110 volatile"
#elif defined(__IAR_SYSTEMS_ICC__)
#define COMPILER "IAR 5.3"
#endif

//-----globals----------
unsigned int tofs;
unsigned long t1, t2;
unsigned long dt, ohdt, net;
volatile int n, i, j;
volatile float x, y, z, m, b;
volatile long ix, iy, iz, im, ib;
float sec, flops;
char buf1[32], buf2[32];         // tiny buffers cos m168 has little SRAM

#include <string.h>              // memcpy()

extern void initstdio(void);     // start up stdio to USART @ 9600 baud
float secs, secpertic = 1.0 / F_CPU;
unsigned long overflows;

unsigned long gettics(void)
{
unsigned long time;
CLI();
TCCR1B = 0;                  // stop timer
time = overflows + TCNT1L;
time += (TCNT1H<<8u);
TCCR1B = (1<<CS10);          // start again
SEI();                       // with overflows
return time;
}

ISR(TIMER1_OVF_vect)
{
overflows += 65536u;
}

// yes,  I know that there are function overheads for both report() and gettics()
// I could use a macro to stop/start Timer1 if anyone is worried.
void report(char *title)
{
t2 = gettics();
dt = t2 - t1;
net = dt - ohdt;
secs = net * secpertic;
flops = n / secs;
printf("% 8lu % 8.6f % 9.1f % 8.1f [% 4d] % s\r\n", net, secs, flops, flops * 1e6 / F_CPU, n, title);
}

void main(void)
{
//fpbench main program
volatile char c;
initstdio();
TCCR1B = (1<<CS10);
TIMSK1 = (1<<TOIE1);
SEI();
printf("\r\nHello " COMPILER " running @% 6.3fMHz\r\n", 0.000001 * F_CPU);
c = 0;
n = 1000;

while (c != 'q') {
n = 1000;
printf("  cycles     secs     flops  OPS/MHz [iter] operation\r\n");
t1 = gettics();
for (i = 0; i < n; i++) {
j = n - i;
x = i;
y = j;
z = x;
}
report("overhead loops");

t1 = gettics();
for (i = 0; i < n; i++) {
j = n - i;
x = i;
y = j;
z = x + y;
}
report("fp adds");

t1 = gettics();
for (i = 0; i < n; i++) {
j = n - i;
x = i;
y = j;
z = x * y;
}
report("fp mults");

t1 = gettics();
for (i = 0; i < n; i++) {
j = n - i;
x = i;
y = j;
z = x * 1;
}
report("fp mults by 1");

t1 = gettics();
for (i = 0; i < n; i++) {
j = n - i;
x = i;
y = j;
z = x * 0;
}
report("fp mults by 0");

t1 = gettics();
for (i = 0; i < n; i++) {
j = n - i;
x = i;
y = j;
z = x / y;
}
report("fp divs");

t1 = gettics();
for (i = 0; i < n; i++) {
j = n - i;
x = i;
y = j;
z = x / 1.0;        //this didnt work with 1
}
report("fp div by 1");

t1 = gettics();
for (i = 0; i < n; i++) {
j = n - i;
x = i;
y = j;
z = sin(x);
}
report("sin");

t1 = gettics();
for (i = 0; i < n; i++) {
j = n - i;
x = i;
y = j;
z = log(x);
}
report("log");

t1 = gettics();
for (i = 0; i < n; i++) {
j = n - i;
x = i;
y = j;
z = sqrt(x);
}
report("sqrt");

t1 = gettics();
for (i = 0; i < n; i++) {
j = n - i;
x = i;
y = j;
z = pow(x, .5);
}
report("pow");

t1 = gettics();
ib = 1;
for (i = 0; i < n; i++) {
j = n - i;
ix = i;
im = j;
iy = im * ix + ib;
}
report("y=mx+b using longs");

t1 = gettics();
b = 1.0;
for (i = 0; i < n; i++) {
j = n - i;
x = i;
m = j;
y = m * x + b;
}
report("y=mx+b using floats");

n = 1;
t1 = gettics();
for (i = 0; i < n; i++) {
for (j = 0; j < 32767; j++) {
memcpy(buf2, buf1, sizeof(buf1));   //took 5 sec for 10 megs... about 2 megs/sec
}
}
report("memcpy 1MB");
c = 'q';
}
}
```

Yes. It would be very wise to re-write the 'tests' to confound any optimiser.
No. I cannot compile with either IAR or ImageCraft cos I only have evaluation versions.

David.

Hmmm--I'll have to poke at this, and see why your CV numbers don't mesh with mine using the simulator.

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.

Personally, I would reset the timer before each test. You run a certain risk of granularity with the overflow ISR(). i.e. if your function just happens to start or finish at an overflow.
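One way to picture the overflow risk: if the counter rolls over just as interrupts are being disabled, the overflow flag is set but the ISR has not yet added its 65536, so the reading comes out one full timer period short. A hedged sketch of the correction, written so it runs on a PC (the function name and the idea of passing the flag in as a parameter are for illustration only; on hardware the flag would be the Timer1 overflow flag, read after the timer is stopped):

```c
#include <stdint.h>
#include <stdbool.h>

/* Combine the ISR-maintained overflow total with the raw counter value.
   overflow_cycles : running total kept by the overflow ISR
   tcnt            : 16-bit counter value read with the timer stopped
   overflow_pending: true if the overflow flag is set but the ISR has
                     not run yet (interrupts were disabled in time) */
uint32_t total_cycles(uint32_t overflow_cycles, uint16_t tcnt,
                      bool overflow_pending)
{
    uint32_t t = overflow_cycles + tcnt;
    if (overflow_pending)
        t += 65536UL;   /* account for the rollover the ISR missed */
    return t;
}
```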

I also see no great point in counting to the nearest cycle since you are already doing 1000 iterations.

In fact I have always just timed a single operation. Providing you stop the timer immediately, you can take as long as you like to process any results.

No. I cannot understand the difference in results. I am fairly confident that I have my calculations ok. OTOH, the timed 'sequences' are fairly pointless. I also compiled for a mega128. The result is about the same. Stupid loops or arithmetic are slightly different.

```
Hello CodeVision v2.04.9a running @16.000MHz
cycles     secs     flops  OPS/MHz [iter] operation
709154 0.044322   22562.1   1410.1 [1000] overhead loops
694603 0.043413   23034.7   1439.7 [1000] fp adds
777692 0.048606   20573.7   1285.9 [1000] fp mults
848627 0.053039   18854.0   1178.4 [1000] fp mults by 1
765206 0.047825   20909.4   1306.8 [1000] fp mults by 0
1341330 0.083833   11928.5    745.5 [1000] fp divs
1416761 0.088548   11293.4    705.8 [1000] fp div by 1
4073410 0.254588    3927.9    245.5 [1000] sin
4348909 0.271807    3679.1    229.9 [1000] log
8896284 0.556018    1798.5    112.4 [1000] sqrt
8241109 0.515069    1941.5    121.3 [1000] pow
242322 0.015145   66027.8   4113.2 [1000] y=mx+b using longs
965841 0.060365   16565.9   1035.4 [1000] y=mx+b using floats
10887433 0.680465       1.5      0.1 [   1] memcpy 1MB
```

Of course Pavel may have changed the compiler between versions. Both my mega168 and mega128 builds are small model, min size, max optimisation.

David.

Quote:

765206 0.047825 20909.4 1306.8 [1000] fp mults by 0

Besides just the general differences by a factor between your results and mine with the simulator (no >>wonder<< Bob doesn't like to use the simulator :twisted: ), this particular one makes no sense. In all my runs since 2004 on this "fpbench" it appeared as though MULT0 was a special case, and 4x the normal mult. Your results are [roughly] the same as and even worse than "normal".

Gonna have to set this up...

Lee

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.

I am going down to the pub.

I might change the 't1= gettics()' to 'stop();TCNT1=0;overflows=0;start();'
This would alleviate the overflows. (as would prescaling the timer)
Likewise, what is the point of 1000 iterations?
Nothing overflows with n=1.

I may re-arrange tomorrow so that I can compile with 4k IAR or ICC evaluations.

David.

The loop across n is the clever part I invented. See how it multiplies 1 x 999 then 2 x 998 then 3 by 997? This eliminates/averages out the speed dependency on data.

Imagecraft compiler user

David--

Didja get this warning--

Quote:
Warning: C:\AtmelC\dp.c(101): overflow is possible in 16 bit shift left, casting shifted operand to 'long' may be required

`    time += (TCNT1H<<8u);`
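What the warning is getting at: on a target where int is 16 bits, the shift is done in a signed 16-bit int, so a high byte of 0x80 or more does not fit and the value may come out negative before it is added to the 32-bit total. The sketch below models that on a PC with an explicit int16_t; both function names are hypothetical, and the "bad" variant only imitates what a 16-bit-int compiler may do (strictly, the original expression is undefined behaviour there).

```c
#include <stdint.h>

/* Safe: widen the high byte before shifting. */
uint32_t combine_safe(uint32_t overflows, uint8_t lo, uint8_t hi)
{
    return overflows + lo + ((uint32_t)hi << 8);
}

/* Model of the problem case: the shifted value is squeezed into a
   signed 16-bit int, then sign-extended when added to the total. */
uint32_t combine_as_int16(uint32_t overflows, uint8_t lo, uint8_t hi)
{
    int16_t shifted = (int16_t)(uint16_t)((uint16_t)hi << 8);
    return overflows + lo + (uint32_t)(int32_t)shifted;
}
```

With `hi = 0x80` the model comes out 65536 lower than the safe version, i.e. exactly one timer period is lost whenever the counter is in its upper half.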

First run results below, and are consistent with your CV results. Mega1280 running at 7.3728MHz, using USART1 in RS485 mode. ;) Now, why are they so far off from my cycle-counting...?

```
Hello CodeVision v2.04.5 running @ 7.373MHz
cycles     secs     flops  OPS/MHz [iter] operation
589535 0.079961   12506.1   1696.3 [ 1000] overhead loops
772646 0.104797    9542.3   1294.3 [ 1000] fp adds
788199 0.106906    9354.0   1268.7 [ 1000] fp mults
785546 0.106546    9385.6   1273.0 [ 1000] fp mults by 1
636589 0.086343   11581.7   1570.9 [ 1000] fp mults by 0
1288249 0.174730    5723.1    776.2 [ 1000] fp divs
1355680 0.183876    5438.5    737.6 [ 1000] fp div by 1
3821205 0.518284    1929.4    261.7 [ 1000] sin
4137663 0.561206    1781.9    241.7 [ 1000] log
8836203 1.198487     834.4    113.2 [ 1000] sqrt
7769100 1.053752     949.0    128.7 [ 1000] pow
204239 0.027702   36098.9   4896.2 [ 1000] y=mx+b using longs
839168 0.113819    8785.8   1191.7 [ 1000] y=mx+b using floats
10559504 1.432224       0.7      0.1 [   1] memcpy 1MB
```

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.

OK, it makes more sense now. I think you are missing a line--"ohdt" was never set so the counts are the overhead >>plus<< the time for the operations.

```
report("overhead loops");
ohdt = dt;
```

...and then I get

```
Hello CodeVision v2.04.5 running @ 7.373MHz
cycles     secs     flops  OPS/MHz [iter] operation
589535 0.079961   12506.1   1696.3 [ 1000] overhead loops
183111 0.024836   40264.1   5461.2 [ 1000] fp adds
198664 0.026946   37111.9   5033.6 [ 1000] fp mults
196011 0.026586   37614.2   5101.8 [ 1000] fp mults by 1
4294948762 582.539672       1.7      0.2 [ 1000] fp mults by 0
698714 0.094769   10552.0   1431.2 [ 1000] fp divs
700609 0.095026   10523.4   1427.3 [ 1000] fp div by 1
3297258 0.447219    2236.0    303.3 [ 1000] sin
3548128 0.481246    2077.9    281.8 [ 1000] log
8246668 1.118526     894.0    121.3 [ 1000] sqrt
7179565 0.973791    1026.9    139.3 [ 1000] pow
4294582000 582.489929       1.7      0.2 [ 1000] y=mx+b using longs
249633 0.033859   29534.6   4005.9 [ 1000] y=mx+b using floats
9969969 1.352264       0.7      0.1 [   1] memcpy 1MB
```

The anomalies happen when e.g. "using longs" the time ends up less than the overhead. Got to look at that MULT0 a bit.
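The 4294948762 figure is what unsigned 32-bit subtraction produces when the result would be negative. A small sketch (the helper name is made up; the `dt` used in the comment is back-calculated from the printed value, not a posted measurement):

```c
#include <stdint.h>

/* Net cycles as a signed quantity, so "test ran shorter than the
   overhead loop" shows up as a small negative number rather than a
   value just below 2^32. */
int32_t net_cycles_signed(uint32_t dt, uint32_t ohdt)
{
    return (int32_t)(dt - ohdt);
}

/* Example: with ohdt = 589535 and dt = 571001,
   unsigned: dt - ohdt = 4294948762
   signed:   dt - ohdt = -18534 */
```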

But now I don't really care too much. The numbers in general for the operations are consistent with the ones I posted earlier.

Did the GCC version have the overhead subtracted?


You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.

Thanks for the comparison work folks. I think we are getting good results that we can quote with confidence now. The original question was about how to get a sin function either faster or more accurately (cant have both!). In the lo res mono chrome graphics I've done, rotating a box on a 240x160 screen looked ok with a 1 degree table, each table entry was a signed int, representing a fraction 0 to .9999. The SG workstations running opengl used a tenth degree table, giving 3600 angles per circle, but they had plenty of ram. If the Original Poster reads this, I hope it answers the original question.

Imagecraft compiler user

Quote:

In the lo res mono chrome graphics I've done, rotating a box on a 240x160 screen looked ok with a 1 degree table, each table entry was a signed int, representing a fraction 0 to .9999. The SG workstations running opengl used a tenth degree table, giving 3600 angles per circle, but they had plenty of ram. If the Original Poster reads this, I hope it answers the original question.

I'd have to dig out some >>real<< old stuff which I don't think I have anymore, but what works quite well for rotating a wire-frame is to examine the transformation equation and look for simplification. IIRC the key was that the cos of the rotation angle was an important part of that equation, and cos and cos*cos of a small angle is so close to 1 that you just use 1 and then the calculations get much faster.

When you finally stop and settle, then you have the time to redraw it "right".
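A sketch of that kind of shortcut for a 2-D point, from memory of the general technique rather than the original code: take cos(theta) as 1 and, as an extra assumption of this example, sin(theta) as theta itself (theta in radians and small). The type and function names are made up.

```c
/* A 2-D point. */
typedef struct { float x, y; } point2;

/* One small rotation step about the origin.
   Exact form:   x' = x*cos(t) - y*sin(t),  y' = y*cos(t) + x*sin(t)
   Approximated: cos(t) ~ 1, sin(t) ~ t, so no trig calls are needed. */
point2 rotate_small(point2 p, float theta)
{
    point2 r;
    r.x = p.x - p.y * theta;
    r.y = p.y + p.x * theta;
    return r;
}
```

Each step stretches the figure very slightly (by a factor of sqrt(1 + theta^2)), which is why the final redraw with the exact transformation is still needed.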

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.

Quote:
OK, it makes more sense now. I think you are missing a line--"ohdt" was never set so the counts are the overhead >>plus<< the time for the operations.

Yup. That is the cause of the anomaly. I ignored the 'loop overhead' which is actually a significant proportion of the execution time.

IMHO, you need to devise tests that do not rely on subtracting two similar times. And quite honestly you can time an individual C statement easily. Just start() and stop() the Timer1 (effectively SBI and CBI). You can still accumulate the times for several input values, and of course separate functions and automatic casts.

The extended loop idea is best suited to a low-resolution system timer. Years ago, I used to profile 68000 programs using just a 200Hz system tick.

Since you can count AVR cycles exactly with Timer1, there is no need to do the statistical approach.

These results are obtained by adding the 'ohdt=dt;' loop overhead statement, and removing the 'z = x;' from the overhead. This makes the ohdt subtraction give +ve answers except in the 'y=mx+b using longs' test.

```
Hello GCC 20100110 volatile running @16.000MHz
cycles     secs     flops  OPS/MHz [iter] operation
187921 0.011745   85142.2   5321.4 [1000] overhead loops
123314 0.007707  129750.1   8109.4 [1000] fp adds
151666 0.009479  105495.0   6593.4 [1000] fp mults
16061 0.001004  996201.9  62262.6 [1000] fp mults by 1
85066 0.005317  188089.3  11755.6 [1000] fp mults by 0
493406 0.030838   32427.7   2026.7 [1000] fp divs
16000 0.001000  999999.9  62500.0 [1000] fp div by 1
1853665 0.115854    8631.5    539.5 [1000] sin
2347356 0.146710    6816.2    426.0 [1000] log
493220 0.030826   32439.9   2027.5 [1000] sqrt
495220 0.030951   32308.9   2019.3 [1000] pow
4294932591 268.433290       3.7      0.2 [1000] y=mx+b using longs
277981 0.017374   57557.9   3597.4 [1000] y=mx+b using floats
7880441 0.492528       2.0      0.1 [   1] memcpy 1MB

Hello ImageCraft v8.02 running @    16MHz
cycles     secs     flops  OPS/MHz [iter] operation
654909 0.040932   24430.9   1526.9 [1000] overhead loops
372786 0.023299   42920.1   2682.5 [1000] fp adds
364626 0.022789   43880.6   2742.5 [1000] fp mults
401324 0.025083     39868   2491.8 [1000] fp mults by 1
200217 0.012514   79913.3   4994.6 [1000] fp mults by 0
984073 0.061505     16259   1016.2 [1000] fp divs
1028390 0.064274   15558.3    972.4 [1000] fp div by 1
8149031 0.509314    1963.4    122.7 [1000] sin
7693782 0.480861    2079.6      130 [1000] log
6830648 0.426915    2342.4    146.4 [1000] sqrt
18978351 1.186147     843.1     52.7 [1000] pow
4294580773 268.411254       3.7      0.2 [1000] y=mx+b using longs
824758 0.051547   19399.6   1212.5 [1000] y=mx+b using floats
9382970 0.586436       1.7      0.1 [   1] memcpy 1MB

Hello CodeVision v2.04.9a running @16.000MHz
cycles     secs     flops  OPS/MHz [iter] operation
611565 0.038223   26162.4   1635.1 [1000] overhead loops
148625 0.009289  107653.5   6728.3 [1000] fp adds
231714 0.014482   69050.6   4315.7 [1000] fp mults
237061 0.014816   67493.2   4218.3 [1000] fp mults by 1
88052 0.005503  181710.8  11356.9 [1000] fp mults by 0
795352 0.049709   20116.9   1257.3 [1000] fp divs
739607 0.046225   21633.1   1352.1 [1000] fp div by 1
3454844 0.215928    4631.2    289.4 [1000] sin
3730395 0.233150    4289.1    268.1 [1000] log
8277666 0.517354    1932.9    120.8 [1000] sqrt
7688079 0.480505    2081.1    130.1 [1000] pow
4294598052 268.412353       3.7      0.2 [1000] y=mx+b using longs
346280 0.021642   46205.4   2887.8 [1000] y=mx+b using floats
10275872 0.642242       1.6      0.1 [   1] memcpy 1MB
```

I downloaded the current ICCAVR v8. The comparison with avr-gcc is revealing! If anyone has an IAR licence, we could see how effective IAR is.

Quite honestly, these tests are not the best way of timing C library functions. OTOH, they satisfy Bob's methodology.

David.

But now, when someone says "I'm trying to run my autopilot at 200Hz, and it looks like I need to do about 300 fp ops. How can I tell if it busts my time budget?" we can tell him he can do 100 fp adds and mults per ms with GCC, so he has 3ms worth of fp calcs in his 5ms loop.
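That budget check is simple enough to write down. A sketch with made-up function names, using the round figure of 100 fp operations per millisecond from above:

```c
/* Time spent on floating-point work per loop pass, in milliseconds. */
double fp_time_ms(unsigned ops_per_loop, double ops_per_ms)
{
    return ops_per_loop / ops_per_ms;
}

/* Loop period in milliseconds for a given loop rate in Hz. */
double loop_period_ms(double loop_hz)
{
    return 1000.0 / loop_hz;
}

/* Non-zero if the fp work fits inside one loop period. */
int fits_budget(unsigned ops_per_loop, double ops_per_ms, double loop_hz)
{
    return fp_time_ms(ops_per_loop, ops_per_ms) <= loop_period_ms(loop_hz);
}
```

For 300 operations at 100 per ms in a 200Hz loop: 3ms of a 5ms period, so it fits with 2ms left for everything else. Note that sin(), log() and friends cost far more than an add or a multiply, so they need to be counted separately.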

Imagecraft compiler user

Those results definitely confirm my reservations about ICC ;-)

I re-wrote the fpbench program to cycle-count in a more sensible way. i.e. count cycles for a specific sequence.

```
#define INIT_TIMER()     TCCR1A = 0; TIMSK = (1<<TOIE1)
#define START_TIMER()    TCCR1B = (1<<CS10)
#define STOP_TIMER()     TCCR1B = 0
#define CYCLE_COUNT(seq) {START_TIMER(); seq; STOP_TIMER();}
...
for (i = 0; i < n; i++) {
j = n - i;
x = i;
y = j;
CYCLE_COUNT(z = sqrt(x));
}
report("sqrt");

```

I have made no attempt to subtract the 'counting' overhead. As you can see, it is only 1 or 2 cycles per iteration anyway.

```
Hello ImageCraft v8.02 running @    16MHz
cycles     secs     flops  OPS/MHz [iter] operation
2000 0.000125 8000000.5   500000 [1000] overhead loops
373523 0.023345   42835.4   2677.2 [1000] fp adds
365363 0.022835   43792.1     2737 [1000] fp mults
984502 0.061531   16251.9   1015.7 [1000] fp divs
7690263 0.480641    2080.6      130 [1000] log
6827695 0.426731    2343.4    146.5 [1000] sqrt
18968368 1.185523     843.5     52.7 [1000] pow
197055 0.012316   81195.6   5074.7 [1000] y=mx+b using longs
825247 0.051578   19388.1   1211.8 [1000] y=mx+b using floats

Hello GCC 20100110 volatile running @16.000MHz
cycles     secs     flops  OPS/MHz [iter] operation
1000 0.000063 15999999.0 999999.9 [1000] overhead loops
124235 0.007765  128788.2   8049.3 [1000] fp adds
152630 0.009539  104828.7   6551.8 [1000] fp mults
494280 0.030893   32370.3   2023.1 [1000] fp divs
2347665 0.146729    6815.3    426.0 [1000] log
494094 0.030881   32382.5   2023.9 [1000] sqrt
496094 0.031006   32252.0   2015.7 [1000] pow
85043 0.005315  188140.1  11758.8 [1000] y=mx+b using longs
278897 0.017431   57368.9   3585.6 [1000] y=mx+b using floats

Hello CodeVision v2.04.9a running @16.000MHz
cycles     secs     flops  OPS/MHz [iter] operation
2000 0.000125 7999999.5 500000.0 [1000] overhead loops
142573 0.008911  112223.2   7014.0 [1000] fp adds
223662 0.013979   71536.5   4471.0 [1000] fp mults
789300 0.049331   20271.1   1266.9 [1000] fp divs
3716299 0.232269    4305.4    269.1 [1000] log
8271666 0.516979    1934.3    120.9 [1000] sqrt
7665039 0.479065    2087.4    130.5 [1000] pow
118052 0.007378  135533.5   8470.8 [1000] y=mx+b using longs
332208 0.020763   48162.6   3010.2 [1000] y=mx+b using floats
```

Note that you can 'count' several statements at once. Simply enclose the argument within braces:

```
CYCLE_COUNT({j = n - i; ix = i; im = j;});
```

In all honesty, this is simpler than using the Simulator. You can 'correct' the cycle count if you are obsessive. You can also time real-life peripherals too. Of course it uses a 16-bit timer, but you can use an 8-bit timer with prescale. You are already counting the overflows. Just change the MACRO()s

The method works ok with JTAG too.

David.

It must be time to check the generated code!
An FP mul should not even take 50 clk!

I have never really bothered to trace compiler private functions.

From my results, the C statement 'z = x * y' takes 152 cycles with GCC on average.
Bear in mind that you have to push two f-p values onto the stack, evaluate an expression, then store back into the f-p variable's memory.
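To turn the table totals into a per-statement figure and a time at a given clock, something like the following is enough (function names are made up for this sketch). The totals are for 1000 iterations and include the loads and stores of the volatile variables.

```c
#include <stdint.h>

/* Average cycles for one pass of the timed statement. */
double cycles_per_op(uint32_t total_cycles, uint32_t iterations)
{
    return (double)total_cycles / (double)iterations;
}

/* Execution time in microseconds at a given CPU clock. */
double op_time_us(double cycles, uint32_t f_cpu_hz)
{
    return cycles * 1.0e6 / (double)f_cpu_hz;
}
```

For example, the GCC 'fp mults' total of 152630 gives about 152 to 153 cycles per statement. Taking the GCC sin total of 1853665 from the earlier loop-subtracted table gives about 1854 cycles per call, which at the 1MHz clock in the original question is roughly 1.85ms.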

I can only suggest that you submit some better ASM code to the GCC team.

In the interests of ImageCraft and Codevision, you could offer it to Richard and Pavel too.

It would be nice to think that AVR compilers used the best possible internal math functions. OTOH, this requires some effort. Very few people use f-p, and when they do, it is seldom critical.

The first Z80, 68000, 8086 compilers never had optimum software f-p. Commercial competition ensured that vendors spent money on it.

Nowadays, PCs have hardware f-p. I doubt that anyone would bother writing software f-p from scratch today.

All the same, I would put f-p efficiency way down my list of priorities.

David.

I just find it sad that a 6502 BASIC program can do sin(x) in 40 bit format (that's without a HW MUL!) in about 10000 clk and the fastest AVR code (32bit FP) only is about 5 times faster!

Quote:
From my results, the C statement 'z = x * y' takes 152 cycles with GCC on average.

I hope not!
I assume that a big part of the time goes on converting int to FP (the normalisation takes some time).
The max/min for the FP mul should be very close to the average unless it's a special case like 0 or 1; there are at most 2 shifts in the normalisation of the result.

Quote:

only is about 5 times faster!

It's still just an 8bit micro and a RISC rather than a CISC - why would you expect it to be even as good as a 6502 let alone 5 times faster?

A 6502 manages about 1/2 MIPS per MHz.
No HW mul.
And it was BASIC, so a big overhead.

Edit: and the format was 32 bit, not 24 bit, for the mul itself.

I thought commercial compilers outperform the free ones in all aspects (except cost).

George.

@sparrow2,

All the algorithms for software f-p have been known for years. They require little change from one 8-bit CPU to another.

OTOH, I have no interest or inclination to write the required optimal ASM for an AVR core.

If I did have the inclination, then I am sure it is not too difficult. The AVR world is waiting for you to do the work.

The existing GCC source code is publicly available. Why not improve it? (or re-write completely if applicable)

The 6502 CPU was a lovely chip to work with. OTOH, some instructions were fairly slow, especially with the more esoteric addressing modes. However it was the addressing modes that made it so nice to work with.

David.

I normally don't need FP, and the only times I have used it was because integer with the needed dynamic range was too slow.
And no, I have no plans of making some general routines that can be slowed down when used with a compiler.

And yes I sometimes use FP to show results in the correct SI units, and not ADC units etc. but then nobody cares about the speed.

Quote:
And no, I have no plans of making some general routines that can be slowed down when used with a compiler.

You know perfectly well that that statement is idiotic.

An internal math function has one or two f-p inputs and an f-p output. GCC passes the arguments in registers, and returns them in registers. I cannot think of any better arrangement.

So these functions are usable with any language or model.

If you think it is important to have faster ASM functions, it is equally applicable to anything else.

David.

Quote:
You know perfectly well that that statement is idiotic.

I can tell that you have never done any real-time programming, and comments like this show the lack of knowledge very well!

An example could be an IIR filter that lives in its own registers.

Quote:

You know perfectly well that that statement is idiotic.

An internal math function has one or two f-p inputs and an f-p output. GCC passes the arguments in registers, and returns them in registers. I cannot think of any better arrangement.

David, try writing test code in GCC that just does an fp * for example, and you'll find that the compiler doesn't just call a single _fpmul but calls a series of functions (can't remember the exact details, but things like _fpnormalise, fptestsign, that kind of thing), so I think Sparrow is right. Even if he provided an optimised core _fpfinallydothemul, it would be watered down by the preceding steps the compiler generates. I suspect you'd need to get to the root of generic GCC's fp handling (not just the AVR support functions) to really have an impact.

And the format/structure of IEEE754 FP doesn't help. If you need speed, keep it in a 5 byte structure.

I will have a look sometime. As a general rule, a multiplication or division does not require a normalise.

OTOH, addition and subtraction are relatively complicated. You have to normalise both before and after.
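
A rough sketch of why add/subtract needs work at both ends, assuming positive normal numbers only and an unpacked format of my own invention (`ufp_t`, `ufp_add` and `ufp_sub` are hypothetical, not avr-libc names). The "before" step aligns the smaller operand to the larger exponent; the "after" step renormalises the result:

```c
#include <stdint.h>

/* Hypothetical unpacked value: mant * 2^(exp - 23), mant in [2^23, 2^24). */
typedef struct {
    int16_t  exp;
    uint32_t mant;
} ufp_t;

/* Shift the smaller operand right so both share the larger exponent. */
static uint32_t ufp_align(ufp_t big, ufp_t small)
{
    int16_t d = big.exp - small.exp;
    return (d < 24) ? (small.mant >> d) : 0; /* bits shifted out are lost */
}

/* a + b for positive normal inputs (truncating, no rounding). */
static ufp_t ufp_add(ufp_t a, ufp_t b)
{
    ufp_t r;
    if (a.exp < b.exp) { ufp_t t = a; a = b; b = t; }
    r.exp  = a.exp;
    r.mant = a.mant + ufp_align(a, b);
    if (r.mant & 0x1000000UL) {   /* carry out of bit 23: one right shift */
        r.mant >>= 1;
        r.exp++;
    }
    return r;
}

/* a - b, assuming a >= b > 0. Cancellation can need many left shifts. */
static ufp_t ufp_sub(ufp_t a, ufp_t b)
{
    ufp_t r;
    r.exp  = a.exp;
    r.mant = a.mant - ufp_align(a, b);
    if (r.mant == 0) {            /* exact zero: nothing to normalise */
        r.exp = 0;
        return r;
    }
    while (!(r.mant & 0x800000UL)) {
        r.mant <<= 1;
        r.exp--;
    }
    return r;
}
```

Note the loop in the subtract: unlike a multiply, the number of shifts depends on how many leading bits cancel.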

From distant memory, f-p on a 6502 used a larger 'unpacked' accumulator for intermediate expressions.

I will shut up for the moment. (until I have investigated some actual AVR code)

All the same, an arithmetic operation is always two inputs producing one output. Assignment of the eventual result to memory is a common task.

David.

Quote:

I will have a look sometime.

here's an example:

```volatile float f, g, h;

int main(void) {
f = g * h;
90:	60 91 04 01 	lds	r22, 0x0104
94:	70 91 05 01 	lds	r23, 0x0105
98:	80 91 06 01 	lds	r24, 0x0106
9c:	90 91 07 01 	lds	r25, 0x0107
a0:	20 91 08 01 	lds	r18, 0x0108
a4:	30 91 09 01 	lds	r19, 0x0109
a8:	40 91 0a 01 	lds	r20, 0x010A
ac:	50 91 0b 01 	lds	r21, 0x010B
b0:	0e 94 65 00 	call	0xca	; 0xca <__mulsf3>
b4:	60 93 00 01 	sts	0x0100, r22
b8:	70 93 01 01 	sts	0x0101, r23
bc:	80 93 02 01 	sts	0x0102, r24
c0:	90 93 03 01 	sts	0x0103, r25
c4:	80 e0       	ldi	r24, 0x00	; 0
c6:	90 e0       	ldi	r25, 0x00	; 0
c8:	08 95       	ret
```

So far so good if that called function were just doing the guts of the fpMul. However:

```000000ca <__mulsf3>:
ca:	0b d0       	rcall	.+22     	; 0xe2 <__mulsf3x>
cc:	78 c0       	rjmp	.+240    	; 0x1be <__fp_round>
ce:	69 d0       	rcall	.+210    	; 0x1a2 <__fp_pscA>
d0:	28 f0       	brcs	.+10     	; 0xdc <__mulsf3+0x12>
d2:	6e d0       	rcall	.+220    	; 0x1b0 <__fp_pscB>
d4:	18 f0       	brcs	.+6      	; 0xdc <__mulsf3+0x12>
d6:	95 23       	and	r25, r21
d8:	09 f0       	breq	.+2      	; 0xdc <__mulsf3+0x12>
da:	5a c0       	rjmp	.+180    	; 0x190 <__fp_inf>
dc:	5f c0       	rjmp	.+190    	; 0x19c <__fp_nan>
de:	11 24       	eor	r1, r1
e0:	a2 c0       	rjmp	.+324    	; 0x226 <__fp_szero>
```

I won't quote the body of all those but I think you get the idea?

`=============================================================================`

Quote:
I will have a look sometime. As a general rule, a multiplication or division does not require a normalise.

??? for the input yes but not for the result:

FP Mul is [0.5 .. 1.0[ * [0.5 .. 1.0[ = [0.25 .. 1[

So as a general rule you have to normalise 50% of the time.

And for division the result can be worse!
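
The interval argument for the multiply can be sketched in C. This assumes 24-bit mantissas stored as integers with bit 23 set, so that m / 2^24 lies in [0.5 .. 1.0[ (`mant_mul` is a hypothetical name, and the result is truncated rather than rounded):

```c
#include <stdint.h>

/* Hypothetical helper: multiply two normalised 24-bit mantissas.
   Returns the normalised 24-bit product mantissa and reports in *shifts
   how many left shifts were needed (0 or 1). */
static uint32_t mant_mul(uint32_t a, uint32_t b, uint8_t *shifts)
{
    uint64_t p = (uint64_t)a * b;    /* p / 2^48 lies in [0.25 .. 1.0[ */
    *shifts = 0;
    if ((p & (1ULL << 47)) == 0) {   /* product fell below 0.5 */
        p <<= 1;
        *shifts = 1;                 /* caller decrements the exponent */
    }
    return (uint32_t)(p >> 24);      /* keep the top 24 bits */
}
```

So 0.75 * 0.75 = 0.5625 needs no shift, while 0.5 * 0.5 = 0.25 needs exactly one.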

Yup. I had a look at libc-1.4.6 source code which happened to be on my PC. Yours is much the same, i.e. it splits the float into an internal representation, does the mulsf3 and puts it back again.

I presume that it does complex expressions in the internal format. No doubt you would choose to use this internal representation yourself if you want to write some new f-p functions in ASM.

But at some stage you have to put the internal representation back into the SRAM memory. This applies to ASM programs too. (unless you are going to use 5 or 6 bytes storage for single-precision floats)

David.

If you want the ASM speed it's better to use a 2's complement FP, so it's faster to convert to and from "normal" integer numbers.

Quote:
unless you are going to use 5 or 6 bytes storage for single-precision floats

4 extra clk for load/store means nothing compared to the calc time!

You can use whatever format or size accumulator that you like for internal calculations.

To conform to the IEEE format for storage is only required at the beginning and end of a calculation.

Incidentally, I timed some IAR C statements. They take a similar time to CV.

I would be interested in your opinions about achievable C times. There are effectively 3 components to a calculation:
1. unpacking into a suitable format for efficient calculation.
2. performing the calculation.
3. normalising and converting back to the IEEE storage format.

I would guess that (3) has the most scope for improvement.
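
To make the three components concrete, here is a minimal sketch in portable C rather than AVR ASM. It assumes `float` is IEEE 754 single precision and handles normal numbers only: no zero, subnormals, infinities, NaN, rounding or exponent overflow. The names `unpacked_t`, `fp_unpack`, `fp_pack` and `fp_mul_sketch` are hypothetical and are not the avr-libc internals:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical unpacked working format. */
typedef struct {
    uint8_t  sign;      /* 0 or 1                                    */
    int16_t  exponent;  /* unbiased exponent                         */
    uint32_t mantissa;  /* 24 bits, implicit leading 1 made explicit */
} unpacked_t;

/* Step 1: unpack the IEEE storage format. */
static unpacked_t fp_unpack(float f)
{
    uint32_t bits;
    unpacked_t u;
    memcpy(&bits, &f, sizeof bits);
    u.sign     = (uint8_t)(bits >> 31);
    u.exponent = (int16_t)((bits >> 23) & 0xFF) - 127;
    u.mantissa = (bits & 0x7FFFFFUL) | 0x800000UL;
    return u;
}

/* Step 3: convert back to the IEEE storage format. */
static float fp_pack(unpacked_t u)
{
    uint32_t bits = ((uint32_t)u.sign << 31)
                  | ((uint32_t)(u.exponent + 127) << 23)
                  | (u.mantissa & 0x7FFFFFUL);
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Steps 1 to 3 for a multiply (truncating). */
static float fp_mul_sketch(float x, float y)
{
    unpacked_t a = fp_unpack(x), b = fp_unpack(y), r;
    uint64_t p = (uint64_t)a.mantissa * b.mantissa; /* step 2: 48-bit product */
    r.sign     = a.sign ^ b.sign;
    r.exponent = a.exponent + b.exponent;
    if (p & (1ULL << 47)) {      /* product in [2, 4): normalise once */
        p >>= 1;
        r.exponent++;
    }
    r.mantissa = (uint32_t)(p >> 23);
    return fp_pack(r);
}
```

In this sketch the unpack and repack are just shifts and masks; the special-case handling left out here is where a real library spends extra cycles.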

David.