Time taken for a mathematical operation


Quote:

Well, I posted what I used above. Cliff also did something very similar.

David,

If you are going to do it, use what Lee posted as the baseline. It would be good if there was an across-the-compilers version - Lee's code is very close.

HOWEVER note that the GCC optimiser is going to throw away almost everything as it's full of pointless loops. I guess the solution is to make all the accessed variables 'volatile' but the question then arises whether this is a valid benchmark as you would not typically do that in a "real" app and you'd let the optimiser throw away anything that looked pointless. So it'll be showing GCC in a reduced light. The alternative is to not make them volatile, let the optimiser discard them and we can all sit back and say "isn't GCC amazing - it's incredibly fast!" ;-)


It doesn't recompile the math libs when you compile with optimization off, so just run it with no optimization. The loops are what even out the data dependency in the calcs. I thought it was Real Clever.

Imagecraft compiler user


I guess I can't get my brain around how one could build a benchmark without optimization but I take your point about the optimized lib code.


Quote:

I guess the solution is to make all the accessed variables 'volatile' but the question then arises whether this is a valid benchmark as you would not typically do that in a "real" app and you'd let the optimiser throw away anything that looked pointless. So it'll be showing GCC in a reduced light. The alternative is to not make them volatile, let the optimiser discard them and we can all sit back and say "isn't GCC amazing - it's incredibly fast!" Wink

I thought about that a bit. One could make a dummy assignment of the "result" of the last calculation in the loop to a dummy volatile variable after the loop is over. Now, if the GCC optimizer is way clever it would just do the last run through the loop... but that would be WAY clever.
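Something along these lines, perhaps (just a sketch of the idea; the loop body and the 'sink' name are mine, not from any posted code):

volatile float sink;                 /* stores to a volatile cannot be thrown away */

float z = 0;
for (int i = 0; i < 1000; i++) {
    float x = i;
    float y = 1000 - i;
    z = x * y;                       /* the operation under test */
}
sink = z;                            /* one dummy store after the loop keeps z "live" */
/* caveat from above: a really clever optimiser could still compute only the last iteration */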

As benchmarks go, this exercise lacks--agreed. But as David said

Quote:
Meanwhile the current number bandying is fairly pointless.
:twisted: That said, people here would be interested in the relative cost of FP primitives and representative functions such as sin() and log() and ... .

Besides the practical value of knowing the costs in various toolchains, a round of Compiler Wars is a welcome summer diversion.

Oh, BTW, speaking of number-bandying...Bob said

Quote:
I changed to using timer1 with ps=3 so its counting 1us clocks, but more importantly, its only interrupting every 65536 usecs, and my results are Real Close to yours.

Now, from the earlier posts I got the impression that Bob was running on a Mega32 at 16MHz. (some earlier indications were a Mega128, but no matter in what follows...)

1us per timer tick, eh? /16 prescaler? 16-bit timers on Mega32 (or Mega128) don't have a /16 prescaler option. For a "ps=3" that is /64 so your final results would then be 1/4 the operations per second that the printout showed? I think the steroids are kicking in again. [David, that is exactly why I keep saying that the Emperor has no clothes...]

Lee

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.


Quote:

Now, from the earlier posts I got the impression that Bob was running on a Mega32 at 16MHz.

I was led to believe it was actually 18.432MHz, running a mega32 out of spec.


Quote:
OK, here's my results from my 16MHz mega32, using 100usec interrupt.

resultsJuly5.txt

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.


Yet the .c he posted and I did battle with seemed to be setting the UART and timers as if it were 18.432. What's more, it used hard-coded numbers rather than something calculated from F_CPU. I do hope Bob remembers to update those numbers each time he changes clock speed (though I guess the UART one will "bite").


I appreciate the feedback. Having several sets of eyes look at the timer init keeps the numbers correct. My previous slow results were on a mega32, 16MHz xtal, and timer0 interrupting every 100usec on a compare. The newer faster results were the result of using timer1 with a 3 in the ps selector and ovf int enabled. How fast does that one count? 250KHz, 4usec? If so, I guess those last results were 4x too big. Sorry. F_CPU is a gcc thing. Doesn't help me. I pick 16MHz from a pulldown in the IDE and run the AppBuilder. That's what spits out the timer init numbers. I guess you have to trust it or check up on it?

Imagecraft compiler user


Quote:

F_CPU is a gcc thing.

No it isn't. Any AVR program on any C compiler can do:

#define F_CPU 1843200UL

if it chooses to. It doesn't have to be F_CPU - that particular name just happens to be the one the AVR-LibC authors chose way back when. Call it:

#define MY_CLOCK_IS_RUNNING_AT 1843200UL

if you prefer, but unless your build environment already provides such a macro, defining one is a "Good Idea(tm)": when you change the crystal or the CKSEL fuses you change one number in one place for the entire project, and all your timers and UARTs and anything else using numbers based on the core CPU speed fall into line.
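For example (just a sketch; the baud rate and tick period below are made-up numbers, not from anyone's project):

#define F_CPU   16000000UL   /* the ONE number you edit when the clock changes */
#define BAUD    9600UL
#define TICK_HZ 10000UL      /* 100us timer tick */

/* everything else is derived at compile time */
#define UBRR_VAL ((F_CPU / (16UL * BAUD)) - 1UL)       /* UART, normal speed mode */
#define OCR_VAL  ((F_CPU / (8UL * TICK_HZ)) - 1UL)     /* CTC compare with a /8 prescaler */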

Quote:

That's what spits out the timer init numbers

With a comment I hope! But it's not the best solution, for the reason I just gave. Having one system-wide macro from which everything is derived (at compile time) makes it less likely that you forget to fix up one of the calculations when you change the clock.


Taking Bob's latest test run numbers and correcting for the 4us timer tick, the results end up entirely consistent with his earlier numbers with the fast timer tick. (Servicing the fast timer tick was <10% effect.)

        CodeVision 2.04.5       ImageCraft 7.23       ImageCraft using timer
        OPS/                    OPS/            4us      NET     CYCLES/ OPS/
TEST    MHz                     MHz             Ticks    CYCLES  OP      MHz 
                                                                             
ADD      8586                    2693             5553   355392    355   2814
MULT     5063                    2765             5426   347264    347   2880
MULT1    5132                    2500             5999   383936    384   2605
MULT0   21739                    5208             2860   183040    183   5463
DIV      1311                     993            15094   966016    966   1035
SIN       310                      97           154638  9896832   9897    101
LOG       282                     108           138253  8848192   8848    113

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.


Quote:

Taking Bob's latest test run numbers

That wouldn't be the set that now contains the edit:
Quote:
Wednesday: these are bogus results because I was calculating the timer count wrong. Disregard.

would it?


Quote:

Quote:

Taking Bob's latest test run numbers

That wouldn't be the set that now contains the edit:
Quote:
Wednesday: these are bogus results because I was calculating the timer count wrong. Disregard.

would it?


Yes, it is. As I mentioned, I accounted for the 4us tick and assumed the tick count was correct. After letting the spreadsheet grind, the results are entirely consistent with the earlier normalized results. And they indicate that rather than the steroids taking effect overnight to give a 4x boost, it was just PhotoShop.

Lee

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.


So CV is 3X faster. I'll be darned. I guess I get 45 flops per ms and you get 137 flops per ms at 16MHz. I only get 1 or 2 sines per ms, you get 3X that. I hope this info gives someone an idea of how to account for fp calc time. Thanks for checking my numbers.

Imagecraft compiler user


These are some results in a tabular form. I agree with Lee. It is only sensible when normalised to OPS/MHz.
Note that the GCC code is only really comparable with 'volatile' global variables. Even so, f-p operations are heavily dependent on the actual values being used.

Hello CodeVision v2.04.9a running @16.000MHz
  cycles     secs     flops  OPS/MHz [iter] operation
  581541 0.036346   27513.1   1719.6 [1000] overhead loops
  698114 0.043632   22918.9   1432.4 [1000] fp adds
  781203 0.048825   20481.2   1280.1 [1000] fp mults
  786550 0.049159   20342.0   1271.4 [1000] fp mults by 1
  637593 0.039850   25094.4   1568.4 [1000] fp mults by 0
 1279253 0.079953   12507.3    781.7 [1000] fp divs
 1289148 0.080572   12411.3    775.7 [1000] fp div by 1
 4011333 0.250708    3988.7    249.3 [1000] sin
 4352420 0.272026    3676.1    229.8 [1000] log
 8834207 0.552138    1811.1    113.2 [1000] sqrt
 8310156 0.519385    1925.4    120.3 [1000] pow
  172245 0.010765   92890.9   5805.7 [1000] y=mx+b using longs
  895764 0.055985   17861.8   1116.4 [1000] y=mx+b using floats
10329858 0.645616       1.5      0.1 [   1] memcpy 1MB

Hello GCC 20100110 running @16.000MHz
  cycles     secs     flops  OPS/MHz [iter] operation
  213522 0.013345   74933.7   4683.4 [1000] overhead loops
  249239 0.015577   64195.4   4012.2 [1000] fp adds
  277591 0.017349   57638.8   3602.4 [1000] fp mults
  147925 0.009245  108162.9   6760.2 [1000] fp mults by 1
  282588 0.017662   56619.5   3538.7 [1000] fp mults by 0
  684928 0.042808   23360.1   1460.0 [1000] fp divs
  147925 0.009245  108162.9   6760.2 [1000] fp div by 1
 1983584 0.123974    8066.2    504.1 [1000] sin
 2477275 0.154830    6458.7    403.7 [1000] log
  688736 0.043046   23231.0   1451.9 [1000] sqrt
  557606 0.034850   28694.1   1793.4 [1000] pow
  173771 0.010861   92075.2   5754.7 [1000] y=mx+b using longs
  405890 0.025368   39419.5   2463.7 [1000] y=mx+b using floats
 8002770 0.500173       2.0      0.1 [   1] memcpy 1MB

Hello GCC 20100110 volatile running @16.000MHz
  cycles     secs     flops  OPS/MHz [iter] operation
  203926 0.012745   78459.8   4903.7 [1000] overhead loops
  245643 0.015353   65135.2   4070.9 [1000] fp adds
  339592 0.021225   47115.4   2944.7 [1000] fp mults
  203926 0.012745   78459.8   4903.7 [1000] fp mults by 1
  338589 0.021162   47254.9   2953.4 [1000] fp mults by 0
  681332 0.042583   23483.4   1467.7 [1000] fp divs
  269523 0.016845   59364.1   3710.3 [1000] fp div by 1
 2041591 0.127599    7837.0    489.8 [1000] sin
 2469685 0.154355    6478.6    404.9 [1000] log
  746743 0.046671   21426.4   1339.1 [1000] sqrt
  617610 0.038601   25906.3   1619.1 [1000] pow
  218818 0.013676   73120.1   4570.0 [1000] y=mx+b using longs
  465907 0.029119   34341.6   2146.4 [1000] y=mx+b using floats
 8068306 0.504269       2.0      0.1 [   1] memcpy 1MB

This is my 'adjusted' version of Bob's code. I was running on a 16MHz mega168.

#include <stdio.h>              /* printf */
#include <math.h>               /* sin, log, sqrt, pow */
// this header handles system includes and F_CPU, ISR() macros
#include "../scripts/portab_kbv.h"
#if defined(__CODEVISIONAVR__)
#define TIMER1_OVF_vect	TIM1_OVF
#define COMPILER "CodeVision v2.04.9a"
#elif defined(__IMAGECRAFT__)
#define TIMER1_OVF_vect	iv_TIM1_OVF
#define COMPILER "ImageCraft v7"
#elif defined(__GNUC__)
#define COMPILER "GCC 20100110 volatile"
#elif defined(__IAR_SYSTEMS_ICC__)
#define COMPILER "IAR 5.3"
#endif

//-----globals----------
unsigned int tofs;
unsigned long t1, t2;
unsigned long dt, ohdt, net;
volatile int n, i, j;
volatile float x, y, z, m, b;
volatile long ix, iy, iz, im, ib;
float sec, flops;
char buf1[32], buf2[32];         // tiny buffers cos m168 has little SRAM

#include <string.h>              /* memcpy */

extern void initstdio(void);     // start up stdio to USART @ 9600 baud
float secs, secpertic = 1.0 / F_CPU;
unsigned long overflows;

unsigned long gettics(void)
{
    unsigned long time;
    CLI();
    TCCR1B = 0;                  // stop timer
    time = overflows + TCNT1L;
    time += (TCNT1H<<8u);
    TCCR1B = (1<<CS10);          // start again
    SEI();                       // with overflows
    return time;
}

ISR(TIMER1_OVF_vect)
{
    overflows += 65536u;
}

// yes,  I know that there are function overheads for both report() and gettics()
// I could use a macro to stop/start Timer1 if anyone is worried.
void report(char *title)
{
        t2 = gettics();
        dt = t2 - t1;
        net = dt - ohdt;
        secs = net * secpertic;
        flops = n / secs;
        printf("% 8lu % 8.6f % 9.1f % 8.1f [% 4d] % s\r\n", net, secs, flops, flops * 1e6 / F_CPU, n, title);
}

void main(void)
{
    //fpbench main program
    volatile char c;
    initstdio();
    TCCR1B = (1<<CS10);
    TIMSK1 = (1<<TOIE1);
    SEI();
    printf("\r\nHello " COMPILER " running @% 6.3fMHz\r\n", 0.000001 * F_CPU);
    c = 0;
    n = 1000;

    while (c != 'q') {
        n = 1000;
        printf("  cycles     secs     flops  OPS/MHz [iter] operation\r\n");
        t1 = gettics();
        for (i = 0; i < n; i++) {
            j = n - i;
            x = i;
            y = j;
            z = x;
        }
        report("overhead loops");

        t1 = gettics();
        for (i = 0; i < n; i++) {
            j = n - i;
            x = i;
            y = j;
            z = x + y;
        }
        report("fp adds");

        t1 = gettics();
        for (i = 0; i < n; i++) {
            j = n - i;
            x = i;
            y = j;
            z = x * y;
        }
        report("fp mults");

        t1 = gettics();
        for (i = 0; i < n; i++) {
            j = n - i;
            x = i;
            y = j;
            z = x * 1;
        }
        report("fp mults by 1");

        t1 = gettics();
        for (i = 0; i < n; i++) {
            j = n - i;
            x = i;
            y = j;
            z = x * 0;
        }
        report("fp mults by 0");

        t1 = gettics();
        for (i = 0; i < n; i++) {
            j = n - i;
            x = i;
            y = j;
            z = x / y;
        }
        report("fp divs");

        t1 = gettics();
        for (i = 0; i < n; i++) {
            j = n - i;
            x = i;
            y = j;
            z = x / 1.0;        //this didnt work with 1
        }
        report("fp div by 1");

        t1 = gettics();
        for (i = 0; i < n; i++) {
            j = n - i;
            x = i;
            y = j;
            z = sin(x);
        }
        report("sin");

        t1 = gettics();
        for (i = 0; i < n; i++) {
            j = n - i;
            x = i;
            y = j;
            z = log(x);
        }
        report("log");

        t1 = gettics();
        for (i = 0; i < n; i++) {
            j = n - i;
            x = i;
            y = j;
            z = sqrt(x);
        }
        report("sqrt");

        t1 = gettics();
        for (i = 0; i < n; i++) {
            j = n - i;
            x = i;
            y = j;
            z = pow(x, .5);
        }
        report("pow");

        t1 = gettics();
        ib = 1;
        for (i = 0; i < n; i++) {
            j = n - i;
            ix = i;
            im = j;
            iy = im * ix + ib;
        }
        report("y=mx+b using longs");

        t1 = gettics();
        b = 1.0;
        for (i = 0; i < n; i++) {
            j = n - i;
            x = i;
            m = j;
            y = m * x + b;
        }
        report("y=mx+b using floats");

        n = 1;
        t1 = gettics();
        for (i = 0; i < n; i++) {
            for (j = 0; j < 32767; j++) {
                memcpy(buf2, buf1, sizeof(buf1));   //took 5 sec for 10 megs... about 2 megs/sec
            }
        }
        report("memcpy 1MB");
        c = 'q';
    }
}

Yes. It would be very wise to re-write the 'tests' to confound any optimiser.
No. I cannot compile with either IAR or ImageCraft cos I only have evaluation versions.

David.


Hmmm--I'll have to poke at this, and see why your CV numbers don't mesh with mine using the simulator.

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.


Personally, I would reset the timer before each test. You run a certain risk of granularity with the overflow ISR(). i.e. if your function just happens to start or finish at an overflow.

I also see no great point in counting to the nearest cycle since you are already doing 1000 iterations.

In fact I have always just timed a single operation. Providing you stop the timer immediately, you can take as long as you like to process any results.

No. I cannot understand the difference in results. I am fairly confident that I have my calculations ok. OTOH, the timed 'sequences' are fairly pointless. I also compiled for a mega128. The result is about the same; the stupid loops and the arithmetic are slightly different.


Hello CodeVision v2.04.9a running @16.000MHz
  cycles     secs     flops  OPS/MHz [iter] operation
  709154 0.044322   22562.1   1410.1 [1000] overhead loops
  694603 0.043413   23034.7   1439.7 [1000] fp adds
  777692 0.048606   20573.7   1285.9 [1000] fp mults
  848627 0.053039   18854.0   1178.4 [1000] fp mults by 1
  765206 0.047825   20909.4   1306.8 [1000] fp mults by 0
 1341330 0.083833   11928.5    745.5 [1000] fp divs
 1416761 0.088548   11293.4    705.8 [1000] fp div by 1
 4073410 0.254588    3927.9    245.5 [1000] sin
 4348909 0.271807    3679.1    229.9 [1000] log
 8896284 0.556018    1798.5    112.4 [1000] sqrt
 8241109 0.515069    1941.5    121.3 [1000] pow
  242322 0.015145   66027.8   4113.2 [1000] y=mx+b using longs
  965841 0.060365   16565.9   1035.4 [1000] y=mx+b using floats
10887433 0.680465       1.5      0.1 [   1] memcpy 1MB

Of course Pavel may have changed the compiler between versions. Both my mega168 and mega128 builds are small model, min size, max optimisation.

David.


Quote:

765206 0.047825 20909.4 1306.8 [1000] fp mults by 0

Besides just the general differences by a factor between your results and mine with the simulator (no >>wonder<< Bob doesn't like to use the simulator :twisted: ), this particular one makes no sense. In all my runs since 2004 on this "fpbench" it appeared as though MULT0 was a special case, running at about 4x the speed of a normal mult. Your results are [roughly] the same as, and even worse than, "normal".

Gonna have to set this up...

Lee

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.


I am going down to the pub.

I might change the 't1= gettics()' to 'stop();TCNT1=0;overflows=0;start();'
This would alleviate the overflows. (as would prescaling the timer)
Likewise, what is the point of 1000 iterations?
Nothing overflows with n=1.

I may re-arrange tomorrow so that I can compile with 4k IAR or ICC evaluations.

David.


The loop across n is the clever part I invented. See how it multiplies 1 x 999, then 2 x 998, then 3 x 997? This eliminates/averages out the speed dependency on the data.

Imagecraft compiler user


David--

Didja get this warning--

Quote:
Warning: C:\AtmelC\dp.c(101): overflow is possible in 16 bit shift left, casting shifted operand to 'long' may be required

    time += (TCNT1H<<8u);

First run results are below, and they are consistent with your CV results. Mega1280 running at 7.3728MHz, using USART1 in RS485 mode. ;) Now, why are they so far off from my cycle-counting...?

Hello CodeVision v2.04.5 running @ 7.373MHz
  cycles     secs     flops  OPS/MHz [iter] operation
  589535 0.079961   12506.1   1696.3 [ 1000] overhead loops
  772646 0.104797    9542.3   1294.3 [ 1000] fp adds
  788199 0.106906    9354.0   1268.7 [ 1000] fp mults
  785546 0.106546    9385.6   1273.0 [ 1000] fp mults by 1
  636589 0.086343   11581.7   1570.9 [ 1000] fp mults by 0
 1288249 0.174730    5723.1    776.2 [ 1000] fp divs
 1355680 0.183876    5438.5    737.6 [ 1000] fp div by 1
 3821205 0.518284    1929.4    261.7 [ 1000] sin
 4137663 0.561206    1781.9    241.7 [ 1000] log
 8836203 1.198487     834.4    113.2 [ 1000] sqrt
 7769100 1.053752     949.0    128.7 [ 1000] pow
  204239 0.027702   36098.9   4896.2 [ 1000] y=mx+b using longs
  839168 0.113819    8785.8   1191.7 [ 1000] y=mx+b using floats
10559504 1.432224       0.7      0.1 [   1] memcpy 1MB

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.


OK, it makes more sense now. I think you are missing a line--"ohdt" was never set so the counts are the overhead >>plus<< the time for the operations.

        report("overhead loops");
        ohdt = dt;

...and then I get

Hello CodeVision v2.04.5 running @ 7.373MHz
  cycles     secs     flops  OPS/MHz [iter] operation
  589535 0.079961   12506.1   1696.3 [ 1000] overhead loops
  183111 0.024836   40264.1   5461.2 [ 1000] fp adds
  198664 0.026946   37111.9   5033.6 [ 1000] fp mults
  196011 0.026586   37614.2   5101.8 [ 1000] fp mults by 1
4294948762 582.539672       1.7      0.2 [ 1000] fp mults by 0
  698714 0.094769   10552.0   1431.2 [ 1000] fp divs
  700609 0.095026   10523.4   1427.3 [ 1000] fp div by 1
 3297258 0.447219    2236.0    303.3 [ 1000] sin
 3548128 0.481246    2077.9    281.8 [ 1000] log
 8246668 1.118526     894.0    121.3 [ 1000] sqrt
 7179565 0.973791    1026.9    139.3 [ 1000] pow
4294582000 582.489929       1.7      0.2 [ 1000] y=mx+b using longs
  249633 0.033859   29534.6   4005.9 [ 1000] y=mx+b using floats
 9969969 1.352264       0.7      0.1 [   1] memcpy 1MB

The anomalies happen when, e.g. "using longs", the time ends up less than the overhead -- net is an unsigned long, so the "negative" difference wraps around to roughly 4.29e9 cycles. Got to look at that MULT0 a bit.

But now I don't really care too much. The numbers in general for the operations are consistent with the ones I posted earlier.

Did the GCC version have the overhead subtracted?


You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.


Thanks for the comparison work folks. I think we are getting good results that we can quote with confidence now. The original question was about how to get a sin function either faster or more accurately (can't have both!). In the lo-res monochrome graphics I've done, rotating a box on a 240x160 screen looked ok with a 1 degree table; each table entry was a signed int, representing a fraction 0 to .9999. The SGI workstations running OpenGL used a tenth-degree table, giving 3600 angles per circle, but they had plenty of RAM. If the Original Poster reads this, I hope it answers the original question.

Imagecraft compiler user


Quote:

In the lo-res monochrome graphics I've done, rotating a box on a 240x160 screen looked ok with a 1 degree table; each table entry was a signed int, representing a fraction 0 to .9999. The SGI workstations running OpenGL used a tenth-degree table, giving 3600 angles per circle, but they had plenty of RAM. If the Original Poster reads this, I hope it answers the original question.

I'd have to dig out some >>real<< old stuff which I don't think I have anymore, but what works quite well for rotating a wire-frame is to examine the transformation equation and look for simplification. IIRC the key was that the cos of the rotation angle was an important part of that equation, and cos and cos*cos of a small angle are so close to 1 that you just use 1, and then the calculations get much faster.

When you finally stop and settle, then you have the time to redraw it "right".
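Roughly this sort of thing (my own sketch of the idea, not the original code):

/* Rotate a point incrementally by a small angle 'a' (radians).
   For small a, cos(a) ~= 1 and sin(a) ~= a, so
       x' = x*cos(a) - y*sin(a),  y' = x*sin(a) + y*cos(a)
   collapses to two multiply-adds per point and no trig calls at all. */
static void rotate_small(float *x, float *y, float a)
{
    float nx = *x - *y * a;
    float ny = *y + *x * a;
    *x = nx;
    *y = ny;
}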

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.


Quote:
OK, it makes more sense now. I think you are missing a line--"ohdt" was never set so the counts are the overhead >>plus<< the time for the operations.

Yup. That is the cause of the anomaly. I ignored the 'loop overhead' which is actually a significant proportion of the execution time.

IMHO, you need to devise tests that do not rely on subtracting two similar times. And quite honestly you can time an individual C statement easily. Just start() and stop() the Timer1 (effectively SBI and CBI). You can still accumulate the times for several input values, and of course separate functions and automatic casts.

The extended loop idea is best suited to a low-resolution system timer. Years ago, I used to profile 68000 programs using just a 200Hz system tick.

Since you can count AVR cycles exactly with Timer1, there is no need to do the statistical approach.

These results are obtained by adding the 'ohdt=dt;' loop overhead statement, and removing the 'z = x;' from the overhead. This makes the ohdt subtraction give +ve answers except in the 'y=mx+b using longs' test.


Hello GCC 20100110 volatile running @16.000MHz
  cycles     secs     flops  OPS/MHz [iter] operation
  187921 0.011745   85142.2   5321.4 [1000] overhead loops
  123314 0.007707  129750.1   8109.4 [1000] fp adds
  151666 0.009479  105495.0   6593.4 [1000] fp mults
   16061 0.001004  996201.9  62262.6 [1000] fp mults by 1
   85066 0.005317  188089.3  11755.6 [1000] fp mults by 0
  493406 0.030838   32427.7   2026.7 [1000] fp divs
   16000 0.001000  999999.9  62500.0 [1000] fp div by 1
 1853665 0.115854    8631.5    539.5 [1000] sin
 2347356 0.146710    6816.2    426.0 [1000] log
  493220 0.030826   32439.9   2027.5 [1000] sqrt
  495220 0.030951   32308.9   2019.3 [1000] pow
4294932591 268.433290       3.7      0.2 [1000] y=mx+b using longs
  277981 0.017374   57557.9   3597.4 [1000] y=mx+b using floats
 7880441 0.492528       2.0      0.1 [   1] memcpy 1MB

Hello ImageCraft v8.02 running @    16MHz
  cycles     secs     flops  OPS/MHz [iter] operation
  654909 0.040932   24430.9   1526.9 [1000] overhead loops
  372786 0.023299   42920.1   2682.5 [1000] fp adds
  364626 0.022789   43880.6   2742.5 [1000] fp mults
  401324 0.025083     39868   2491.8 [1000] fp mults by 1
  200217 0.012514   79913.3   4994.6 [1000] fp mults by 0
  984073 0.061505     16259   1016.2 [1000] fp divs
 1028390 0.064274   15558.3    972.4 [1000] fp div by 1
 8149031 0.509314    1963.4    122.7 [1000] sin
 7693782 0.480861    2079.6      130 [1000] log
 6830648 0.426915    2342.4    146.4 [1000] sqrt
18978351 1.186147     843.1     52.7 [1000] pow
4294580773 268.411254       3.7      0.2 [1000] y=mx+b using longs
  824758 0.051547   19399.6   1212.5 [1000] y=mx+b using floats
 9382970 0.586436       1.7      0.1 [   1] memcpy 1MB

Hello CodeVision v2.04.9a running @16.000MHz
  cycles     secs     flops  OPS/MHz [iter] operation
  611565 0.038223   26162.4   1635.1 [1000] overhead loops
  148625 0.009289  107653.5   6728.3 [1000] fp adds
  231714 0.014482   69050.6   4315.7 [1000] fp mults
  237061 0.014816   67493.2   4218.3 [1000] fp mults by 1
   88052 0.005503  181710.8  11356.9 [1000] fp mults by 0
  795352 0.049709   20116.9   1257.3 [1000] fp divs
  739607 0.046225   21633.1   1352.1 [1000] fp div by 1
 3454844 0.215928    4631.2    289.4 [1000] sin
 3730395 0.233150    4289.1    268.1 [1000] log
 8277666 0.517354    1932.9    120.8 [1000] sqrt
 7688079 0.480505    2081.1    130.1 [1000] pow
4294598052 268.412353       3.7      0.2 [1000] y=mx+b using longs
  346280 0.021642   46205.4   2887.8 [1000] y=mx+b using floats
10275872 0.642242       1.6      0.1 [   1] memcpy 1MB

I downloaded the current ICCAVR v8. The comparison with avr-gcc is revealing! If anyone has an IAR licence, we could see how effective IAR is.

Quite honestly, these tests are not the best way of timing C library functions. OTOH, they satisfy Bob's methodology.

David.


But now, when someone says "I'm trying to run my autopilot at 200Hz, and it looks like I need to do about 300 fp ops. How can I tell if it busts my time budget?" we can tell him he can do 100 fp adds and mults per ms with GCC, so he has 3ms worth of fp calcs in his 5ms loop.

Imagecraft compiler user


Those results definitely confirm my reservations about ICC ;-)


I re-wrote the fpbench program to cycle-count in a more sensible way. i.e. count cycles for a specific sequence.

#define INIT_TIMER()     TCCR1A = 0; TIMSK = (1<<TOIE1)
#define START_TIMER()    TCCR1B = (1<<CS10)
#define STOP_TIMER()     TCCR1B = 0
#define CYCLE_COUNT(seq) {START_TIMER(); seq; STOP_TIMER();}
...
        for (i = 0; i < n; i++) {
            j = n - i;
            x = i;
            y = j;
            CYCLE_COUNT(z = sqrt(x));
        }
        report("sqrt");

I have made no attempt to subtract the 'counting' overhead. As you can see, it is only 1 or 2 cycles per iteration anyway.

Hello ImageCraft v8.02 running @    16MHz
  cycles     secs     flops  OPS/MHz [iter] operation
    2000 0.000125 8000000.5   500000 [1000] overhead loops
  373523 0.023345   42835.4   2677.2 [1000] fp adds
  365363 0.022835   43792.1     2737 [1000] fp mults
  984502 0.061531   16251.9   1015.7 [1000] fp divs
 7690263 0.480641    2080.6      130 [1000] log
 6827695 0.426731    2343.4    146.5 [1000] sqrt
18968368 1.185523     843.5     52.7 [1000] pow
  197055 0.012316   81195.6   5074.7 [1000] y=mx+b using longs
  825247 0.051578   19388.1   1211.8 [1000] y=mx+b using floats

Hello GCC 20100110 volatile running @16.000MHz
  cycles     secs     flops  OPS/MHz [iter] operation
    1000 0.000063 15999999.0 999999.9 [1000] overhead loops
  124235 0.007765  128788.2   8049.3 [1000] fp adds
  152630 0.009539  104828.7   6551.8 [1000] fp mults
  494280 0.030893   32370.3   2023.1 [1000] fp divs
 2347665 0.146729    6815.3    426.0 [1000] log
  494094 0.030881   32382.5   2023.9 [1000] sqrt
  496094 0.031006   32252.0   2015.7 [1000] pow
   85043 0.005315  188140.1  11758.8 [1000] y=mx+b using longs
  278897 0.017431   57368.9   3585.6 [1000] y=mx+b using floats

Hello CodeVision v2.04.9a running @16.000MHz
  cycles     secs     flops  OPS/MHz [iter] operation
    2000 0.000125 7999999.5 500000.0 [1000] overhead loops
  142573 0.008911  112223.2   7014.0 [1000] fp adds
  223662 0.013979   71536.5   4471.0 [1000] fp mults
  789300 0.049331   20271.1   1266.9 [1000] fp divs
 3716299 0.232269    4305.4    269.1 [1000] log
 8271666 0.516979    1934.3    120.9 [1000] sqrt
 7665039 0.479065    2087.4    130.5 [1000] pow
  118052 0.007378  135533.5   8470.8 [1000] y=mx+b using longs
  332208 0.020763   48162.6   3010.2 [1000] y=mx+b using floats

Note that you can 'count' several statements at once. Simply enclose the argument within braces:

     CYCLE_COUNT({j = n - i; ix = i; im = j;});

In all honesty, this is simpler than using the Simulator. You can 'correct' the cycle count if you are obsessive, and you can time real-life peripherals too. Of course it uses a 16-bit timer, but you can use an 8-bit timer with prescale - you are already counting the overflows, so just change the MACRO()s.
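Something like this for an 8-bit Timer0, say (a sketch only, assuming mega168-style register names; other parts will differ):

/* 8-bit Timer0 with a /64 prescaler: each overflow is 256 * 64 CPU cycles,
   and a TCNT0 read would need scaling by 64 as well */
#define INIT_TIMER0()   TCCR0A = 0; TIMSK0 = (1<<TOIE0)
#define START_TIMER0()  TCCR0B = ((1<<CS01) | (1<<CS00))   /* clk/64 */
#define STOP_TIMER0()   TCCR0B = 0

ISR(TIMER0_OVF_vect)
{
    overflows += 256UL * 64UL;       /* cycles per Timer0 overflow at /64 */
}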

The method works ok with JTAG too.

David.


It must be time to check the generated code!
A FP mul should not even take 50 clk!


I have never really bothered to trace compiler private functions.

From my results, the C statement 'z = x * y' takes 152 cycles with GCC on average.
Bear in mind that you have to push two f-p values onto the stack, evaluate an expression, then store back into the f-p variable's memory.

I can only suggest that you submit some better ASM code to the GCC team.

In the interests of ImageCraft and Codevision, you could offer it to Richard and Pavel too.

It would be nice to think that AVR compilers used the best possible internal math functions. OTOH, this requires some effort. Very few people use f-p, and when they do, it is seldom critical.

The first Z80, 68000, 8086 compilers never had optimum software f-p. Commercial competition ensured that vendors spent money on it.

Nowadays, PCs have hardware f-p. I doubt that anyone would bother writing software f-p from scratch today.

All the same, I would put f-p efficiency way down my list of priorities.

David.


I just find it sad that a 6502 BASIC program can do sin(x) in 40-bit format (that's without a HW MUL!) in about 10000 clk and the fastest AVR code (32-bit FP) is only about 5 times faster!

Quote:
From my results, the C statement 'z = x * y' takes 152 cycles with GCC on average.

I should hope not!
I assume that a big part of the time is converting int to FP (the normalisation takes some time).
The max/min for the FP mul should be very close to the average unless it's a special case like 0 or 1; there are at most 2 shifts in the normalisation of the result.


Quote:

is only about 5 times faster!


It's still just an 8bit micro and a RISC rather than a CISC - why would you expect it to be even as good as a 6502 let alone 5 times faster?


A 6502 is 1/2 MIPS per MHz.
No HW mul.
And it was BASIC, so a big overhead.

Edit: and the format was 32 bit, not 24 bit, for the mul itself.


I thought commercial compilers outperformed the free ones in all aspects (except cost).

George.


@sparrow2,

All the algorithms for software f-p have been known for years. They require little change from one 8-bit CPU to another.

OTOH, I have no interest or inclination to write the required optimal ASM for an AVR core.

If I did have the inclination, then I am sure it is not too difficult. The AVR world is waiting for you to do the work.

The existing GCC source code is freely available. Why not improve it? (Or re-write it completely if applicable.)

The 6502 CPU was a lovely chip to work with. OTOH, some instructions were fairly slow, especially with the more esoteric addressing modes. However it was the addressing modes that made it so nice to work with.

David.


I normally don't need FP, and the only times I have used it were because integer with the needed dynamic range was too slow.
And no, I have no plans of making some general routines that can be slowed down when used with a compiler.

And yes, I sometimes use FP to show results in the correct SI units rather than ADC units etc., but then nobody cares about the speed.


Quote:
And no, I have no plans of making some general routines that can be slowed down when used with a compiler.

You know perfectly well that that statement is idiotic.

An internal math function has one or two f-p inputs and an f-p output. GCC passes the arguments in registers, and returns them in registers. I cannot think of any better arrangement.

So these functions are usable with any language or model.

If you think it is important to have faster ASM functions, it is equally applicable to anything else.

David.


Quote:
You know perfectly well that that statement is idiotic.


I can tell that you have never done any real-time programming, and this kind of comment shows the lack of knowledge very well!

e.g. it could be an IIR filter that lives in its own registers.


Quote:

You know perfectly well that that statement is idiotic.

An internal math function has one or two f-p inputs and an f-p output. GCC passes the arguments in registers, and returns them in registers. I cannot think of any better arrangement.


David, try writing some test code in GCC that just does an fp * for example, and you'll find that the compiler doesn't just call a single _fpmul but calls a chain of functions (can't remember the exact details, but things like _fpnormalise, _fptestsign, that kind of thing), so I think Sparrow is right. Even if he provided an optimised core _fpfinallydothemul it would be watered down by the preceding steps the compiler generates. I suspect you'd need to get to the root of generic GCC's fp handling (not just the AVR support functions) to really have an impact.


And the format/structure of IEEE754 FP doesn't help; if you need speed, keep it in a 5-byte structure.
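Something along the lines of the old 40-bit BASIC layout, say (one possible layout, purely my own illustration; 5 bytes on AVR, where structs have no padding):

#include <stdint.h>

/* a 5-byte "unpacked" float: one exponent byte plus a full 32-bit mantissa,
   so a chain of operations never has to squeeze back into the IEEE754 bit fields */
typedef struct {
    uint8_t  exp;       /* biased exponent; 0 means the value is zero */
    uint32_t mant;      /* mantissa; top bit carries the sign (old MS-BASIC style) */
} fp40;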


I will have a look sometime. As a general rule, a multiplication or division does not require a normalise.

OTOH, addition and subtraction are relatively complicated. You have to normalise both before and after (align the exponents before the add; a near-cancelling result can need many left shifts afterwards).

From distant memory, f-p on a 6502 used a larger 'unpacked' accumulator for intermediate expressions.

I will shut up for the moment. (until I have investigated some actual AVR code)

All the same, an arithmetic operation is always two inputs producing one output. Assignment of the eventual result to memory is a common task.

David.


Quote:

I will have a look sometime.

here's an example:

volatile float f, g, h;

int main(void) {
	f = g * h;
  90:	60 91 04 01 	lds	r22, 0x0104
  94:	70 91 05 01 	lds	r23, 0x0105
  98:	80 91 06 01 	lds	r24, 0x0106
  9c:	90 91 07 01 	lds	r25, 0x0107
  a0:	20 91 08 01 	lds	r18, 0x0108
  a4:	30 91 09 01 	lds	r19, 0x0109
  a8:	40 91 0a 01 	lds	r20, 0x010A
  ac:	50 91 0b 01 	lds	r21, 0x010B
  b0:	0e 94 65 00 	call	0xca	; 0xca <__mulsf3>
  b4:	60 93 00 01 	sts	0x0100, r22
  b8:	70 93 01 01 	sts	0x0101, r23
  bc:	80 93 02 01 	sts	0x0102, r24
  c0:	90 93 03 01 	sts	0x0103, r25
  c4:	80 e0       	ldi	r24, 0x00	; 0
  c6:	90 e0       	ldi	r25, 0x00	; 0
  c8:	08 95       	ret

So far so good if that called function were just doing the guts of the fpMul. However:

000000ca <__mulsf3>:
  ca:	0b d0       	rcall	.+22     	; 0xe2 <__mulsf3x>
  cc:	78 c0       	rjmp	.+240    	; 0x1be <__fp_round>
  ce:	69 d0       	rcall	.+210    	; 0x1a2 <__fp_pscA>
  d0:	28 f0       	brcs	.+10     	; 0xdc <__mulsf3+0x12>
  d2:	6e d0       	rcall	.+220    	; 0x1b0 <__fp_pscB>
  d4:	18 f0       	brcs	.+6      	; 0xdc <__mulsf3+0x12>
  d6:	95 23       	and	r25, r21
  d8:	09 f0       	breq	.+2      	; 0xdc <__mulsf3+0x12>
  da:	5a c0       	rjmp	.+180    	; 0x190 <__fp_inf>
  dc:	5f c0       	rjmp	.+190    	; 0x19c <__fp_nan>
  de:	11 24       	eor	r1, r1
  e0:	a2 c0       	rjmp	.+324    	; 0x226 <__fp_szero>

I won't quote the body of all those but I think you get the idea?



Quote:
I will have a look sometime. As a general rule, a multiplication or division does not require a normalise.


??? For the input yes, but not for the result:

FP Mul is [0.5 .. 1.0[ * [0.5 .. 1.0[ = [0.25 .. 1[

so as a general rule you have to normalise 50% of the time.

And for division the result can be worse!
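In code terms, the post-multiply step is roughly this (a sketch, keeping the [0.5 .. 1.0[ convention above and using 16-bit mantissas just to keep it short):

#include <stdint.h>

/* Multiply two normalised mantissas (MSB set, i.e. value in [0.5 .. 1.0[ ).
   The product lies in [0.25 .. 1.0[, so at most one left shift renormalises it;
   the exponents themselves are simply added elsewhere. */
static uint32_t mant_mul(uint16_t m1, uint16_t m2, int8_t *exp)
{
    uint32_t p = (uint32_t)m1 * m2;     /* 32-bit product */
    if (!(p & 0x80000000UL)) {          /* top bit clear: product below 0.5 */
        p <<= 1;
        (*exp)--;                       /* compensate in the exponent */
    }
    return p;
}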


Yup. I had a look at the avr-libc 1.4.6 source code which happened to be on my PC. Yours is much the same, i.e. it splits the float into an internal representation, does the mulsf3 and puts it back again.

I presume that it does complex expressions in the internal format. No doubt you would choose to use this internal representation yourself if you want to write some new f-p functions in ASM.

But at some stage you have to put the internal representation back into the SRAM memory. This applies to ASM programs too. (unless you are going to use 5 or 6 bytes storage for single-precision floats)

David.


If you want ASM speed it's better to use a 2's-complement FP format, so it's faster to convert to and from "normal" integer numbers.

Quote:
unless you are going to use 5 or 6 bytes storage for single-precision floats

4 extra clk for load/store mean nothing compared to the calc time!


You can use whatever format or size of accumulator you like for internal calculations.

Conforming to the IEEE format for storage is only required at the beginning and end of a calculation.

Incidentally, I timed some IAR C statements. They take a similar time to CV.

I would be interested in your opinions about achievable C times. There are effectively 3 components to a calculation:
1. unpacking into a suitable format for efficient calculation.
2. performing the calculation.
3. normalising and converting back to the IEEE storage format.

I would guess that (3) has the most scope for improvement.
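For what it's worth, here is a bare sketch of (1) and (3) for a single-precision value (my own illustration, not any particular library's internals, and without the rounding that a real step (3) has to do):

#include <stdint.h>
#include <string.h>

/* (1) unpack: split an IEEE754 single into sign / biased exponent / mantissa,
   restoring the implicit leading 1 to give a full 24-bit mantissa */
static void fp_unpack(float f, uint8_t *sign, uint8_t *exp, uint32_t *mant)
{
    uint32_t bits;
    memcpy(&bits, &f, 4);
    *sign = (uint8_t)(bits >> 31);
    *exp  = (uint8_t)(bits >> 23);
    *mant = (bits & 0x007FFFFFUL) | 0x00800000UL;
}

/* (3) pack: the mantissa must already be normalised back to 24 bits */
static float fp_pack(uint8_t sign, uint8_t exp, uint32_t mant)
{
    uint32_t bits = ((uint32_t)sign << 31)
                  | ((uint32_t)exp << 23)
                  | (mant & 0x007FFFFFUL);
    float f;
    memcpy(&f, &bits, 4);
    return f;
}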

David.
