Well, I posted what I used above. Cliff also did something very similar.
David,
If you are going to do it, use what Lee posted as the baseline. It would be good if there was an across-the-compilers version - Lee's code is very close.
HOWEVER note that the GCC optimiser is going to throw away almost everything as it's full of pointless loops. I guess the solution is to make all the accessed variables 'volatile' but the question then arises whether this is a valid benchmark as you would not typically do that in a "real" app and you'd let the optimiser throw away anything that looked pointless. So it'll be showing GCC in a reduced light. The alternative is to not make them volatile, let the optimiser discard them and we can all sit back and say "isn't GCC amazing - it's incredibly fast!" ;-)
It doesn't recompile the math libs when you compile with optimization off, so just run it with no optimization. The loops are what even out the data dependency in the calcs. I thought it was Real Clever.
Imagecraft compiler user
Quote:
I guess the solution is to make all the accessed variables 'volatile' but the question then arises whether this is a valid benchmark as you would not typically do that in a "real" app and you'd let the optimiser throw away anything that looked pointless. So it'll be showing GCC in a reduced light. The alternative is to not make them volatile, let the optimiser discard them and we can all sit back and say "isn't GCC amazing - it's incredibly fast!" ;-)
I thought about that a bit. One could make a dummy assignment of the "result" of the last calculation in the loop to a dummy volatile variable after the loop is over. Now, if the GCC optimizer is way clever it would just do the last pass through the loop... but that would be WAY clever.
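A minimal sketch of that trick, with illustrative names of my own (`fp_sink`, `bench_mult` are not from the benchmark source): the result is stored to a volatile sink after the loop, so the compiler must keep the computation alive.

```c
/* Sketch of the dummy-volatile-assignment idea. Only the final store is
   forced, so a sufficiently clever optimiser could still reduce the loop
   to its last iteration - exactly the caveat raised above. */
static volatile float fp_sink;

float bench_mult(int n)
{
    float z = 0.0f;
    for (int i = 0; i < n; i++) {
        float x = (float)i;
        float y = (float)(n - i);
        z = x * y;              /* the operation being timed */
    }
    fp_sink = z;                /* volatile store keeps z "live" */
    return z;
}
```

The loop body itself is still fair game for the optimiser; only the final value is pinned, which is the compromise being debated here.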
As benchmarks go, this exercise lacks--agreed. But as David said
Quote:
Meanwhile the current number bandying is fairly pointless.
:twisted: That said, people here would be interested in the relative cost of FP primitives and representative functions such as sin() and log() and ... .
Besides the practical value of knowing the costs in various toolchains, a round of Compiler Wars is a welcome summer diversion.
Oh, BTW, speaking of number-bandying...Bob said
Quote:
I changed to using timer1 with ps=3 so its counting 1us clocks, but more importantly, its only interrupting every 65536 usecs, and my results are Real Close to yours.
Now, from the earlier posts I got the impression that Bob was running on a Mega32 at 16MHz. (some earlier indications were a Mega128, but no matter in what follows...)
1us per timer tick, eh? /16 prescaler? 16-bit timers on Mega32 (or Mega128) don't have a /16 prescaler option. For a "ps=3" that is /64 so your final results would then be 1/4 the operations per second that the printout showed? I think the steroids are kicking in again. [David, that is exactly why I keep saying that the Emperor has no clothes...]
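For anyone checking the arithmetic, a throwaway helper (the name is mine, not from the thread's code): tick period is prescale/F_CPU, so a /64 prescale at 16MHz gives a 4us tick, not 1us, and the reported ops/sec come out 4x too high.

```c
/* Microseconds per timer tick for a given CPU clock and prescaler. */
double timer_tick_us(unsigned long f_cpu, unsigned long prescale)
{
    return (double)prescale * 1.0e6 / (double)f_cpu;
}
```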
Lee
You can put lipstick on a pig, but it is still a pig.
I've never met a pig I didn't like, as long as you have some salt and pepper.
Yet the .c he posted and I did battle with seemed to be setting the UART and timers as if it were 18.432. What's more it was hard-coded numbers rather than something calculated from F_CPU. I do hope Bob remembers to update those numbers each time he changes clock speed (though I guess the UART one will "bite").
I appreciate the feedback. Having several sets of eyes look at the timer init keeps the numbers correct. My previous slow results were on a mega32, 16MHz xtal, and timer0 interrupting every 100usec on a compare. The newer faster results were the result of using timer 1 with a 3 in the ps selector and ovf int enabled. How fast does that one count? 250kHz, 4usec? If so, I guess those last results were 4x too big. Sorry. F_CPU is a gcc thing. Doesn't help me. I pick 16MHz from a pulldown in the IDE and run the AppBuilder. That's what spits out the timer init numbers. I guess you have to trust it or check up on it?
No it isn't. Any AVR program on any C compiler can do:
#define F_CPU 1843200UL
if it chooses to. It doesn't have to be F_CPU - that particular name just happens to be the one the AVR-LibC authors chose way back when. Call it:
#define MY_CLOCK_IS_RUNNING_AT 1843200UL
if you prefer, but unless your build environment already provides such a macro then defining one is a "Good Idea(tm)", because when you change the crystal or the CKSEL fuses you change one number in one place for the entire project, and all your timers and UARTs and anything else using numbers based on the core CPU speed fall into line.
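A sketch of the idea with made-up helper names (`ubrr_for`, `ocr_for_us` are illustrative): everything clock-derived is computed from the one macro, so one edit retunes the UART and the timers together. The UBRR formula is the standard one for an AVR USART in normal asynchronous mode.

```c
#define MY_CLOCK_IS_RUNNING_AT 1843200UL

/* Standard async-USART divisor: UBRR = F_CPU/16/baud - 1 */
unsigned int ubrr_for(unsigned long f_cpu, unsigned long baud)
{
    return (unsigned int)(f_cpu / 16UL / baud - 1UL);
}

/* Timer compare value for a given period in microseconds.
   Multiply before the final divide so resolution is not lost
   to integer truncation on slow clocks. */
unsigned int ocr_for_us(unsigned long f_cpu, unsigned long prescale,
                        unsigned long period_us)
{
    return (unsigned int)(f_cpu / prescale * period_us / 1000000UL - 1UL);
}
```

With this arrangement, switching the crystal means changing `MY_CLOCK_IS_RUNNING_AT` once; every derived constant follows at compile time.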
Quote:
That's what spits out the timer init numbers
With a comment I hope! But it's not the best solution for the reason I just gave. Having one system wide defined macro from which everything is derived (at compile time) makes it less likely that you forget to fix up one of the calculations when you change the clock.
Taking Bob's latest test run numbers and correcting for the 4us timer tick, the results end up entirely consistent with his earlier numbers with the fast timer tick. (Servicing the fast timer tick was <10% effect.)
            CodeVision   ImageCraft   ---- ImageCraft 7.23, using 4us timer ----
            2.04.5       7.23
TEST        OPS/MHz      OPS/MHz      4us Ticks   NET CYCLES   CYCLES/OP   OPS/MHz
ADD           8586         2693           5553       355392         355       2814
MULT          5063         2765           5426       347264         347       2880
MULT1         5132         2500           5999       383936         384       2605
MULT0        21739         5208           2860       183040         183       5463
DIV           1311          993          15094       966016         966       1035
SIN            310           97         154638      9896832        9897        101
LOG            282          108         138253      8848192        8848        113
That wouldn't be the set that now contains the edit:
Quote:
Wednesday: these are bogus results because I was calculating the timer count wrong. Disregard.
would it?
Yes, it is. As I mentioned, I accounted for the 4us tick and assumed the tick count was correct. After letting the spreadsheet grind the results are entirely consistent with earlier normalized results. And indicate that rather than the steroids taking effect overnight to give a 4x boost, it was just PhotoShop.
Lee
So CV is 3X faster. I'll be darned. I guess I get 45 flops per ms and you get 137 flops per ms at 16MHz. I only get 1 or 2 sines per ms, you get 3X that. I hope this info gives someone an idea of how to account for fp calc time. Thanks for checking my numbers.
These are some results in a tabular form. I agree with Lee. It is only sensible when normalised to OPS/MHz.
Note that the GCC code is only really comparable with 'volatile' global variables. Even so, f-p operations are heavily dependent on the actual values being used.
This is my 'adjusted' version of Bob's code. I was running on a 16MHz mega168.
#include <stdio.h>
#include <math.h>
// this header handles system includes and F_CPU, ISR() macros
#include "../scripts/portab_kbv.h"
#if defined(__CODEVISIONAVR__)
#define TIMER1_OVF_vect TIM1_OVF
#define COMPILER "CodeVision v2.04.9a"
#elif defined(__IMAGECRAFT__)
#define TIMER1_OVF_vect iv_TIM1_OVF
#define COMPILER "ImageCraft v7"
#elif defined(__GNUC__)
#define COMPILER "GCC 20100110 volatile"
#elif defined(__IAR_SYSTEMS_ICC__)
#define COMPILER "IAR 5.3"
#endif
//-----globals----------
unsigned int tofs;
unsigned long t1, t2;
unsigned long dt, ohdt, net;
volatile int n, i, j;
volatile float x, y, z, m, b;
volatile long ix, iy, iz, im, ib;
float sec, flops;
char buf1[32], buf2[32]; // tiny buffers cos m168 has little SRAM
#include <string.h>
extern void initstdio(void); // start up stdio to USART @ 9600 baud
float secs, secpertic = 1.0 / F_CPU;
unsigned long overflows;
unsigned long gettics(void)
{
unsigned long time;
CLI();
TCCR1B = 0; // stop timer
time = overflows + TCNT1L;
time += (TCNT1H<<8u);
TCCR1B = (1<<CS10); // start again
SEI(); // with overflows
return time;
}
ISR(TIMER1_OVF_vect)
{
overflows += 65536u;
}
// yes, I know that there are function overheads for both report() and gettics()
// I could use a macro to stop/start Timer1 if anyone is worried.
void report(char *title)
{
t2 = gettics();
dt = t2 - t1;
net = dt - ohdt;
secs = net * secpertic;
flops = n / secs;
printf("% 8lu % 8.6f % 9.1f % 8.1f [% 4d] %s\r\n", net, secs, flops, flops * 1e6 / F_CPU, n, title);
}
void main(void)
{
//fpbench main program
volatile char c;
initstdio();
TCCR1B = (1<<CS10);
TIMSK1 = (1<<TOIE1);
SEI();
printf("\r\nHello " COMPILER " running @% 6.3fMHz\r\n", 0.000001 * F_CPU);
c = 0;
n = 1000;
while (c != 'q') {
n = 1000;
printf(" cycles secs flops OPS/MHz [iter] operation\r\n");
t1 = gettics();
for (i = 0; i < n; i++) {
j = n - i;
x = i;
y = j;
z = x;
}
report("overhead loops");
t1 = gettics();
for (i = 0; i < n; i++) {
j = n - i;
x = i;
y = j;
z = x + y;
}
report("fp adds");
t1 = gettics();
for (i = 0; i < n; i++) {
j = n - i;
x = i;
y = j;
z = x * y;
}
report("fp mults");
t1 = gettics();
for (i = 0; i < n; i++) {
j = n - i;
x = i;
y = j;
z = x * 1;
}
report("fp mults by 1");
t1 = gettics();
for (i = 0; i < n; i++) {
j = n - i;
x = i;
y = j;
z = x * 0;
}
report("fp mults by 0");
t1 = gettics();
for (i = 0; i < n; i++) {
j = n - i;
x = i;
y = j;
z = x / y;
}
report("fp divs");
t1 = gettics();
for (i = 0; i < n; i++) {
j = n - i;
x = i;
y = j;
z = x / 1.0; //this didnt work with 1
}
report("fp div by 1");
t1 = gettics();
for (i = 0; i < n; i++) {
j = n - i;
x = i;
y = j;
z = sin(x);
}
report("sin");
t1 = gettics();
for (i = 0; i < n; i++) {
j = n - i;
x = i;
y = j;
z = log(x);
}
report("log");
t1 = gettics();
for (i = 0; i < n; i++) {
j = n - i;
x = i;
y = j;
z = sqrt(x);
}
report("sqrt");
t1 = gettics();
for (i = 0; i < n; i++) {
j = n - i;
x = i;
y = j;
z = pow(x, .5);
}
report("pow");
t1 = gettics();
ib = 1;
for (i = 0; i < n; i++) {
j = n - i;
ix = i;
im = j;
iy = im * ix + ib;
}
report("y=mx+b using longs");
t1 = gettics();
b = 1.0;
for (i = 0; i < n; i++) {
j = n - i;
x = i;
m = j;
y = m * x + b;
}
report("y=mx+b using floats");
n = 1;
t1 = gettics();
for (i = 0; i < n; i++) {
for (j = 0; j < 32767; j++) {
memcpy(buf2, buf1, sizeof(buf1)); //took 5 sec for 10 megs... about 2 megs/sec
}
}
report("memcpy 1MB");
c = 'q';
}
}
Yes. It would be very wise to re-write the 'tests' to confound any optimiser.
No. I cannot compile with either IAR or ImageCraft cos I only have evaluation versions.
Personally, I would reset the timer before each test. You run a certain risk of granularity with the overflow ISR(). i.e. if your function just happens to start or finish at an overflow.
I also see no great point in counting to the nearest cycle since you are already doing 1000 iterations.
In fact I have always just timed a single operation. Providing you stop the timer immediately, you can take as long as you like to process any results.
No. I cannot understand the difference in results. I am fairly confident that my calculations are ok. OTOH, the timed 'sequences' are fairly pointless. I also compiled for a mega128. The results are about the same. The stupid loops and arithmetic are slightly different.
765206 0.047825 20909.4 1306.8 [1000] fp mults by 0
Besides just the general difference by a constant factor between your results and mine with the simulator (no >>wonder<< Bob doesn't like to use the simulator :twisted: ), this particular one makes no sense. In all my runs since 2004 on this "fpbench" it appeared as though MULT0 was a special case, about 4x the speed of a normal mult. Your results are [roughly] the same as, and even worse than, "normal".
Gonna have to set this up...
Lee
I am going down to the pub.
I might change the 't1= gettics()' to 'stop();TCNT1=0;overflows=0;start();'
This would alleviate the overflows. (as would prescaling the timer)
Likewise, what is the point of 1000 iterations?
Nothing overflows with n=1.
I may re-arrange tomorrow so that I can compile with 4k IAR or ICC evaluations.
The loop across n is the clever part I invented. See how it multiplies 1 x 999 then 2 x 998 then 3 by 997? This eliminates/averages out the speed dependency on data.
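In other words (a sketch, with an illustrative function name): iteration i feeds the pair (i, n-i) to the operation, sweeping the operand magnitudes in opposite directions so any data-dependent timing averages out over the run.

```c
/* The operand pair Bob's loop produces on iteration i. */
long pair_product(int n, int i)
{
    int x = i;
    int y = n - i;          /* 1*999, 2*998, 3*997, ... for n = 1000 */
    return (long)x * y;
}
```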
David--
Didja get this warning--
Warning: C:\AtmelC\dp.c(101): overflow is possible in 16 bit shift left, casting shifted operand to 'long' may be required
time += (TCNT1H<<8u);
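One way to silence that warning, sketched on plain integers rather than the real TCNT1H/TCNT1L registers, is the cast the message suggests: widen the high byte before shifting so the <<8 happens at long width.

```c
#include <stdint.h>

/* Illustrative stand-in for time += (TCNT1H << 8): widen first,
   then shift, so the result cannot overflow a 16-bit int. */
unsigned long widen_and_combine(uint8_t hi, uint8_t lo)
{
    return ((unsigned long)hi << 8) | lo;
}
```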
First run results below, and are consistent with your CV results. Mega1280 running at 7.3728MHz, using USART1 in RS485 mode. ;) Now, why are they so far off from my cycle-counting...?
Thanks for the comparison work folks. I think we are getting good results that we can quote with confidence now. The original question was about how to get a sin function either faster or more accurate (can't have both!). In the lo-res monochrome graphics I've done, rotating a box on a 240x160 screen looked ok with a 1-degree table; each table entry was a signed int representing a fraction 0 to .9999. The SGI workstations running OpenGL used a tenth-degree table, giving 3600 angles per circle, but they had plenty of RAM. If the Original Poster reads this, I hope it answers the original question.
I'd have to dig out some >>real<< old stuff which I don't think I have anymore, but what works quite well for rotating a wire-frame is to examine the transformation equation and look for simplification. IIRC the key was that the cos of the rotation angle was an important part of that equation, and cos and cos*cos of a small angle is so close to 1 that you just use 1 and then the calculations get much faster.
When you finally stop and settle, then you have the time to redraw it "right".
OK, it makes more sense now. I think you are missing a line--"ohdt" was never set so the counts are the overhead >>plus<< the time for the operations.
Yup. That is the cause of the anomaly. I ignored the 'loop overhead' which is actually a significant proportion of the execution time.
IMHO, you need to devise tests that do not rely on subtracting two similar times. And quite honestly you can time an individual C statement easily. Just start() and stop() the Timer1 (effectively SBI and CBI). You can still accumulate the times for several input values, and of course separate functions and automatic casts.
The extended loop idea is best suited to a low-resolution system timer. Years ago, I used to profile 68000 programs using just a 200Hz system tick.
Since you can count AVR cycles exactly with Timer1, there is no need to do the statistical approach.
These results are obtained by adding the 'ohdt=dt;' loop overhead statement, and removing the 'z = x;' from the overhead. This makes the ohdt subtraction give +ve answers except in the 'y=mx+b using longs' test.
But now, when someone says "I'm trying to run my autopilot at 200Hz, and it looks like I need to do about 300 fp ops. How can I tell if it busts my time budget?" we can tell him he can do 100 fp adds and mults per ms with GCC, so he has 3ms worth of fp calcs in his 5ms loop.
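That budget arithmetic, as a trivial sketch (the function name is mine):

```c
/* FP time per control-loop iteration, given a measured ops-per-ms figure. */
double fp_budget_ms(unsigned long ops_per_loop, unsigned long ops_per_ms)
{
    return (double)ops_per_loop / (double)ops_per_ms;
}
```

300 ops at 100 ops/ms is 3ms, which fits a 5ms (200Hz) loop with 2ms to spare for everything else.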
Note that you can 'count' several statements at once. Simply enclose argument within braces:
CYCLE_COUNT({j = n - i; ix = i; im = j;});
In all honesty, this is simpler than using the Simulator. You can 'correct' the cycle count if you are obsessive. You can also time real-life peripherals too. Of course it uses a 16-bit timer, but you can use an 8-bit timer with prescale. You are already counting the overflows. Just change the MACRO()s
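For anyone without the original macros, here is a host-side analogue of the same shape, with clock() standing in for the timer reads (assumption: the real AVR macro snapshots TCNT1 plus the overflow total instead).

```c
#include <time.h>

/* Host sketch of the CYCLE_COUNT() idea: snapshot a counter, run the
   braced statements, snapshot again. On the AVR the snapshots would be
   gettics()-style timer reads; clock() makes the sketch run anywhere. */
#define CYCLE_COUNT(stmts, out_ticks)                 \
    do {                                              \
        clock_t t0_ = clock();                        \
        stmts;                                        \
        (out_ticks) = (long)(clock() - t0_);          \
    } while (0)
```

As with the AVR version, statements containing top-level commas would need extra parentheses, but braced compound statements pass through the macro argument fine.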
I have never really bothered to trace compiler private functions.
From my results, the C statement 'z = x * y' takes 152 cycles with GCC on average.
Bear in mind that you have to push two f-p values onto the stack, evaluate an expression, then store back into the f-p variable's memory.
I can only suggest that you submit some better ASM code to the GCC team.
In the interests of ImageCraft and Codevision, you could offer it to Richard and Pavel too.
It would be nice to think that AVR compilers used the best possible internal math functions. OTOH, this requires some effort. Very few people use f-p, and when they do, it is seldom critical.
The first Z80, 68000, 8086 compilers never had optimum software f-p. Commercial competition ensured that vendors spent money on it.
Nowadays, PCs have hardware f-p. I doubt that anyone would bother writing software f-p from scratch today.
All the same, I would put f-p efficiency way down my list of priorities.
I just find it sad that a 6502 BASIC program can do sin(x) in 40-bit format (that's without a HW MUL!) in about 10000 clk, and the fastest AVR code (32-bit FP) is only about 5 times faster!
Quote:
From my results, the C statement 'z = x * y' takes 152 cycles with GCC on average.
I should hope not!
I assume that a big part of the time is the int-to-FP conversion (the normalisation takes some time).
The max/min for the FP mul should be very close to the AVG unless it's a special case like 0 or 1; there are at most 2 shifts in the normalisation of the result.
All the algorithms for software f-p have been known for years. They require little change from one 8-bit CPU to another.
OTOH, I have no interest or inclination to write the required optimal ASM for an AVR core.
If I did have the inclination, then I am sure it is not too difficult. The AVR world is waiting for you to do the work.
The existing GCC source code is in the public domain. Why not improve it? (or re-write completely if applicable)
The 6502 CPU was a lovely chip to work with. OTOH, some instructions were fairly slow, especially with the more esoteric addressing modes. However it was the addressing modes that made it so nice to work with.
I normally don't need FP, and the only times I have used it was because integer with the needed dynamic range was too slow.
And no, I have no plans of making some general routines that can be slowed down when used with a compiler.
And yes, I sometimes use FP to show results in the correct SI units, and not ADC units etc., but then nobody cares about the speed.
Quote:
And no, I have no plans of making some general routines that can be slowed down when used with a compiler.
You know perfectly well that that statement is idiotic.
An internal math function has one or two f-p inputs and an f-p output. GCC passes the arguments in registers, and returns them in registers. I cannot think of any better arrangement.
So these functions are usable with any language or model.
If you think it is important to have faster ASM functions, it is equally applicable to anything else.
Quote:
You know perfectly well that that statement is idiotic.
An internal math function has one or two f-p inputs and an f-p output. GCC passes the arguments in registers, and returns them in registers. I cannot think of any better arrangement.
David, try writing test code in GCC that just does an fp multiply, for example, and you'll find that the compiler doesn't just call a single _fpmul but calls a chain of functions (can't remember the exact details, but things like _fpnormalise, fptestsign, that kind of thing), so I think Sparrow is right. Even if he provided an optimised core _fpfinallydothemul it would be watered down by the preceding steps the compiler generates. I suspect you'd need to get to the root of generic GCC's fp handling (not just the AVR support functions) to really have an impact.
Yup. I had a look at the avr-libc-1.4.6 source code which happened to be on my PC. Yours is much the same. i.e. it splits the float into an internal representation, does the mulsf3 and puts it back again.
I presume that it does complex expressions in the internal format. No doubt you would choose to use this internal representation yourself if you want to write some new f-p functions in ASM.
But at some stage you have to put the internal representation back into the SRAM memory. This applies to ASM programs too. (unless you are going to use 5 or 6 bytes storage for single-precision floats)
You can use whatever format or size accumulator that you like for internal calculations.
To conform to the IEEE format for storage is only required at the beginning and end of a calculation.
Incidentally, I timed some IAR C statements. They are of a similar time to CV.
I would be interested in your opinions about achievable C times. There are effectively 3 components to a calculation:
1. unpacking into a suitable format for efficient calculation.
2. performing the calculation.
3. normalising and converting back to the IEEE storage format.
I would guess that (3) has the most scope for improvement.
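As a concrete picture of steps (1) and (3), here is a hedged sketch (the function name is mine) of unpacking an IEEE-754 single into sign / biased exponent / mantissa-with-hidden-bit, the kind of internal form the library routines shuttle values in and out of:

```c
#include <stdint.h>
#include <string.h>

/* Unpack an IEEE-754 single. memcpy sidesteps strict-aliasing issues.
   Denormals (exp == 0) have no hidden bit. Assumes 32-bit IEEE floats. */
void unpack_float(float f, uint32_t *sign, uint32_t *exp, uint32_t *mant)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    *sign = bits >> 31;
    *exp  = (bits >> 23) & 0xFFu;        /* biased: 127 means 2^0 */
    *mant = bits & 0x7FFFFFu;
    if (*exp != 0)
        *mant |= 0x800000u;              /* restore the hidden 1 */
}
```

Repacking is the mirror image plus rounding and renormalisation, which is where the shift-and-round cycles in step (3) get spent.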
I guess I can't get my brain around how one could build a benchmark without optimization, but I take your point about the optimized lib code.
I was led to believe it was 18.432MHz actually - running a mega32 out of spec.
Hmmm--I'll have to poke at this, and see why your CV numbers don't mesh with mine using the simulator.
Of course Pavel may have changed the compiler between versions. Both my mega128 and mega168 builds are small model, min size, max optimisation.
...and then I get
The anomalies happen when e.g. "using longs" the time ends up less than the overhead. Got to look at that MULT0 a bit.
But now I don't really care too much. The numbers in general for the operations are consistent with the ones I posted earlier.
Did the GCC version have the overhead subtracted?
TopThanks for the comparison work folks. I think we are getting good results that we can quote with confidence now. The original question was about how to get a sin function either faster or more accurately (cant have both!). In the lo res mono chrome graphics I've done, rotating a box on a 240x160 screen looked ok with a 1 degree table, each table entry was a signed int, representing a fraction 0 to .9999. The SG workstations running opengl used a tenth degree table, giving 3600 angles per circle, but they had plenty of ram. If the Original Poster reads this, I hope it answers the original question.
Imagecraft compiler user
- Log in or register to post comments
TopI'd have to dig out some >>real<< old stuff which I don't think I have anymore, but what works quite well for rotating a wire-frame is to examine the transformation equation and look for simplification. IIRC the key was that the cos of the rotation angle was an important part of that equation, and cos and cos*cos of a small angle is so close to 1 that you just use 1 and then the calculations get much faster.
When you finally stop and settle, then you have the time to redraw it "right".
You can put lipstick on a pig, but it is still a pig.
I've never met a pig I didn't like, as long as you have some salt and pepper.
- Log in or register to post comments
TopYup. That is the cause of the anomaly. I ignored the 'loop overhead' which is actually a significant proportion of the execution time.
IMHO, you need to devise tests that do not rely on subtracting two similar times. And quite honestly, you can time an individual C statement easily: just start() and stop() Timer1 (effectively SBI and CBI). You can still accumulate the times for several input values, and of course separate functions and automatic casts.
The extended loop idea is best suited to a low-resolution system timer. Years ago, I used to profile 68000 programs using just a 200Hz system tick.
Since you can count AVR cycles exactly with Timer1, there is no need to do the statistical approach.
These results are obtained by adding the 'ohdt=dt;' loop overhead statement, and removing the 'z = x;' from the overhead. This makes the ohdt subtraction give +ve answers except in the 'y=mx+b using longs' test.
I downloaded the current ICCAVR v8. The comparison with avr-gcc is revealing! If anyone has an IAR licence, we could see how effective IAR is.
Quite honestly, these tests are not the best way of timing C library functions. OTOH, they satisfy Bob's methodology.
David.
But now, when someone says "I'm trying to run my autopilot at 200Hz, and it looks like I need to do about 300 fp ops. How can I tell if it busts my time budget?", we can tell him he can do 100 fp adds and mults per ms with GCC, so he has 3ms worth of fp calcs in his 5ms loop.
Imagecraft compiler user
Those results definitely confirm my reservations about ICC ;-)
I re-wrote the fpbench program to cycle-count in a more sensible way, i.e. count cycles for a specific sequence.
I have made no attempt to subtract the 'counting' overhead. As you can see, it is only 1 or 2 cycles per iteration anyway.
Note that you can 'count' several statements at once. Simply enclose the argument within braces:
In all honesty, this is simpler than using the Simulator. You can 'correct' the cycle count if you are obsessive, and you can time real-life peripherals too. It uses a 16-bit timer, but you could use an 8-bit timer with a prescaler: you are already counting the overflows, so just change the MACRO()s.
The method works ok with JTAG too.
David.
It must be time to check the generated code!
An FP mul should not even take 50 clk!
I have never really bothered to trace compiler private functions.
From my results, the C statement 'z = x * y' takes 152 cycles with GCC on average.
Bear in mind that you have to push two f-p values onto the stack, evaluate an expression, then store back into the f-p variable's memory.
I can only suggest that you submit some better ASM code to the GCC team.
In the interests of ImageCraft and Codevision, you could offer it to Richard and Pavel too.
It would be nice to think that AVR compilers used the best possible internal math functions. OTOH, this requires some effort. Very few people use f-p, and when they do, it is seldom critical.
The first Z80, 68000, 8086 compilers never had optimum software f-p. Commercial competition ensured that vendors spent money on it.
Nowadays, PCs have hardware f-p. I doubt that anyone would bother writing software f-p from scratch today.
All the same, I would put f-p efficiency way down my list of priorities.
David.
I just find it sad that a 6502 BASIC program can do sin(x) in 40-bit format (that's without a HW MUL!) in about 10000 clk, and the fastest AVR code (32-bit FP) is only about 5 times faster!
I hope not!
I assume that a big part of the time goes on converting int to FP (does the normalisation take some time?).
The max/min for the FP mul should be very close to the AVG unless it's a special case like 0 or 1; there are at most 2 shifts in the normalisation of the result.
It's still just an 8-bit micro and a RISC rather than a CISC - why would you expect it to be even as good as a 6502, let alone 5 times faster?
A 6502 is 1/2 MIPS per MHz.
No HW mul.
And it was BASIC, so a big overhead.
Edit: and the format was 32-bit, not 24-bit, for the mul itself.
I thought commercial compilers outperformed the free ones in all aspects (except cost).
George.
PIC32 based Ethernet Shield Arduino Uno hardware compatible
PIC32 based Ethernet Shield with Network Switch Arduino Uno hardware compatible
@sparrow2,
All the algorithms for software f-p have been known for years. They require little change from one 8-bit CPU to another.
OTOH, I have no interest or inclination to write the required optimal ASM for an AVR core.
If I did have the inclination, then I am sure it is not too difficult. The AVR world is waiting for you to do the work.
The existing GCC source code is in the public domain. Why not improve it? (or re-write completely if applicable)
The 6502 CPU was a lovely chip to work with. OTOH, some instructions were fairly slow, especially with the more esoteric addressing modes. However it was the addressing modes that made it so nice to work with.
David.
I normally don't need FP, and the only times I have used it were because integer with the needed dynamic range was too slow.
And no, I have no plans of making some general routines that get slowed down when used with a compiler.
And yes, I sometimes use FP to show results in the correct SI units, and not ADC units etc., but then nobody cares about the speed.
You know perfectly well that that statement is idiotic.
An internal math function has one or two f-p inputs and an f-p output. GCC passes the arguments in registers, and returns them in registers. I cannot think of any better arrangement.
So these functions are usable with any language or model.
If you think it is important to have faster ASM functions, it is equally applicable to anything else.
David.
I can tell that you have never done any realtime programming, and comments like this show the lack of knowledge very well!
An example could be an IIR filter that lives in its own registers.
David, try writing test code in GCC that just does an fp * for example, and you'll find that the compiler doesn't just call a single _fpmul but calls a chain of functions (can't remember the exact details, but things like _fpnormalise, _fptestsign, that kind of thing), so I think Sparrow is right. Even if he provided an optimised core _fpfinallydothemul, it would be watered down by the preceding steps the compiler generates. I suspect you'd need to get to the root of generic GCC's fp handling (not just the AVR support functions) to really have an impact.
And the format/structure of IEEE-754 FP doesn't help; if you need speed, keep it in a 5-byte structure.
I will have a look sometime. As a general rule, a multiplication or division does not require a normalise.
OTOH, addition and subtraction are relatively complicated. You have to normalise both before and after.
From distant memory, f-p on a 6502 used a larger 'unpacked' accumulator for intermediate expressions.
I will shut up for the moment. (until I have investigated some actual AVR code)
All the same, an arithmetic operation is always two inputs producing one output. Assignment of the eventual result to memory is a common task.
David.
Here's an example:
So far so good if that called function were just doing the guts of the fpMul. However:
I won't quote the body of all those but I think you get the idea?
??? For the input, yes, but not for the result:
FP mul is [0.5 .. 1.0[ * [0.5 .. 1.0[ = [0.25 .. 1.0[
so as a general rule you have to normalise 50% of the time.
And for division the result can be worse!
Yup. I had a look at the avr-libc 1.4.6 source code, which happened to be on my PC. Yours is much the same, i.e. it splits the float into an internal representation, does the __mulsf3 and puts it back again.
I presume that it does complex expressions in the internal format. No doubt you would choose to use this internal representation yourself if you wanted to write some new f-p functions in ASM.
But at some stage you have to put the internal representation back into SRAM. This applies to ASM programs too (unless you are going to use 5 or 6 bytes of storage for single-precision floats).
David.
If you want the ASM speed, it's better to use a 2's-complement FP, so it's faster to convert to and from "normal" integer numbers.
4 extra clk for load/store means nothing compared to the calc time!
You can use whatever format or size of accumulator you like for internal calculations.
Conforming to the IEEE format for storage is only required at the beginning and end of a calculation.
Incidentally, I timed some IAR C statements. They are of a similar speed to CV.
I would be interested in your opinions about achievable C times. There are effectively 3 components to a calculation:
1. unpacking into a suitable format for efficient calculation.
2. performing the calculation.
3. normalising and converting back to the IEEE storage format.
I would guess that (3) has the most scope for improvement.
David.