## Evaluating Elementary Functions in Assembly


Hey!

I'm working on writing libraries to do all sorts of things like IO and arithmetic. I've come across a problem though and am not sure what the best approach would be.

The problem is that I want to evaluate elementary functions, like sine, exp, etc., to a high degree of precision while leaving a low memory footprint. I'm working with an ATmega328P, so I have 32k program memory and 2k SRAM. So far I have implemented some basic functions for signed and unsigned integers of different sizes and also signed floats (IEEE-754). The question now is how to evaluate functions such that I get a precise1 signed float back quickly.

I can see how one could use a table in program memory to look up certain values. However, this costs way too much memory to get full precision.

Another way would be to use a Taylor series for the given function. However, this takes way too much time to be fast and isn't even that precise.

Maybe it is possible to use interpolation to save space when going with the table approach? I'd like to hear your guys' input on this.

1: by precise I mean it uses all 23 bits of the mantissa.
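For what it's worth, the table-plus-interpolation idea is easy to prototype on the host first. Here is a minimal Python sketch (host-side only; an AVR version would use fixed-point and a table in flash) showing how linear interpolation trades table size against worst-case error over [0, π/2]:

```python
import math

def make_table(n):
    # Quarter-wave sine table, as it might be stored in program memory.
    return [math.sin((math.pi / 2) * i / (n - 1)) for i in range(n)]

def sin_lerp(x, table):
    # Look up sin(x) for x in [0, pi/2] with linear interpolation
    # between the two neighbouring table entries.
    n = len(table)
    pos = x / (math.pi / 2) * (n - 1)
    i = min(int(pos), n - 2)
    frac = pos - i
    return table[i] + frac * (table[i + 1] - table[i])

for n in (17, 257):
    table = make_table(n)
    err = max(abs(sin_lerp(x / 1000, table) - math.sin(x / 1000))
              for x in range(0, 1571))
    print(n, err)
```

With 17 entries the worst-case error is around 1e-3; with 257 entries it drops to a few parts in a million, i.e. roughly 18 good bits. That illustrates why a table alone struggles to reach all 23 mantissa bits within 32k of flash.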

Be sure to look at CORDIC methods

See here for demo & some code  ...the calculation code seems pretty good & short & quick!

When in the dark remember-the future looks brighter than ever.   I look forward to being able to predict the future!

Last Edited: Sun. Jan 12, 2020 - 11:48 PM
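To make the CORDIC suggestion concrete, here is a minimal host-side sketch of rotation-mode CORDIC. Floating point is used for clarity; on the AVR the multiplications by 2⁻ⁱ become arithmetic shifts on fixed-point registers, so the inner loop is only shifts, adds, and compares:

```python
import math

N = 24  # iterations; roughly one bit of precision per iteration
ANGLES = [math.atan(2.0 ** -i) for i in range(N)]
K = 1.0
for i in range(N):
    K /= math.sqrt(1 + 2.0 ** (-2 * i))   # pre-computed gain correction

def cordic_sincos(theta):
    # Rotation mode: drive the residual angle z to zero by rotating
    # through the fixed angles atan(2^-i).  Valid for |theta| <= ~1.74.
    x, y, z = K, 0.0, theta
    for i in range(N):
        d = 1.0 if z >= 0 else -1.0
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * ANGLES[i]
    return y, x   # (sin(theta), cos(theta))
```

Note the linear convergence the later posts mention: each iteration buys about one bit, so 24 iterations for a 24-bit result.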

This seems like a good algorithm to investigate. Thanks for the information!

Or an external chip.  A 2Mb serial EEprom can be had in a lil' 8-pin SOIC for a buck-fifty or so*.  You can do the calculations offline to load the EE with your lookup table(s)**.  How much of your time is worth USD\$1.50?  S.

** With a little clever trig, you don't need a lookup table for all the functions...  This is beyond the scope of this remark.
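The offline generation step might look something like this sketch (file name and table size are arbitrary choices; the quarter-wave symmetry is one example of the "clever trig" that shrinks the table, since sin(π-x) = sin(x) and sin(-x) = -sin(x) recover the full circle):

```python
import math
import struct

def sine_table_bytes(n=1024):
    # Quarter-wave table of unsigned 16-bit samples, little-endian,
    # ready to burn into an external serial EEPROM.
    words = [round(65535 * math.sin((math.pi / 2) * i / (n - 1)))
             for i in range(n)]
    return struct.pack("<%dH" % n, *words)

blob = sine_table_bytes()
# e.g. open("sine.bin", "wb").write(blob) and feed that to the programmer
```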

If you were working in Asm, why would you need something as high-level as floats and sines? Surely as soon as you get to writing apps that MUST have float/sin, you would just switch to C, which already has an IEEE754-based implementation of all that? Or are you trying to write your own C compiler?

Because sometimes you have to do trig fast, and it's sometimes faster in assembler.

I don't do much with floats, ackshully, but I do do trig in assembler, in fixed point and mostly with lookups - see "resolver processing".  Life would be so much easier if pi == 3.  S.

Last Edited: Mon. Jan 13, 2020 - 10:46 AM

Scroungre wrote:
and it's faster in assembler.
Faster than what? The implementations of sin() etc and +-*/ for float in a C compiler are almost certainly implemented in hand-crafted Asm too. So does your writing in assembler use some special, fast opcodes that were not available to the Asm programmer who wrote the float library for a C compiler?!?

For example, here is the core of sin() for the fp lib in avr-gcc:

http://svn.savannah.gnu.org/viewvc/avr-libc/trunk/avr-libc/libm/fplib/fp_sinus.S?revision=2473&view=markup

Last Edited: Mon. Jan 13, 2020 - 09:50 AM

Well, for one, I'm not using a float library.  All that overhead can go away.

For two, I can use dirty shortcuts - e.g. sending the high byte to the DAC while the low byte(s) is(are) still being refined.

(Given a multi-byte DAC, and/or a serial interface).

For three, I can use ALL the registers as I choose to allocate them.

For four, I can fine-tune it for the hardware I have handy.  If I happen to have an external sine wave DDS chip, I can use it (I do: It's a 22V10, in case you care).  Or an external lookup table chip.

And finally, I know always, every time, *exactly* how many cycles it is going to take.  I've said this over and over again, that sometimes it's not that important to be 'the fastest', but to know 'exactly how fast you are', regardless of compilation options, compiler implementations, phase of the moon, or anything else.

Among other things.  Yeah, it's a complex pain.  But it does allow rather more flexibility, and sometimes that flexibility provides the speed you need.  S.

clawson wrote:

For example, here is the core of sin() for the fp lib in avr-gcc:

http://svn.savannah.gnu.org/viewvc/avr-libc/trunk/avr-libc/libm/fplib/fp_sinus.S?revision=2473&view=markup

And is it just me, or did that push ZL only to pop it to r0?  Use the Y ptr (YH and YL) and save two cycles, right there.  So there.    S.

Scroungre wrote:
it's faster in assembler

That old chestnut again!

It's possible that a skilled & experienced assembler programmer might be able to come up with a specifically-optimised solution for a particular requirement that beats a general-purpose library.

But it's clear that is not the case here!

JDArduino wrote:
libraries to do all sorts of things

If you're just doing general-purpose stuff then, as Cliff said, this really is a job for a HLL !

https://www.avrfreaks.net/commen...

https://www.avrfreaks.net/commen...

And, because they are so widely used, it is likely that the AVR 'C' libraries are pretty well optimised anyhow.

You can, of course, just call 'C' libraries from assembler ...

Top Tips:

1. How to properly post source code - see: https://www.avrfreaks.net/comment... - also how to properly include images/pictures
2. "Garbage" characters on a serial terminal are (almost?) invariably due to wrong baud rate - see: https://learn.sparkfun.com/tutorials/serial-communication
3. Wrong baud rate is usually due to not running at the speed you thought; check by blinking a LED to see if you get the speed you expected
4. Difference between a crystal, and a crystal oscillator: https://www.avrfreaks.net/comment...
5. When your question is resolved, mark the solution: https://www.avrfreaks.net/comment...
6. Beginner's "Getting Started" tips: https://www.avrfreaks.net/comment...

awneil wrote:

Scroungre wrote:

it's faster in assembler

That old chestnut again!

It's possible that a skilled & experienced assembler programmer might be able to come up with a specifically-optimised solution for a particular requirement that beats a general-purpose library.

I have this funny feeling in my head that I AM a skilled & experienced assembler programmer.

And furthermore, that this last little bit came about because clawson asked why you even want to do trig in assembler, and I said because sometimes I can do it better and faster!  I wasn't referrin' to the OP there.

So.  We've gone all OT.  Anything else?  S.

Scroungre wrote:
I AM a skilled & experienced assembler programmer

I said because sometimes I can do it better and faster!

No, you didn't - you said, "it's faster in assembler" - without qualification.

It is important to give that qualification - it is a common misconception ("chestnut") that the act of simply writing in assembler - in & of itself - will inherently make your code "faster".

My point (and, I suspect, Cliff's) to the OP is that one would only do this in exceptional circumstances where there was a specific & compelling need - and one had the necessary experience.


The post has been properly edited.  S.

Edited to note - I did not have to remove the italics, because Mr. Lawson put them in his quoting me.  Plbbht.

Last Edited: Mon. Jan 13, 2020 - 10:49 AM

Thanks!

@ JDArduino - this does still leave the question of why you, specifically, want to do this in assembler and re-invent this wheel,  rather than use a HLL which gives you all this already - and probably better!

Note that a HLL doesn't compel you to use its full-fat libraries:  you can still implement simplified/optimised solutions yourself in the HLL - or link to assembler routines for code that really does need to be in assembler.

https://www.avrfreaks.net/commen...

EDIT

See #18 & #19, below - the Atmel assembler doesn't support linking.

Last Edited: Mon. Jan 13, 2020 - 02:56 PM

IEEE float is implemented in avr-libc; you can use it / link against it from assembly code. Just obey the ABI. http://gcc.gnu.org/wiki/avr-gcc

Apart from that, Taylor (and Padé, for that matter) are suboptimal; better to use MiniMax instead.

For IEEE double, MiniMax performs better than CORDIC, cf. https://www.mikrocontroller.net/topic/85256#6030049

avrfreaks does not support Opera. Profile inactive.
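To make the MiniMax-over-Taylor point concrete, here is a small pure-Python comparison. A Chebyshev-node interpolant serves as an easily computed stand-in for the true MiniMax polynomial (a real MiniMax fit needs a Remez solver); even this stand-in beats a same-degree Taylor polynomial for exp on [-1, 1] by a wide margin in worst-case error:

```python
import math

def cheb_coeffs(f, n):
    # Coefficients of the degree-(n-1) Chebyshev interpolant on [-1, 1],
    # a close, cheap approximation to the true MiniMax polynomial.
    nodes = [math.cos(math.pi * (k + 0.5) / n) for k in range(n)]
    fv = [f(x) for x in nodes]
    return [2.0 / n * sum(fv[k] * math.cos(j * math.pi * (k + 0.5) / n)
                          for k in range(n)) for j in range(n)]

def cheb_eval(c, x):
    # Clenshaw recurrence; c[0] enters with weight 1/2.
    b1 = b2 = 0.0
    for cj in reversed(c[1:]):
        b1, b2 = 2 * x * b1 - b2 + cj, b1
    return x * b1 - b2 + 0.5 * c[0]

def taylor_exp(x):
    # Degree-5 Taylor polynomial for exp, expanded at 0.
    return sum(x ** k / math.factorial(k) for k in range(6))

c = cheb_coeffs(math.exp, 6)   # also degree 5
grid = [i / 500 - 1 for i in range(1001)]
err_taylor = max(abs(taylor_exp(x) - math.exp(x)) for x in grid)
err_cheb = max(abs(cheb_eval(c, x) - math.exp(x)) for x in grid)
print(err_taylor, err_cheb)
```

Same degree, same evaluation cost, yet the near-MiniMax fit's worst-case error is over an order of magnitude smaller: Taylor spends its "budget" matching derivatives at one point, which is of no use over an interval.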

As a more useful piece of information to the OP: if you're making lookup tables and storing them (in, say, an external serial EEprom), it doesn't really matter how efficient the generating algorithm is, because you only have to run it once. Accuracy matters rather more. And then it's lookup tables, which are sometimes* faster, and it's the lookup you have to do at speed.  S.

* Sometimes it's faster to calculate.  I thought I should include that caveat.

PS - Can't say I've ever had an application that used 23 bits of the mantissa.  Maybe if you're aiming radio telescopes or something.  S.

SprinterSB wrote:
you can use it / link against it from assembly code.
But only if you use avr-as and not Atmel Asm2 ;-)

clawson wrote:
not Atmel Asm2

I keep forgetting that, too!

So would it be fair to say that the Atmel assembler is really only for simplistic code - not suitable to "serious" projects ?


awneil wrote:
So would it be fair to say that the Atmel assembler is really only for simplistic code - not suitable to "serious" projects ?
But that's true of most projects. You use Asm for tight/fast control apps in 0.5K..2K micros, but above that most people would use an HLL for more serious projects (mainly "cost of ownership"!). So for the kind of thing you might do exclusively in Asm, it probably doesn't matter which assembler you use - and the vast majority of support and knowledge of AVR Asm on the net is actually for the Atmel assembler, not the GNU one.

I used to believe there was no such thing as a single best method. But there is always an optimum way to solve an applied problem: the one that best satisfies a given set of conditions.

So would it be fair to say that the Atmel assembler is really only for simplistic code - not suitable to "serious" projects ?

That's rather unfair; there are some pretty complex things done in asm. I've seen several programs that can execute 100+ different commands of moderate complexity, with a specialized parser & graphics generation.

For many years asm was used for everything under the sun, so it can certainly do the job. It's more a question of whether it's suitable for a person to work through.

Of course maintenance & support can be hair raising, so something like C may save both time & sanity.  Cost of ownership should certainly be considered.

Of course if OP wants to develop libraries....just using one seems a bit of a cheat!

for float in a C compiler are almost certainly implemented in hand-crafted Asm too

That makes me think of fancy eateries....like I'm being served hand carved roast beef & a side of code.    I started noticing everywhere I went they say "hand-made ice cream cones", "hand crafted deluxe grinders" (sandwiches)   hand-crafted leather belts, etc.

Soon it will be robots


Last Edited: Mon. Jan 13, 2020 - 05:48 PM

I've been recommended to use a higher level language again and again, but please understand I'm not doing this to improve upon existing libraries or make giant projects. I'm simply doing this for fun as a challenge to learn new things about how algorithms work on the smallest scale and simply calling on existing libraries through C would not give me that same experience.

Until a few days ago I didn't know how floats, fixed point and most integer types were stored and used for arithmetic. Now I do, and even if that seems useless or like 're-inventing the wheel' I find joy in it.

JDArduino wrote:
learn new things about how algorithms work

But, surely, writing in assembler just obscures that?


awneil wrote:

JDArduino wrote:

learn new things about how algorithms work

But, surely, writing in assembler just obscures that?

Too true.  To really understand electronic calculation, you should start with resistors, capacitors, inductors, and transistors.  If you ask nicely I'll let you use an op-amp.    S.

awneil wrote:

JDArduino wrote:
learn new things about how algorithms work

But, surely, writing in assembler just obscures that?

Not necessarily.

1) The basic arithmetic like +, -, *, /, ==: you want to write that in asm anyway, at least for AVR and similar.

2) Properly commented asm is more legible than scrappily documented C/C++, in particular if you start fiddling with bits like here.

Re. implementing elementary functions:

For most of them one has the choice between CORDIC and polynomial-based (polynomial, rational function) approaches.  The first thing to note is that from a specific precision onwards, polynomial-based is always better than CORDIC.  This is because CORDIC converges only linearly, whereas polynomial-based converges faster.  Where the break-even point is depends on what function you consider and whether it's about absolute errors (like with fixed-point) or relative errors (like with floating-point).

CORDIC only uses simple arithmetic like addition, comparison and shift, but in general you have a multitude of terms, as with polynomials.  When the basic operations are expensive, as with IEEE float, polynomials perform better.  Otherwise you have to base CORDIC on fixed-point arithmetic with a bit-precision some 50% higher than the target precision to achieve the required relative precision.  This would imply that you have several implementations of *: one high-precision version used as the basic arithmetic inside CORDIC so you achieve the required relative precision, and one for the ordinary arithmetic itself.

"precision" here always means w.r.t. worst-case error.  Good average precision is not enough for serious applications:  If the error is good only in 99.9% of the inputs, your rocket will fail soon.

As a general rule of thumb: Using rational approximations is usually only worthwhile when the function to be approximated is not smooth over R like arctan or arcsin.  For smooth functions like sin and exp, it's enough to use polynomials and there is no gain using rational functions (this might be different when dividing is cheap, which is not the case with AVR).

In many approaches you'll see Taylor, but that's suboptimal because it introduces a constraint that's of no use here: matching the n-th derivatives at the point of expansion.  Same with Padé.  As a consequence, use MiniMax polynomials instead of Taylor, and MiniMax rational functions instead of Padé.  These are optimal w.r.t. degree, i.e. amongst all such functions of the same degree, the worst-case error over the considered interval is minimal (that's the definition of MiniMax).  What counts w.r.t. performance is the number of terms to evaluate, hence Taylor might appear to be better when some symmetries apply, e.g. sin expanded at 0.  When you use MiniMax instead, you have to introduce that symmetry by hand, for example by approximating f(x)=cos(sqrt(x)) instead of cos and then using cos(x)=f(x²) in the computation.

In terms of code size, polynomials / rational functions are also preferable over CORDIC because they only use the basic arithmetic, which they re-use.  CORDIC, on the other hand, needs custom algorithms, adding to the overall code size (a program that uses sin is usually also using + and * and such, hence you benefit from code sharing when using polynomials / rational functions).
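The "introduce the symmetry by hand" trick can be sketched like this: sin is approximated as x·p(x²), so only odd powers appear and Horner needs one multiply-add per coefficient. The coefficients below are plain Taylor terms just to keep the sketch short; a production version would use Remez-fitted MiniMax coefficients of the same shape:

```python
import math

# sin(x) ~ x * p(x^2): odd symmetry built in by hand, so a degree-9
# approximation needs only 5 coefficients.  Taylor terms for brevity;
# swap in MiniMax coefficients for a better worst-case error.
C = [1.0, -1.0 / 6, 1.0 / 120, -1.0 / 5040, 1.0 / 362880]

def sin_poly(x):
    # Horner evaluation of p at x^2, then one final multiply by x:
    # 6 multiplies and 4 additions in total.
    x2 = x * x
    acc = C[-1]
    for c in reversed(C[:-1]):
        acc = acc * x2 + c
    return x * acc

err = max(abs(sin_poly(t / 1000) - math.sin(t / 1000))
          for t in range(-1571, 1572))
print(err)
```

Over [-π/2, π/2] this stays within a few parts in a million of sin; combined with argument reduction, the same few coefficients cover the whole real line.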


I remember, when I was a kid, reading that the ZX Spectrum ROM used Chebyshev polynomials to approximate the transcendental functions. I don't know why, but that weird Russian name just stuck in my mind forever.

I can see how the polynomial approach might make sense if my microcontroller could work with floating-point numbers natively, but since it can't, it actually costs quite a few cycles to multiply two floating-point numbers. So I figure that CORDIC is the way to go, at least for sine and cosine.

My current IEEE-754 multiply takes between 81 and 87 cycles and my divide1 between 426 and 456. I haven't counted my addition yet, but I think it will be a tiny bit faster than multiply. I just can't see a reasonable way to evaluate even a polynomial in floating point, since it would quickly take entire milliseconds!

1 the return registers are wrongly documented

Last Edited: Wed. Jan 15, 2020 - 07:43 AM
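A back-of-envelope check with those cycle counts suggests a polynomial costs tens of microseconds, not milliseconds. (The 80-cycle addition below is an assumption, since the add hasn't been counted yet; the polynomial shape is the x·p(x²) form discussed above.)

```python
MUL = 87   # worst-case cycles for the float multiply (from the post above)
ADD = 80   # assumption: addition a little cheaper than multiply

# sin(x) ~ x * p(x^2) with degree-4 p evaluated by Horner:
# 1 multiply for x^2, 4 multiply-adds, 1 final multiply by x.
cycles = MUL + 4 * (MUL + ADD) + MUL
us_at_16mhz = cycles / 16.0
print(cycles, us_at_16mhz)
```

That works out to well under 100 µs per sine at 16 MHz, so a float polynomial is slow but nowhere near "entire milliseconds".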

Notice that IEEE requires a specific relative precision, i.e. the *relative* error must be bounded.  Fixed-point implementations usually don't achieve this when you are close to a zero of the function; take for example cos(π/2+ε) or sin(355).

Apart from that, when you perform argument reduction like for sin and cos, subtracting integral multiples of π/2 is not enough: When the resulting, reduced argument is close to zero, you are bitten by the fact that your representation of π is not exact, hence the argument will be dominated by π's representation error (or multiples of that).  In order to cater for all of this, extra care must be taken.  Try sin(355) for example.  The correct result is around -0x1f9bd0307d1de29p-72 where the latter notation is borrowed from C.

I don't actually know where the break-even point between polynomials and CORDIC is for floating-point, but for IEEE double, MiniMax comes out best (and takes advantage of code sharing).  What's best for IEEE single?  I don't know.  Maybe you can compare avr-libc (which uses polynomials, IIRC) with your implementation.  "Comparing" means comparing performance footprint /and/ worst-case errors.

Which polynomial(s) did you consider by the way?


Last Edited: Wed. Jan 15, 2020 - 08:20 AM
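The sin(355) trap mentioned above is easy to reproduce on the host. With a π constant good to only about seven digits (roughly IEEE single precision), the reduced argument, and hence the result, is off by some 17%; with double-precision π it comes out fine:

```python
import math

# x = 355 is nearly an integer multiple of pi (355/113 is the classic
# approximation), so after subtracting 113*pi the result is dominated
# by the representation error of pi itself.
PI_SHORT = 3.1415927          # ~7 correct digits, like an IEEE single
x = 355.0
r_bad = x - 113 * PI_SHORT    # reduced argument using the short pi
r_good = x - 113 * math.pi    # reduced argument using double pi

# Since 113 is odd, sin(355) = sin(113*pi + r) = -sin(r) ~ -r.
print(math.sin(x), -r_bad, -r_good)
```

This is why a serious implementation reduces the argument against an extended-precision representation of π (or π/2), not the working-precision constant.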