Replacing small floats with integer maths


Evening all,

 

I suspect I'm just being stupid, and the answer to this is simple.

 

I've done plenty of integer maths in a variety of applications to replace floating point arithmetic on small embedded platforms, without FPU hardware etc. For example:

 

var * 237 / 100; // Replaces var * 2.37

 

So much so - I've always avoided FP math, and always used integer math! However, I have an application right now where I am filtering analog data using a 4-pole Butterworth FIR filter - and the filter coefficients are all very small floating point values, e.g.:

 

0.003284735
-0.00112346

 

So... in this case - I can't just scale the values up until they can be truncated into suitable positive integers - as the output values are so large (e.g. 10^300). For example, my input data is only between 0 - 4095 - so if I multiply 0.003284735 by 10,000 to give 32.84, rounded to 33, then that's OK, but the equation for a 4-pole Butterworth digital filter has multiple terms, which end up with me multiplying multiple large numbers by other large numbers - and ending up at 10^300 type number sizes.

 

So my question is - it's always been my school of thought that FP math should be avoided as far as possible - or is this one of those applications where that simply isn't true, and FP is the solution?

This topic has a solution.
Last Edited: Thu. Apr 15, 2021 - 10:22 PM

I'm dubious about your description of the problem.

What would you do with 10**300?

 

Are you evaluating a polynomial with real coefficients?

If so, and possibly if all else fails, factor the polynomial into factors of the forms

b[0], x-a[j] and b[j]+(x-a[j])**2 .

Sort the a[j]'s.

Between any consecutive pair of a[j]'s you should be able to find a scaling system that works.

Having multiple formulas raises the possibility of jump discontinuities at the boundaries.

Clearly that is not an issue near the zeros.

It might be an issue near the a[j]'s of the quadratic factors.

Moderation in all things. -- ancient proverb


jtw_11 wrote:
... as the output values are so large (e.g. 10^300).
Saturation is possible in fixed-point math.

jtw_11 wrote:
For example, my input data is only between 0 - 4095 ...
Precision out is, at best, precision in.

Noise will have some effect.

jtw_11 wrote:
... - or is this one of those applications where that simply isn't true, and FP is the solution?
Probably not.

Is Butterworth required?

 


Saturating :

GitHub - johnmcfarlane/cnl: A Compositional Numeric Library for C++ via The Embedded Muse 361 - Tools and Tips

Non-saturating :

Fixed-Point Support - avr-gcc - GCC Wiki

 

Another Look at Fixed versus Floating Point

[next to last paragraph]

...

However, there are some applications where the extra performance of floating point is helpful, and may even be critical. 

[audiophile, IIR]

...

The Chebyshev and Butterworth Responses

 

"Dare to be naïve." - Buckminster Fuller


 

Thanks both,

 

I am evaluating a linear difference equation with real coefficients, yes.

 

The linear difference equation is that given in SAE J211 to realise a 4-pole (when run twice, once forwards and once backwards, to avoid phase shift) Butterworth filter required to achieve a CFC60 filter class. The requirement is to filter a number of 12-bit analog channels to the CFC60 specification which requires use of a particular Butterworth filter (strictly, other methods have been used successfully to achieve CFC60 filtering, such as FFT).

 

The application is not a high-precision system - simply that the coefficients mandated by SAE J211 for the required channel frequency class are very small floating point values.

 

Fundamentally I'm trying to move a perfectly functioning application running on x86 over to a small embedded target, where fundamental application re-design is not possible. You're right, I have no need for 10^300 numbers - it is simply that if I scale the coefficients up large enough to use fixed-point, the SAE J211 given Butterworth filter generates values of this magnitude.

 

Running on x86 with floating point, all my values are in the low 1000s. Perhaps I'm being dense - the bit I'm struggling with is how to use fixed-point math in place of floating-point when one of the multiplicands is < 1.

 

EDIT - Formatting sorted on desktop, mobile formatting is horrible!

 

EDIT 2 - For those interested further, the following DIAdem page includes excerpts from the SAE standard https://zone.ni.com/reference/en... (though the full standard is a Google away...). If I were only multiplying say 8 * 0.00323, then handling that in fixed point is no challenge at all. The difference equation used by the filter, however, fundamentally ends up generating massive outputs, as it multiplies numbers in the range of a few thousand by the coefficients (which, when scaled up, are at biggest circa 18,000) - then multiplies that result by similarly large coefficients.
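
For clarity, the recursion being evaluated each pass is the usual second-order difference equation - a rough C sketch only, assuming the standard biquad form and re-using the coefficient names from my table below (the function and variable names are made up):

// rough float sketch of one pass of the 2-pole section (biquad form assumed)
const float b0 = 0.001460316f, b1 = 0.002920633f, b2 = 0.001460316f;
const float a1 = 1.889033079f, a2 = -0.894874345f;

float butter_pass(float x0)
{
    static float x1, x2;                  // previous two inputs
    static float y1, y2;                  // previous two outputs
    float y0 = b0*x0 + b1*x1 + b2*x2 + a1*y1 + a2*y2;
    x2 = x1; x1 = x0;                     // shuffle the history along
    y2 = y1; y1 = y0;
    return y0;
}

The y1/y2 feedback terms are the reason any scale factor left on the output compounds sample after sample - which is exactly where the 10^300 figures come from.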

 

 

For example, the first 0.002s of my dataset are below, and the coefficients - processed in floating point. This results in a CFC60 filtered output (with poor filter start-up handling, but of no importance in this application as start and end data is disregarded). However, if I simply scale the filter coefficients up from the floating point values, to sensible fixed point values, as per below...

 

Time	ADC value	Second Pass Backwards (Filter output)
0	2048	0
0	2048	0
0	2048	1177.662965
0.0001	2048	1110.518683
0.0002	2048	1041.605985
0.0003	2048	971.1669512
0.0004	2048	899.4820646
0.0005	2048	826.8728888
0.0006	2048	753.7047966
0.0007	2048	680.3897317
0.0008	2048	607.3889903
0.0009	2048	535.2160079
0.001	2048	464.4391325
0.0011	2048	395.6843663
0.0012	2048	329.6380519
0.0013	2048	267.0494808
0.0014	2048	208.7333976
0.0015	2048	155.5723706
0.0016	2048	108.5189988
0.0017	2048	68.59792031
0.0018	2048	36.90758736
0.0019	2048	14.62176711
0.002	2048	2.990727794

 

T	0.0001
wd	785.3981634
wa	0.039290107
b0	0.001460316
b1	0.002920633
b2	0.001460316
a1	1.889033079
a2	-0.894874345

to

wd	7853981
wa	392
b0	14
b1	29
b2	14
a1	18890

 

Then the same algorithm generates values so huge - that even if I run this in Excel, the numbers get so big, so quickly - that Excel won't even display them! I get to approx 10^300 during the first pass before Excel returns #NUM!, and nothing on the second pass - as the first pass values * almost any value = #NUM!.

 

Time	First Pass Forwards	Second Pass Backwards (Filter output)
0	0	0
0	0	0
0	116708	#NUM!
0.0001	2204730784	#NUM!
0.0002	4.16463E+13	#NUM!
0.0003	7.86679E+17	#NUM!
0.0004	1.486E+22	#NUM!
0.0005	2.80698E+26	#NUM!
0.0006	5.30226E+30	#NUM!
0.0007	1.00157E+35	#NUM!
0.0008	1.89192E+39	#NUM!
0.0009	3.57375E+43	#NUM!
0.001	6.75064E+47	#NUM!
0.0011	1.27516E+52	#NUM!
0.0012	2.40873E+56	#NUM!
0.0013	4.54997E+60	#NUM!
0.0014	8.59467E+64	#NUM!
0.0015	1.62349E+69	#NUM!
0.0016	3.0667E+73	#NUM!
0.0017	5.79285E+77	#NUM!
0.0018	1.09424E+82	#NUM!
0.0019	2.06697E+86	#NUM!
0.002	3.90441E+90	#NUM!

 

Edit - the really keen eyed amongst you will notice the DIAdem page shows coefficients a0, a1, a2, b0 & b1 - whilst my calculations show b0, b1, b2, a1 & a2. The 'modern' ISO equivalent of SAE J211 handily flipped a and b.

 

Last Edited: Wed. Apr 14, 2021 - 11:49 PM

Here is a longish and fairly detailed thread about fixed point arithmetic that might provide some insights:

 

https://www.avrfreaks.net/forum/...

 

Jim

 

Until Black Lives Matter, we do not have "All Lives Matter"!

 

 


ka7ehk wrote:

Here is a longish and fairly detailed thread about fixed point arithmetic that might provide some insights:

 

https://www.avrfreaks.net/forum/...

 

Jim

 

Thanks Jim, will read and absorb now!


jtw_11 wrote:
Fundamentally I'm trying to move a perfectly functioning application running on x86 over to a small embedded target, where fundamental application re-design is not possible.
Had a similar issue thirty plus years ago (absolutely had to run on a 16b MPU, algorithms trialed on a 32b minicomputer, many scaling issues)

jtw_11 wrote:
the bit I'm struggling with is how to use fixed-point math in place of floating-point when one of the multiplicands is < 1.
Some computer languages have fixed-point math; it may be arriving in C++23 (CNL)

jtw_11 wrote:
EDIT - Formatting sorted on desktop, mobile formatting is horrible!
Better on a tablet than a smart phone?

 


Fixed-point types | More about types — learn.adacore.com

 

A smart phone has a docking station :

PINEPHONE USB-C Docking Bar - PINE STORE

 

"Dare to be naïve." - Buckminster Fuller

This reply has been marked as the solution. 

gchapman wrote:

Had a similar issue thirty plus years ago (absolutely had to run on a 16b MPU, algorithms trialed on a 32b minicomputer, many scaling issues)

 

Then you feel the pain! I have no hard constraint - I'm simply working out pre-hardware if I need to go 32-bit or not.

 

So, I've cracked it (in Excel at least, running scaled integer maths). I simply perform x * scaled coefficient / scaling factor for each term in the difference equation, rather than scaling at the end. This way I avoid saturation - HOWEVER, and interestingly, I suspect I'll actually need more CPU cycles than simply using floating-point, as I've now got 10x 32-bit divisions (in addition to 10x 32-bit multiplies) to do for each run through the algorithm, and we all know 32-bit division isn't the AVR's strong point!

 

...I think I may have just stumbled across my first example of floating-point genuinely being the faster solution over fixed point.
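
In code terms the per-term scale-back looks roughly like this (a sketch only - SCALE, the upper-case names and the types are mine, with the earlier coefficients times 100,000 and rounded):

#define SCALE 100000L

const int32_t B0 = 146, B1 = 292, B2 = 146;      // 0.001460316, 0.002920633, 0.001460316 * SCALE
const int32_t A1 = 188903, A2 = -89487;          // 1.889033079, -0.894874345 * SCALE

int32_t x0 = 2048, x1 = 2048, x2 = 2048;         // current and previous two 12-bit samples
int32_t y1 = 0, y2 = 0;                          // previous two outputs

// scaling each term back individually keeps every intermediate inside int32_t
int32_t y0 = (x0 * B0) / SCALE
           + (x1 * B1) / SCALE
           + (x2 * B2) / SCALE
           + (y1 * A1) / SCALE
           + (y2 * A2) / SCALE;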

Last Edited: Thu. Apr 15, 2021 - 12:31 AM

jtw_11 wrote:
So, I've cracked it (in Excel at least, running scaled integer maths).
Well done!

jtw_11 wrote:
I'm simply working out pre-hardware if I need to go 32-bit or not. ... and we all know 32-bit division isn't the AVR's strong point!
fyi, AVRxt is a slightly faster CPU and some AVRxt parts have a significantly faster clock; AVRxm is available, though with a lead-time issue.

Small SAM parts are one way to follow on from AVR, and you might also consider PIC24/dsPIC (dsPIC33C are fast and recent); these have hardware stack overflow detection, whereas on some AVRs it's done via the OCD.

 

Instruction Set Summary | AVR® Instruction Set Manual

[Table 1]

XMEGA Lead Time, Dec'20 | AVR Freaks

New Digital Signal Controller (DSC) Accelerates DSP Performance for Time-Critical Control Applications | Microchip Technology | Microchip Technology

Build Larger, More Robust Applications with Microchip’s Expanded Dual- and Single-core dsPIC Digital Signal Controller (DSC) Family | Microchip Technology | Microchip Technology

Fixed-Point Math Functions | MPLAB XC16 Libraries Reference Guide
CPU Performance Benchmark – MCU Performance Benchmark – CoreMark – EEMBC Embedded Microprocessor Benchmark Consortium (enter Microchip, sort CoreMark)

Are We Shooting Ourselves in the Foot with Stack Overflow? « State Space

Stack Overflow Detection Using Data Breakpoint | Microchip Studio

 

"Dare to be naïve." - Buckminster Fuller


jtw_11 wrote:
I get to approx 10^300 during the first pass...

That would exceed the range of a 32 bit float, max is approx 10^38.

A 64 bit double is max approx 10^308.


 

If I take your:

0.003284735

and put it into here:

 

https://www.calculatorsoup.com/c...

 

it tells me that 0.003284735 = 656947 / 200,000,000.

So if your max value is 4095 that should all fit into a uint32_t, shouldn't it? 656947 * 4095 = 2,690,197,965, and the max value of a uint32_t is 4,294,967,295. The 200,000,000 divisor also fits into uint32_t, so for a multiplication by 0.003284735 it would surely just be:

uint32_t result = ((uint32_t)ADC * 656947) / 200000000;

and that should not overflow. By the same token, the other coefficient, -0.00112346, reduces to 56173 / 50,000,000, which again fits into uint32_t without overflow.

 

(But if some factor was huge/tiny there's always the possibility to spill into uint64_t.)

 

Last Edited: Thu. Apr 15, 2021 - 08:46 AM

jtw_11 wrote:
I think I may have just stumbled across my first example of floating-point genuinely being the faster solution over fixed point.

On a device with FPU, it is quite likely that floating-point will be faster than fixed-point.

 

if I need to go 32-bit or not

If you're considering one with FPU, be sure to check whether it's single or double - if you need double, and the FPU is only single, it'll end up being done in software again.

 

See, for example: https://mcuoneclipse.com/2019/03/29/be-aware-floating-point-operations-on-arm-cortex-m4f/
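
A minimal illustration (nothing target-specific assumed; the values are made up): in C an unsuffixed constant is a double, so it's easy to drag double-precision software maths into what looks like single-precision code:

float sample = 1234.0f;
float y_slow = sample * 2.37;    // 2.37 is a double constant, so the multiply is done in double -
                                 // i.e. in software on a single-precision-only FPU
float y_fast = sample * 2.37f;   // 2.37f keeps the whole expression in single precision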

Top Tips:

  1. How to properly post source code - see: https://www.avrfreaks.net/comment... - also how to properly include images/pictures
  2. "Garbage" characters on a serial terminal are (almost?) invariably due to wrong baud rate - see: https://learn.sparkfun.com/tutorials/serial-communication
  3. Wrong baud rate is usually due to not running at the speed you thought; check by blinking a LED to see if you get the speed you expected
  4. Difference between a crystal, and a crystal oscillator - https://www.avrfreaks.net/comment...
  5. When your question is resolved, mark the solution: https://www.avrfreaks.net/comment...
  6. Beginner's "Getting Started" tips: https://www.avrfreaks.net/comment...

BTW 0.003284735 * 4095 is 13.450989825 but if you are working with integers that is just 13. So your entire 0..4095 reading range just got granularised into 14 steps which is going to make things very coarse from this point on!
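
A rough sketch of the usual way round that (the 2^16 scale factor here is arbitrary, purely for illustration): keep the intermediate result scaled up and only divide down at the very end:

// 0.003284735 ~= 215 / 65536   (round(0.003284735 * 65536) = 215)
uint32_t adc    = 4095;
uint32_t scaled = adc * 215UL;   // 880425 - still carries the fraction (880425/65536 ~= 13.43)
uint32_t result = scaled >> 16;  // 13 - the fraction is only lost when you finally divide down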


Just some fast comments (I have not read the link from #5).

 

Since it's a FIR filter (and not IIR) you can calculate and scale it so it doesn't overflow but still uses the dynamic range (send a tone with max amplitude at the frequency of max gain of the filter).

(If you use calculations that can detect overflow you can probably use a gain of about 125% with saturation.)

Do the calculations with the positive and negative coefficients separately (you gain a bit for free).

Since you only have 12-bit input, don't overkill the accuracy of the filter; I guess 16 bits makes sense.

Last Edited: Thu. Apr 15, 2021 - 09:14 AM

Quote:
Saturating :
Opus codec

The creators of Opus faced the difficulties of wide precision (PSTN to band practice by VoIP), portability (C89), and real-time operation (frame duration of at least a few milliseconds).

Opus is on PIC32 via the Harmony framework.

 

https://github.com/xiph/opus/blob/master/celt/fixed_generic.h#L108

https://github.com/xiph/opus/blob/master/README#L136

[portability]

[fixed-point is reduced quality and slower than fast FPU]

Audio and Speech | Microchip Technology

 


Opus Codec

 

"Dare to be naïve." - Buckminster Fuller


Keep in mind you can trade accuracy for speed by making sure you divide by a power of 2, which should be much more efficient.

 

var * 237 / 100; // Replaces var * 2.37
var * 303 / 128; // Replaces var * 2.367188
var * 2427 / 1024; // Replaces var * 2.370117
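
With an unsigned value the compiler then reduces the power-of-2 divide to a plain shift - a quick sketch (signed values need care, as / and >> round differently for negatives):

uint16_t var = 4095;
uint32_t a = (uint32_t)var * 2427 / 1024;    // dividing by a power of 2...
uint32_t b = ((uint32_t)var * 2427) >> 10;   // ...is just a 10-bit right shift when unsigned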

 

C: i = "told you so";


No one has asked yet - so I will.

Does the AVR implementation using floats run fast enough for your application ?

 

If it does, then your work is done.

avr-libc is blessed with a small-footprint, highly performant floating point library. Where applicable, you should not feel guilty about using it.

 


Why not just use a factor of 2**16 on all the coefficients?

All the products will fit in 32-bit.

If ordered right, so will all the intermediates.
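
A minimal sketch of that idea, using the coefficient values quoted earlier in the thread (the rounding, names and types are mine):

#include <stdint.h>

// coefficients scaled by 2**16 and rounded
const int32_t B0 = 96, B1 = 191, B2 = 96;      // 0.001460316, 0.002920633, 0.001460316
const int32_t A1 = 123800, A2 = -58646;        // 1.889033079, -0.894874345

int32_t x0 = 2048, x1 = 2048, x2 = 2048;       // current and previous two 12-bit samples
int32_t y1 = 0, y2 = 0;                        // previous two outputs

// each product of a ~12-bit value and one of these coefficients stays inside int32_t,
// and so does their sum, so one scale-back at the end is enough
int32_t acc = x0*B0 + x1*B1 + x2*B2 + y1*A1 + y2*A2;
int32_t y0  = acc >> 16;                       // arithmetic shift with avr-gcc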

Moderation in all things. -- ancient proverb


I have not checked the speed of the GCC float lib, but from a pure-maths point of view it should be faster (or as fast) to do a filter with 32-bit float than with 32-bit int (if you can avoid the lib normalising results where it isn't needed; after all, a 32-bit float only uses 24 bits for the actual calculation).


It sounds like you are doing something inherently wrong - you should be able to readily scale things so they fit easily into 32 bit fixed values (that's 1 part in 4 billion if you use the full range).  If you are getting wildly large values, like 10^40 (that's a 40 digit number!!), you are first off doing something completely wrong, before you even get to the question of floating vs integer.

 

If part of a summation gives 573456 and another part gives  0.00002311, one part will be going into the trash!

 

Also don't get tripped up by not paying close attention to the fake decimal point.

for example 

 xyz = xyz * 0.31      ...repeat this 20 times, you will generally get a very very small result.

 

 xyz = xyz * 3.10      ...repeat this 20 times, you generally will get a very very LARGE result.

 

Say you used "31" to represent 0.31, ensure you are not mistakenly implementing it as 3.1 in your math system  ...you have to pay close attention to what you are doing.

That is one advantage of floating point: you can be rather sleepy & inattentive and likely still get a proper result.  Integer methods require care & attention to detail. Don't forget to propagate any carries from the low bytes to the high bytes :)
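
A made-up illustration of that trap - the same stored integer means different things depending on where the virtual point sits, and the scale-back has to match:

uint16_t x = 200;
uint16_t good = (uint32_t)x * 31 / 100;   // 31 stands for 0.31 (scale 100): result 62, i.e. 200 * 0.31
uint16_t bad  = (uint32_t)x * 31 / 10;    // same 31 treated as 3.1 (scale 10): result 620, ten times too big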

 

When in the dark remember-the future looks brighter than ever.   I look forward to being able to predict the future!

Last Edited: Thu. Apr 15, 2021 - 10:03 PM

clawson wrote:
and put it into here:

 

A very useful tool, thanks!

 

clawson wrote:

BTW 0.003284735 * 4095 is 13.450989825 but if you are working with integers that is just 13. So your entire 0..4095 reading range just got granularised into 14 steps which is going to make things very coarse from this point on!

 

Indeed, I've actually ended up scaling up by 100,000 to prevent introducing horrific granularity into the computation.

 

cpluscon wrote:

Keep in mind you can trade accuracy for speed by making sure you divide by a power of 2, which should be much more efficient.

 

var * 237 / 100; // Replaces var * 2.37
var * 303 / 128; // Replaces var * 2.367188
var * 2427 / 1024; // Replaces var * 2.370117

 

 

Yes this makes perfect sense to me now, having looked at the disassembly for division by powers of 2; it seems this is done simply by shifting (I've also learned that those architectures with a barrel shifter can do this in a single cycle, rather than a cycle per shift!).

N.Winterbottom wrote:

No one has asked yet - so I will.

Does the AVR implementation using floats run fast enough for your application ?

 

If it does, then your work is done.

avr-libc is blessed with a small-footprint, highly performant floating point library. Where applicable, you should not feel guilty about using it.

 

Aha - the million dollar question. Well, truth be told - yes. The application processes multiple channels at circa 2kHz - so I'm at approx 70% CPU utilisation, but yes - floating point is fast enough! Though, I'm still inclined to try and bring the 70% number down a little...

 

sparrow2 wrote:

I have not checked the speed of the GCC float lib, but from a pure-maths point of view it should be faster (or as fast) to do a filter with 32-bit float than with 32-bit int (if you can avoid the lib normalising results where it isn't needed; after all, a 32-bit float only uses 24 bits for the actual calculation).

 

Correct, this is what I have found - so a worthwhile educational exercise to have gone through this - but float does appear to be the best solution.

avrcandies wrote:
It sounds like you are doing something inherently wrong

 

Well, I think you're right! I've marked my post as solution - fundamentally the issue was needing to scale back each term of the difference equation individually, not the final value.


...though the last thing I haven't tried is thinking more about Cliff's answer. How would I go about calculating (or does anybody know of such a calculator, with an explanation) a denominator that is a power of 2? That way, using shifts alone to perform the division, I may be able to bring the cycle consumption down.

 

https://www.calculatorsoup.com/c...

 

For example, for the below small float the above calculator generates these integers, but is there a way that I can 'force' the denominator to be a power of 2?

 

 

0.00274522=137261/50000000

 

As a fudge, I could always pick the nearest power of 2 to 50,000,000 - and transpose to calculate the numerator.

 

So, 67,108,864 (2^26) is the next power of 2 above 50,000,000.

 

0.00274522 * 67,108,864 = 184,228.59563008, round to 184229. This calculates 0.00274522602558135986328125.

 

Then I simply need to perform 32-bit multiplies, and divide by shifting by powers of 2... Interesting! Will go count some cycles!
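
i.e. roughly this (a sketch only - the names are mine, and 4095 * 184229 still fits comfortably in a uint32_t):

#define COEFF_NUM    184229UL    // round(0.00274522 * 2^26)
#define COEFF_SHIFT  26          // because 2^26 = 67,108,864

uint32_t adc    = 4095;
uint32_t result = (adc * COEFF_NUM) >> COEFF_SHIFT;   // 11, i.e. 4095 * 0.00274522... truncated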

 

 


How would I go about calculating (or does anybody know of such a calculator, with an explanation) a denominator with a power of 2.

You seem a bit uncertain... you don't have to do much except apply (multiply by) a scale factor that's some power of 2 to your numbers... the key is that a power of 2 is easy to remove (undo) later on - either partially or fully.

 

For example you can mult by 1024 (10 bits) & later divide  by 256 (8 bits) to pick up 2 bits of extra resolution in your answer.  Dividing by 256 is especially great since that means you are simply discarding/ignoring/tossing the lowest byte from a multibyte result ---so you barely have to do anything at all to divide by 256, you don't even have to shift.  For example  0x4C7A /256(dec)  is just  0x4C  !!  0x2AF45B18 /256(DEC) is 0x2AF45B

You can do similar when dividing by 65536 - two bytes get tossed.
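
In C that is just a shift by a whole number of bytes, which the compiler turns into "ignore the low byte(s)" rather than any real shifting - a quick sketch of the examples above:

uint16_t v = 0x4C7A;
uint8_t  q = v >> 8;          // 0x4C - dividing by 256 is just keeping the high byte
uint32_t w = 0x2AF45B18;
uint32_t r = w >> 8;          // 0x2AF45B - same idea on a multibyte value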

When in the dark remember-the future looks brighter than ever.   I look forward to being able to predict the future!

Last Edited: Fri. Apr 16, 2021 - 01:17 AM

jtw_11 wrote:
Indeed, I've actually ended up scaling up by 100,000 to prevent introducing horrific granularity into the computation.
That *will* then take you into uint64_t territory. 

 

I'd just stick with float. If you start doing 64 bit integer mathematics there's every chance you might start finding it taking as long as doing this with float.


clawson wrote:

That *will* then take you into uint64_t territory. 

 

I'm not following, why would this be the case? The largest filter coefficient * 100,000 is only 188903.3079 (rounded to 188903). 188903 * 4096 fits within a 32bit int?

 

In the end, I've settled on scaling the raw 12-bit ADC readings up by 128, and performing power of 2 division on scaled multiples of the coefficients. This method keeps all my values within 32bits, and looks like it'll take approx 60% of the cycles using float would. Just about to start running some tests in the simulator first.

 

Scaling many of the coefficients by up to 2048, I've been surprised at just how small the loss of precision is; fluctuating between +/- 0.05% across my whole dataset. Will report back results on int vs float.


scaling the raw 12-bit ADC readings up by 128

Sorry, but I don't really see how adding 0's to an integer actually adds any information! (Do all the shifts on the coefficients.)


sparrow2 wrote:

scaling the raw 12-bit ADC readings up by 128

Sorry, but I don't really see how adding 0's to an integer actually adds any information! (Do all the shifts on the coefficients.)

 

It adds nothing, but prevents loss of something.

 

I was baffled at first too... After much analysis last night, however - any loss of precision, and thus error (error between the coefficients when calculated using floats, or as /power-of-2 integers), greater than 10^-5 in the multiplication of the coefficients results in enormous errors in the output dataset, as the algorithm given in SAE J211 multiplies the terms with the errors again and again, resulting in rapid propagation of huge filter output errors.

 

For example, imagine a raw data sample of 2047... Say my coefficients have been scaled up by 2048, so 0.001460316 becomes 2.99, rounded to 3.

 

2047 * 3 / 2048 = 2.99. For this input data, the loss of resolution is not significant.

 

Pick another random input... 1800 * 3 / 2048 = 2.63, rounds to 3 - enormous error. Now, I can get around this by increasing the scale factor up to very high powers, OR - what I've noticed requires less scaling, and less shifting - is scaling the input up.

 

1800 * 128 = 230,400.

230,400 * 3 / 2048 = 337.5

 

I can then simply work with leaving my data scaled up, and when offloading to PC, scale it all back again - and therefore reduce the number of shift operations in the maths in the embedded target massively.

 

337.5 / 128 = 2.63, no loss of precision when the data is extracted from the embedded device.

 

 

 

 


A fact which is easily forgotten when scaling numbers to increase integer precision: you need to (mentally) scale your expectations as well...

 

That's a bit gnomic, I know. But when you scale on this kind of thing, while you are scaling, you are also changing the representation of your values; what your values actually mean.

 

As an example, with something like a digital filter, there is always some data at the front which has been collected from some sort of *usually* analogue system - and the two critical points here are (a) that there is a maximum and minimum value for that data, and (b) that this value is represented by a maximum number of bits - maybe 8, 12, 16, 24... the actual number doesn't matter but the fact that there *is* a maximum does.

 

The simple (best) approach is this: take that value range and treat it as if all the values represent points between zero and one. (You may need to consider negative values: in most such cases, simply add the half-scale value to the input, so 'zero' is the half scale value, with maximum negative and positive values at 0x00... and 0xff... respectively.) So your initial scaling may well be no scaling at all; it's just a change in how you understand the numbers. Or it may be that you provide an initial multiplication to use up all the bits in your value range - for example, multiplying a 0-4096 input by a constant 16 (shift left four bits) to fill a 16 bit variable. The point is, that from this point on, all your values can be considered as being fractional values between 0 and 1 (minus 1 bit). Their initial real world values are irrelevant at this point.

 

Now think about what happens when you add two of these numbers together: Your values are all in the virtual form 0.xxx where the 0 and the point are virtual. When you add two maximum values together the result will be 1.yyy; you have acquired a value which is not in your representation... damn.

 

// assume an eight bit input...
uint8_t v1 = 0xff;    // 0.99609375
uint8_t v2 = 0xff;    // also 0.99609375

// added together
uint8_t res8 = v1 + v2;    // res8 = 0xfe = 0.9921875, oh dear...
// but
uint16_t res16 = v1 + v2;   // res16 = 0x1fe = 1.9921875, correct answer
// we have created a virtual binary point between the two bytes of the uint16

Addition of two fractional values like this *always* produces an extra bit in the answer, even if that extra bit is a zero and you can't see it.

 

Now consider multiplication... when you perform a binary multiplication, the result always requires the same number of bits as the total number of bits in the two multiplicands:

 

// again, the inputs are 8 bits
uint8_t m1 = 17;    // = 0.06640625
uint8_t m2 = 88;    // = 0.34375

uint8_t res8 = m1 * m2;    // 0xd8; 0.84375, wait, what? The result doesn't fit...
// but
uint16_t res16 = m1 * m2;    // 0x5d8; now we have kept all our result, but we have also
                             // messed our scaling... the result is now 0x5d8 / 0x10000
                             // which is 0.022827148 in decimal

What's happened here? Obviously the result won't fit in the same size result; we need to use a larger variable to hold the result. In this case, 2 * 8 = 16... but by doing this we have moved the binary point as well. Our inputs were always below one, so the product must also always be below one, so the binary point has moved eight bits to the left. We have acquired an extra eight bits of precision, which we can either (a) just ignore, (b) round, or (c) keep for later.

 

This is always the case. If the input is scaled 0-1 to fit into *any* variable size, addition *always* requires an extra bit for the result; multiplication *always* requires as many more bits as the size of the smaller operand variable (not the value of the operand!). At the end of the calculation, you need to scale the result back to something you can use. Here's a generic FIR filter example. The algorithm requires the multiplication of each of a number of past samples with an equal number of filter coefficients and the totals summing; the past samples are then shifted by one and a new sample added.

 

// FIR filter generic example
// input values scaled 0-1 in 16 bits
// filter coefficients scaled 0-1 in 16 bits
// WARNING - in a real filter we need positive and negative coefficients

#include <stdint.h>

#define NUMBER_OF_TAPS 7

uint16_t inputs[NUMBER_OF_TAPS] = { 0 };   // some inputs will go here
uint16_t coeffs[NUMBER_OF_TAPS] = { 0 };   // coefficients go here

uint16_t fir_filter (uint16_t new_sample)
{
    uint64_t acc = 0;                        // holds the 64 bit filter sum
    for (int q = 0; q < NUMBER_OF_TAPS; q++)
    {
        acc += (uint32_t)inputs[q] * coeffs[q];  // cast so the product is 32 bit, not 16
        if (q < (NUMBER_OF_TAPS - 1))
        {
            inputs[q] = inputs[q + 1];       // do the shift at the same time
        }
        else
        {
            inputs[q] = new_sample;
        }
    }
    // now scale the final result back to 16 bits
    // the multiplications have given us 32 bit results, which have been summed
    // the summing may have placed data in the top 32 bits which we need to account for
    acc /= NUMBER_OF_TAPS;                   // now we have a 32 bit result
    return acc >> 16;                        // discard the lower 16 bits (or round them)
}

 

There's a bit of magic going on at the end there; the division by the number of taps. This is because we are using a 0-1 representation and we can't handle a result of more than one - the binary point is between the upper and lower 32 bits of the 64 bit 'acc'. In a worst case, we might end up with a value of almost seven in the accumulator (in this case) so we scale the result back to 0-1 by this division. Our filter therefore has a gain of 1/number of taps, which might not be what we want, but won't have any sudden discontinuities in the output. With a real-world filter e.g. low pass or high pass, the result will never use those higher bits and they can be ignored *for the output* though they are still required in the sum (unless you choose to use saturate maths which limits the output range).

 

Another point I skipped over: what about signed arithmetic? A real world filter will always have both positive and negative coefficients.

 

The same rules apply, but there are caveats:

  • the range for the input values is now +/- 0.5
  • the top bit is a sign bit, but the values have the same precision
  • when you multiply two signed numbers:
    • the number of output bits is still the total number of input bits (int8 * int8 = int16)
    • as your input is now -0.5 to 0.5, the result of a multiplication is +/- 0.25. Effectively you have two sign bits (see the sketch just after this list)...
  • when you add two signed numbers:
    • as with unsigned, summing can still overflow and needs to be accounted for
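
A tiny sketch of that signed multiply case (my own numbers, int8_t values scaled so that raw/128 is the represented value):

int8_t  a = -64;                 // represents -0.5
int8_t  b =  64;                 // represents +0.5
int16_t p = (int16_t)a * b;      // -4096; the product's scale is 1/(128*128) = 1/16384,
                                 // so this represents -0.25 - the magnitude can never
                                 // exceed 0.25, which is the "two sign bits" effect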

 

 

Meh, it's too early in the morning for this, but I think I got everything in that should be. The basic way of maintaining sanity is:

  • scale the input to either 0-1 or +/- 0.5
  • do the same for the filter coefficients
  • remember that addition can overflow
  • scale the output back to your required precision

 

Hope this helps someone,

 

Neil


There is something I don't get, so I'll try another way:

 

you say that you shift the raw data by 7 (multiply by 128); that doesn't do anything for precision!

It's the coefficients that need to be bigger!

(1800 * (128*3)) / 2048

Same result here, but I bet 384/2048 could be better (I guess), perhaps 385/2048 or ...

 

And when you use shifts, do them in byte-sized steps (n*8) - the AVR doesn't have a barrel shifter.

 

 

 

  


Hello, I recommend that you transfer your circuit design to a "Black Pill"; a 32-bit ARM CPU board based on the STM32F411CEU6.   This CPU has internal single-precision floating-point hardware that does FP in one to two cycles of the 100MHz system clock, plus all the flash and SRAM that you could need.  It sells in the same relative price range as high end AVRs, as seen in this eBay listing: https://www.ebay.com/itm/STM32F4...

Plus it is programmable via the Arduino Hardware Abstraction Layer ("open the pod bay doors, HAL") so you don't have a huge learning curve intrinsic to changing processor platforms.


Regarding Neil's comment on the need to have more bits when adding, avr-gcc is equipped for this by providing fixed-point types.   This includes the "fract" types for numbers < 1 and "accum" types for adding these together.  The thing I don't like is that the fract sizes are 1, 2 and 4 bytes and the accum types are 2, 4 and 8 bytes.  See the code snippet below, and notice the use of the literals 0.91001lr and 0.0lk.

 

For more info search the web for "ISO/IEC TR 18037" or "N1169". 

 

I played with this a bit, and came to the conclusion that for these digital signal processing applications, generating a fast routine to compute y = a*x + y is the way to go, where a, x, y are fixed point or floating point.  I think a lot could be saved by eliminating rounding in the FP computations, or by using 3/4-size fract/accum types in fixed point.  Note that the FP routines in avr-libc perform 40-bit FP computations (to process intermediate results) and rounding.

#include <stdint.h>
#include <stdfix.h>

void dump(long accum y);    // user-supplied routine that prints/inspects the result

long fract af[][6] = {
  { 0.91001lr, 0.80001lr, 0.70001lr, 0.60001lr, 0.50001lr, 0.40001lr, },
  { 0.91001lr, 0.15001lr, 0.05001lr, 0.02001lr, 0.00901lr, 0.00301lr, },
};

long fract xf[][6] = {
  { 0.81001lr, 0.70001lr, 0.60001lr, 0.50001lr, 0.40001lr, 0.30001lr, },
  { 0.81001lr, 0.05001lr, 0.04001lr, 0.01001lr, 0.00801lr, 0.00201lr, },
};

int main() {
  uint8_t i;
  long fract *a, *x;
  long accum y;

  y = 0.0lk;
  dump(y);

  a = af[0]; x = xf[0];
  y = 0.0lk;
  for (i = 0; i < 6; i++) {
    y += a[i]*x[i];          // fixed-point multiply-accumulate
  }
  dump(y);

  while (1);
}

 

Last Edited: Sat. Apr 17, 2021 - 02:44 PM

MattRW wrote:
The thing I don't like is that the fract sizes are 1, 2 and 4 bytes and the accum types are 2, 4 and 8 bytes.
Some computer languages have had fixed-point arithmetic for sixty years, such that the scaling can be made to match the application.

I'm glad that CNL exists for C++, though I'm unaware of a C++ compiler for PIC24/dsPIC (for PIC32 it's MPLAB XC32++).

For PIC24/dsPIC, Ada can be converted to C, though not at zero price.

 

fixed point numbers

GitHub - johnmcfarlane/cnl: A Compositional Numeric Library for C++ (CNL)

2.1. Supported Constructs | 1. Getting Started — GNAT Pro Common Code Generator User's Guide Supplement 22.0w documentation

 


Geek Like Me, Too: Fixed Point Math with AVR-GCC

likewise PIC32 :

ECE4760 fixed point (Cornell University, bottom for the MAC benchmark)

[3/4 page]

--2.30--

...

The dynamic range is sufficient for Butterworth IIR filters, made with second order sections (SOS).

..

 

Fixed-Point Math | The Embedded Muse 359

Tools and Tips | The Embedded Muse 360 (last for SystemC on C++)

The Embedded Muse 361 - Tools and Tips (CNL on C++)

About SystemC

 

"Dare to be naïve." - Buckminster Fuller


sparrow2 wrote:

There is something I don't get, so I'll try another way:

 

you say that you shift the raw data by 7 (multiply by 128); that doesn't do anything for precision!

It's the coefficients that need to be bigger!

(1800 * (128*3)) / 2048

Same result here, but I bet 384/2048 could be better (I guess), perhaps 385/2048 or ...

 

And when you use shifts, do them in byte-sized steps (n*8) - the AVR doesn't have a barrel shifter.

 

Increasing the multiplier just made a big difference - I was using 60 cycles to multiply the incoming data by 128, and have reduced that to 19 cycles to multiply it by 256; gcc generates the following at -O3, just 3 byte moves in between the loads and stores. Thanks very much! One thing I don't understand about the generated machine code: why clear R24? It's not like we're restoring a previous value that was in R24, and the next thing that needs to use R24 is going to write to it - so why the seemingly useless CLR R24?

 

		array[0] *= 256;
00000066  LDS R24,0x0100		Load direct from data space
00000068  LDS R25,0x0101		Load direct from data space
0000006A  LDS R26,0x0102		Load direct from data space
0000006C  LDS R27,0x0103		Load direct from data space
0000006E  MOV R27,R26		Copy register
0000006F  MOV R26,R25		Copy register
00000070  MOV R25,R24		Copy register
00000071  CLR R24		Clear Register
00000072  STS 0x0100,R24		Store direct to data space
00000074  STS 0x0101,R25		Store direct to data space
00000076  STS 0x0102,R26		Store direct to data space
00000078  STS 0x0103,R27		Store direct to data space 

 

I need to put a good example in here of why I'm scaling the input data... I know it doesn't gain me added precision out of thin air, of course - but I'm not doing a good job of explaining why I'm doing it!

 

EDIT - and I've just pulled another 118 cycles out of each computation of the filter, by bringing one of the coefficient scales down by 1 power to a byte width, for the loss of approx. 0.01% precision.

Last Edited: Sat. Apr 17, 2021 - 07:43 PM

R24 is the low byte of your 32-bit number; when you multiply by 256 it becomes zero.

That is why you don't gain anything!


sparrow2 wrote:

R24 is the low byte of your 32-bit number; when you multiply by 256 it becomes zero.

That is why you don't gain anything!

 

I don't understand, sorry! R24 is shifted into R25, I get that - making R24 zero? So why clear a zero byte?

 

What do you mean by "I don't gain anything"?


R27:R26:R25:R24 represent your 32-bit number, and that doesn't change, so when you <<8 (multiply by 256) R24 becomes 0.

 

The compiler treats a 32-bit number as 32 bits all the time, so it doesn't know that R24 isn't needed (because it's part of the number).

 

If you want better code you will need to write it in ASM, so the clr R24 can be removed and the sts 0x0100,R24 can be changed to sts 0x0100,R1 (or whatever register always holds 0 (zero) - I use R15).

 

You don't gain anything by multiplying an integer by a constant; it's the coefficient that gets more accurate when you multiply it by a constant.

 

 

Add:

your b0 is:

0.001460316

and you use 3/2048 = 0.00146484375

 

but if you use 766/2048 = 256 * 0.001461029052734375

 

which is a better number. 

 

 

 

Last Edited: Sun. Apr 18, 2021 - 06:55 AM

sparrow2 wrote:

 

Add:

your b0 is:

0.001460316

and you use 3/2048 = 0.00146484375

 

but if you use 766/2048 = 256 * 0.001461029052734375

 

which is a better number. 

 

 

I must be missing something, as the above approach increases the error in the filter output (compared to just using floats in Excel) to up to 10%. What I really don't understand, however, is that the error in a single example calculation, as per your reply above, reduces significantly (as we'd expect) - but in the real run through of the filter, the total opposite happens.

 

The only observation that I cannot fully make fit so far is that scaling up a coefficient such as the one below by 2048 yields a number very close to an integer (2.99 to 3); if we scale it up by a further 256, though, the resultant number is not close to the nearest integer.

 

b0	0.001460316	2048	2.990727794
b0	0.001460316	524288	765.6263152

Scaling the input up instead, to allow the coefficients to scale by powers of 2 that result in almost integer outputs, my filter runs with almost zero error - removing the input scale, and scaling the coefficients up further (and them not being close to an integer) results in enormous filter errors, back to my early problems above generating numbers around 10^300.

 

For the time being, I have a solution that fits comfortably in the number of cycles I need to select an MCU and progress the application - though I suspect I'll want to revisit this in the future to see if I can reduce the cycle count further.

 

Cheers!


sparrow2 wrote:
The compiler treats a 32-bit number as 32 bits all the time, so it doesn't know that R24 isn't needed (because it's part of the number)

 

Indeed, that's exactly my point... if I'm shifting left I'm shifting in zeros from the right, so agreed R24 becomes zero.  There are only 3 MOV operations, and instead of moving the low byte - it gets cleared? Answered my own question as I typed this response. The MOV operation moving R24 into R25 doesn't clear R24; so if we don't explicitly clear R24, we end up with the 3 low bytes of my original numbers shifted left 1 byte, with the lowest byte still in place, hence the need to clear it.


I really suggest that you, with pen and paper (or simulations in Excel), work out the dynamic range of ALL the parts of the calculations, and how much noise each calculation has.

 

And just a quick one: compared to what you are doing, my guess is that an optimal solution (ASM) would be more than 10 times more efficient, but since your sampling rate is slow perhaps it doesn't matter.