double/float questions

33 posts / 0 new
#1

Up till now I have not used double or float.

I wonder what the difference is between the two.
The manual states somewhere that both are 32-bit. I would have expected them to have different lengths, and therefore different maximum allowed values.

Also I would like to know whether I can use the sprintf() function to convert the value to a string, so that I can show the data on an LCD.

regards

#2

Strangely enough, when it says that double and float are the same, it means what it says.

So you call the regular library functions (which have double arguments) just as if you had 64-bit doubles.

Just remember that you will have poorer precision. Not that this often causes any problems.

If you want to do GPS calculations you need to apply some care: judicious use of fixed-point arithmetic for some calculations, and then a return to regular floats.

David.

#3

There is a slight difference. While the code will be identical whether you use "float" or "double" on the AVR, if you ported the same source to a GCC architecture that has 64-bit doubles (like x86), the "double" code would undoubtedly be more accurate than the "float" code.

Also, if 64-bit IEEE754 "double" is ever added to the AVR compiler, the same distinction would be seen. This might be seen as a bonus, or it might break existing code if you were reliant on the results that the more limited 32-bit type was producing - you might even have added your own routines (rounding etc.) on the basis that there were only about 7 significant digits of accuracy.

#4

Thanks for the explanation of double vs. float.
I indeed intend to use it to do GPS math.

So for possible future accuracy I had better use float rather than double.

The only thing I'm still very interested in is whether I can use sprintf() to convert the float to a string and then forward this string to a display.

My intention is to work with two coordinates and calculate the distance between them. My reference is Google Earth. I will first print out parts of the calculation, and then I hope to display the intermediate results so I can check whether my calculations are correct.

regards

#5

Yes. You can use sprintf() for formatting a string. Just make sure you link with the correct library.

If you do use the GPS strings, do the math in fixed-point. You will then get sufficient accuracy.

Regarding portability of code: if you use "%f" in your format string and cast the argument to (double), it should work with any size of double on any compiler. Of course, default argument promotion will ensure that any float arguments get promoted to double anyway.

So one day, if you port your code to an ARM or avr-gcc gets 64-bit doubles everything will be fine.

David.

#6

Note that in this regard, avr-gcc is not standards-compliant.
The C standards require more than 29 bits of precision for double.

"Demons after money.
Whatever happened to the still beating heart of a virgin?
No one has any standards anymore." -- Giles

#7

Michael,

What is the clause number in the standard that says that about 29 bits?

Cliff

#8

clawson wrote:
What is the clause number in the standard that says that about 29 bits?
DBL_EPSILON may be at most 1E-9.
A double must be able to represent 1.0 and a number in the half open range (1.0, 1.0+1E-9].
I got my information from the float.h chapter of The Standard C Library, by P. J. Plauger.


#9

I'm trying to find the sine of the maximum float value. The value returned is -0.5..., whereas the expected value is 0.9... What may be the reason for this?

#10


Show your test code.

EDIT: Later... don't bother, I tried this and got the same as you:

__FLT_MAX__ is 3.40282347e+38F so you are right, this should have resulted in 0.96267131 or similar.

Last Edited: Fri. Jun 28, 2019 - 12:01 PM
#11

Think about it.    The sine value repeats every 2 pi radians.

So a very big number divided by 6.28 is going to leave a remainder with poor resolution.

The way that f-p works is to have an exponent and a mantissa with so many digits of resolution.

That works well for most calculations, but a large exponent is going to swamp the decimal places.

3.402823466E+38 is the maximum value.   Subtract 3.14.    The result is not going to give you a sine value 180 degrees away.

In human terms:    34028234660000000000000 + 3.14 will still be 34028234660000000000000, not 34028234660000000000003.14.

 

David.

#12


I suppose one could argue that if some compilers can do it...

 

 

They all should? ;-)

 

But you can pretty much guarantee that the CRT for MSVC is a little more extensive than that for AVR-LibC...

 

http://svn.savannah.gnu.org/viewvc/avr-libc/trunk/avr-libc/libm/fplib/fp_rempio2.S?revision=2473&view=markup

#13

All math.h libraries should cope with FLT_MAX.

My point was that the result depends on dividing by 6.28 and using the remainder.

The decimal places disappear when you have a large exponent, e.g. 38.

 

I would not expect the least significant bit of a Sine value to be the same on every math library.

 

A bad example.    If the Sun is 93 million miles away would you care about the nearest inch?

A better example.   If you drive from London to Edinburgh you can count the wheel revolutions.    But the final angle of the tyre valve is not so easy.

 

David.

#14

Think about the precision required.  You'd need about 137 bits to represent __FLT_MAX__ to 0.001 (which seems a reasonable goal if you want fairly accurate sine results).  But since a float is only 32 bits, you are really working with a number like 3402823466?????????????????????????????.???????

#15

Looks correct to me: https://www.wolframalpha.com/input/?i=sine(340282346638528859811704183484516925440+mod+(2*pi))

 

(3.4028235e38 is only an approximation for __FLT_MAX__. The actual, exact value of __FLT_MAX__ is 340282346638528859811704183484516925440. That number modulo 2pi is approximately 5.734136, and the sine of that is approximately -0.5218765)

 

ETA: See also https://www.h-schmidt.net/FloatConverter/IEEE754.html. Set all of the bits to 1 except for the sign bit and the LSB of the exponent. That gives you the maximum value of a 32-bit IEEE754 float.

Last Edited: Fri. Jun 28, 2019 - 03:56 PM
#16

Simple exercise

 

float f1, f2, f3;

f1 = __FLT_MAX__;
f2 = __FLT_MAX__ - 1.0;
f3 = f1 - f2;

What is the value of f3?

#17

kk6gm wrote:

Simple exercise

 

float f1, f2, f3;

f1 = __FLT_MAX__;
f2 = __FLT_MAX__ - 1.0;
f3 = f1 - f2;

What is the value of f3?

f3 is exactly 0. The value 1.0 is rounded to 0 when it's subtracted from a much larger value (__FLT_MAX__), so that f1 == f2.

 

Last Edited: Fri. Jun 28, 2019 - 04:12 PM
#18

So f1 == (f1 - 1.0).  I think that proves the point.

#19

christop wrote:
f3 is exactly 0. The value 1.0 is rounded to 0 when it's subtracted from a much larger value (__FLT_MAX__), so that f1 == f2.
IEEE's implicit abstract machine does the arithmetic first and the rounding after.

If one worked at it, I suppose one might be able to define an equivalent formulation for rounding before subtraction.

I doubt one could do the same thing for multiplication or division.


#20

Floats represent numbers, not ranges, so sin(FLT_MAX) is meaningful, if not terribly useful. crlibm can compute sines of large doubles. It uses multiprecision arithmetic to compute angle*256/pi to high precision; the high-order bits are dropped. Another way would be to pre-calculate 2**n % (2pi) for all relevant values of n. If you want the result good to at least the penultimate bit, you will still need multiprecision arithmetic. Correctly rounded functions generally do.


#21

kk6gm wrote:

So f1 == (f1 - 1.0).  I think that proves the point.

Yes, for sufficiently large f1. I'm not sure what point you're trying to prove. It should be well understood that adding a small-magnitude floating-point number to (or subtracting it from) a large-magnitude one effectively rounds the small-magnitude number to 0.

 

The largest binary32 float is 3.40282346638528859811704183484516925440e38, and the next smaller number is 3.40282326356119256160033759537265639424e38. The difference between those is 2.0282409603651670423947251286016e31; that's the "step size" at that magnitude. Adding/subtracting any number sufficiently smaller than that will effectively be rounded to zero.

 

#22

The point being that trying to calculate sin(__FLT_MAX__) is a fool's errand.

#23

skeeve wrote:

christop wrote:
f3 is exactly 0. The value 1.0 is rounded to 0 when it's subtracted from a much larger value (__FLT_MAX__), so that f1 == f2.
IEEE's implicit abstract machine does the arithmetic first and the rounding after.

If one worked at it, I suppose one might be able to define an equivalent formulation for rounding before subtraction.

I doubt one could do the same thing for multiplication or division.

I haven't worked on the nitty-gritty details for a long time now, but I believe (proper) rounding before subtraction gives the same results as rounding after subtraction. If I remember correctly, proper rounding before subtraction/addition requires keeping at least 3 more bits in the temporary rounded result (guard bit, round bit, and sticky bit). (In the case of subtracting a very small number from a very large number, all of the bits of the shifted small number would be zero, but the sticky bit would be one.)

 

In my implementation, I did multiplication by calculating the significand result in a temporary buffer that's twice as wide as the floating-point register I was working with (64-bit register and a 128-bit temporary buffer) and then rounding the wider result afterwards. I'm not sure if it's even possible to properly round before multiplying (it may be possible to round while multiplying, though it doesn't buy much because you still have to calculate all of the least-significant bits to determine the value of the sticky bit).

 

#24

kk6gm wrote:

The point being that trying to calculate sin(__FLT_MAX__) is a fool's errand.

It very well may be, but it can still be computed close to the exact value (Cliff's example in post #10 shows it being computed correctly, and post #12 actually shows an incorrect result). IEEE754 doesn't require transcendental functions like sine to be rounded correctly, but they should still be computed within 1 ulp.

 

#25
$ avr-gcc -dM -E - </dev/null | grep DOUBLE
#define __SIZEOF_LONG_DOUBLE__ 4
#define __SIZEOF_DOUBLE__ 4
$ avr-gcc -dM -E - </dev/null | grep FLOAT
#define __FLOAT_WORD_ORDER__ __ORDER_LITTLE_ENDIAN__
#define __SIZEOF_FLOAT__ 4
$ avr-gcc --version
avr-gcc (GCC) 5.4.0

 

Last Edited: Fri. Jun 28, 2019 - 09:05 PM
#26

christop wrote:
skeeve wrote:

christop wrote:
f3 is exactly 0. The value 1.0 is rounded to 0 when it's subtracted from a much larger value (__FLT_MAX__), so that f1 == f2.
IEEE's implicit abstract machine does the arithmetic first and the rounding after.

If one worked at it, I suppose one might be able to define an equivalent formulation for rounding before subtraction.

I doubt one could do the same thing for multiplication or division.

I haven't worked on the nitty-gritty details for a long time now, but I believe (proper) rounding before subtraction gives the same results as rounding after subtraction. If I remember correctly, proper rounding before subtraction/addition requires keeping at least 3 more bits in the temporary rounded result (guard bit, round bit, and sticky bit). (In the case of subtracting a very small number from a very large number, all of the bits of the shifted small number would be zero, but the sticky bit would be one.)
I think you need about 2 bits past the lsb of the operand with the larger absolute value.

The second bit is the or of all the lower bits.


#27

christop wrote:
It very well may be, but it can still be computed close to the exact value (Cliff's example in post #10 shows it being computed correctly, and post #12 actually shows an incorrect result). IEEE754 doesn't require transcendental functions like sine to be rounded correctly, but they should still be computed within 1 ulp.
That is tricky, even in the range [0, pi/4].

It has been done, multiprecision arithmetic is required.


#28

skeeve wrote:

I think you need about 2 bits past the lsb of the operand with the larger absolute value.

The second bit is the or of all the lower bits.

 

No, you do need 3 bits (guard, round, and sticky) for proper rounding. See http://pages.cs.wisc.edu/~david/courses/cs552/S12/handouts/guardbits.pdf for a good explanation why we need a guard bit for proper rounding after subtraction.

 

#29

Similarly, how are the cosines of the maximum and minimum float values computed? The expected value is 0.55 for both, whereas the output obtained is 0.85.

Last Edited: Wed. Jul 24, 2019 - 12:56 PM
#30

 

Suggest you play with:

 

https://www.h-schmidt.net/FloatConverter/IEEE754.html

 

if you want to know how IEEE754 works. For example:

 

 

and the negative limit is:

 

EDIT: just realised that is EXACTLY what christop already suggested you do in #15:

christop wrote:
ETA: See also https://www.h-schmidt.net/FloatConverter/IEEE754.html. Set all of the bits to 1 except for the sign bit and the LSB of the exponent. That gives you the maximum value of a 32-bit IEEE754 float.

So which bit of that don't you understand?

 

(BTW the reason it is FE not FF in the exponent for the MIN/MAX is that FF is a special value used to denote NaN)

Last Edited: Wed. Jul 24, 2019 - 01:03 PM
#31

My doubt is: the cos of the maximum float value output by Atmel is 0.85. It was supposed to be 0.55. Why is that so?

I could observe that the output is exactly the inverse of the cos value.

#32

namratha wrote:
why is that so ?
Which bit of post 15 is it that you do not understand?

#33

clawson wrote:
(BTW the reason it is FE not FF in the exponent for the MIN/MAX is that FF is a special value used to denote NaN)
Also the infinities.
