## 32 bit multiply with 64 bit result needed. (NOT mulsi3?)

8 posts / 0 new
Author
Message

I need a routine to multiply two 32 bit unsigned integers and return a 64 bit result.
It appears that if I try doing

unsigned long long int r;
unsigned long int a,b;

and then r=a*b;

The compiler generates a call to a buitin function __mulsi3, which according to the
source only returns a 32 bit number.

Now what I am really trying to do is to perform a fixed point multipy of an
unsigned long number by a fixed point constant. The constant is represented by
a fixed point long with an assumed decimal point 24 bits to the right of bit 0. (IE: an
8 bit integer with a 24 bit fraction). I would discard the lowest 24 bits of the result
and keep the next 32 sig bits.

Short of modifing the assembler source for the library multiply routine (I am using a
mega32 cpu, so I think the mul instruction is available) to provide a true 32x32=64 bit
routine, is there another way?

Probably not, at least AFAIK.

I would definitely recommend bringing up this issue on the avr-libc-dev mailing list and/or submit a bug report on avr-libc.

the behavior is correct here. doing integer arithmetics
the compiler searches the right side for the variable
with the biggest size to decide the size of your result.
wich in your case is 32 bit.
the compiler cannot handle overflows by itself.

try the same with "unsigned long long a,b" or
use a typecast :

r = (unsigned long long)a * (unsigned lon long)b;

Hello,

The problem with 64 bit arithmetics is the low speed. I've implemented the signed fractional fixed point multiplication (1.31 x 1.31 => 1.63) in assembler resulting 177 cycles. The same function implemented in pure C with avr-gcc takes 1,408 cycles and that's a big difference. My expectations were between 3 and 4 times taking into account the discarded terms only. You should weight your options.

Regards,

Carlos.

Could you post your assembler multiply routine? It looks like it might be possible to use
the hardware 8x8 multiply combined with adding the fractional results (an example of a
16x16 multiply doing the same is in the AVR instruction reference manual). Even at 1408 cycles, at 16mhz that's still 11k operations/sec, I wouldn't need more than a 100 updates/sec in my application.

The problem with 64 bit arithmetics is the low speed. I've implemented the signed fractional fixed point multiplication (1.31 x 1.31 => 1.63) in assembler resulting 177 cycles. The same function implemented in pure C with avr-gcc takes 1,408 cycles and that's a big difference. My expectations were between 3 and 4 times taking into account the discarded terms only. You should weight your options.

Regards,

Carlos.

Here's my 32x32=64bit mult in ASM (92 cycles, but if You use movw to clear regs, = 90 cycles):

This is my unfinished math rutins (mult & add/sub only, and some bugs :):
This rutins use 32 bit mantissa and 8 bit exp a'la ZX Spectrum.

Anyone wants to finish it?

Here's my 32x32=64bit mult in ASM (92 cycles, but if You use movw to clear regs, = 90 cycles):

This is my unfinished math rutins (mult & add/sub only, and some bugs Smile:
This rutins use 32 bit mantissa and 8 bit exp a'la ZX Spectrum.

Anyone wants to finish it?

Thanks! Looks good, I thought there was a way to use the hardware multiply. Actually
if you think about it, the code is the same as the way you would do it on paper one decimal
digit at a time. I think I'll need to look at the register usage and maybe re-assign some
of the registers to make the routine C-callable. That will add some cycles, but the
total should still be well under 200.

VFX wrote:
Here's my 32x32=64bit mult in ASM (92 cycles, but if You use movw to clear regs, = 90 cycles): http://avr.vfx.hu/ http://www.vfx.hu/avr/download/mult32.asm This is my unfinished math rutins (mult & add/sub only, and some bugs :): http://www.vfx.hu/avr/download/m... This rutins use 32 bit mantissa and 8 bit exp a'la ZX Spectrum. Anyone wants to finish it? VFX. http://www.vfx.hu

Very old post, but it does worth improvement.

On your neat multiplication at http://www.vfx.hu/avr/download/mult32.asm, you could reduce few code bytes and clock cycles if you do first the multiplication of same byte positions, they don't require carry propagation since the results go directly to empty positions. Also it allows to eliminate the CLR R10~R15 on top, total of 15 instructions saved:

```MUL R2, R21
MOVW R9:R8, R1:R0

MUL R3, R22
MOVW R11:R10, R1:R0

MUL R4, R23
MOVW R13:R12, R1:R0

MUL R5, R24
MOVW R15:R14, R1:R0```

and then, do the crossing bytes multiplication as usual, ADD, ADC, etc.

Wagner Lipnharski
Orlando Florida USA