## 16 X 16 bit multiply to a 32 bit result

Hello all,

I have a question on how avr-gcc handles large (16x16) multiplies. I'm working with an XMega128A1 and am using avr-gcc 4.3.2.

I have a project that will require lots of 16X16 multiplies and in general I need all 32 bits of the result. Of course I need to use the smallest number of cycles.

I've tried several constructs. Here are two:

int ia, ib;
long lc;

lc = (long) ia*ib;

This one does not work. The code generated builds the 16 bit result and then sign extends that to 32 bits - not what I need.

lc = (long)ia * (long)ib;

This produces the correct results but wastes quite a few cycles. It first sign extends ia and ib, then calls the routine __mulsi3, which does a full 32x32 multiply. My guess is this requires ~50% more cycles than needed.

My question: is there a way to get the compiler to do just the 16 X 16 to 32 bit multiply? I can build my own using embedded assembly, but I was hoping that I won't have to do that.

Thanks Fred

As far as I know there is no predefined function that does 16 bits x 16 bits -> 32 bits. So the best option would be to use inline ASM, maybe together with a #define to make it more reusable.

If you are lucky someone already has such code.
I just did a slightly more special case: squaring a 10 bit number and adding the result to a 32 bit number.

I checked my test code and I got this slightly wrong. Both constructs I mentioned:
lc = (long) ia*ib;
lc = (long)ia * (long)ib;
generate the same full 32x32 bit code which takes way more cycles than are really needed.

The construct
lc = (long) (ia*ib);
generates a 16 bit result that is then sign extended.

I still need to generate the 16x16 to 32 bit multiply to cut cycles.

Can you get away with unsigned?

"SCSI is NOT magic. There are *fundamental technical
reasons* why it is necessary to sacrifice a young
goat to your SCSI chain now and then." -- John Woods

Take the long mul 32x32=32
look at the generated ASM code
And chop away all the extra code.
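
As a rough illustration of what is left once the unneeded parts of a 32x32 routine are chopped away, here is a portable C model of an unsigned 16x16 -> 32 multiply built from four 8x8 -> 16 partial products, one per AVR `mul`. The name `mul_u16x16_model` is made up for this sketch. It is meant for checking results on a PC and is not the code avr-gcc generates.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical reference model (name is illustrative only):
 * unsigned 16x16 -> 32 multiply from four 8x8 -> 16 partial products. */
static uint32_t mul_u16x16_model(uint16_t a, uint16_t b)
{
    uint8_t al = (uint8_t)(a & 0xFFu), ah = (uint8_t)(a >> 8);
    uint8_t bl = (uint8_t)(b & 0xFFu), bh = (uint8_t)(b >> 8);

    uint32_t result = (uint32_t)al * bl;      /* al*bl lands in bytes 0..1 */
    result += ((uint32_t)ah * bh) << 16;      /* ah*bh lands in bytes 2..3 */
    result += ((uint32_t)ah * bl) << 8;       /* cross terms land in bytes 1..3 */
    result += ((uint32_t)bh * al) << 8;

    /* Self-check against a plain widened multiply. */
    assert(result == (uint32_t)a * (uint32_t)b);
    return result;
}
```

A 32x32 -> 32 routine also has to handle the upper bytes of both operands, which are only sign or zero extension here. That extra work is what a dedicated 16x16 routine can skip.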

It _is_ possible to get the compiler to do this.

I have actually gotten the compiler to do this, but it doesn't normally do it because it will make your code bigger since there are more functions in libgcc. I wrote routines for both signed and unsigned multiplication for 16x16 to 32 bits to make it work, then told gcc about them.

The consensus seems to be that avr-gcc won't quite do what I want the way it is now. geckosenator has a patch that he's trying to get in that will do what I need. Until that patch is ready I've written a few inline assembly macros to do what I need. In case anyone else is interested, here is what I've come up with. There are 4: a signed and an unsigned 16X16->32 bit multiply, and a signed and an unsigned 8X16->24 bit multiply.

I've tried these out on the XMega simulator. I believe that they're solid. Of course, if I've done anything stupid, please let me know.

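/* Signed 16x16 -> 32. Each mulsu cross product is sign extended into r16 */
/* and added into bytes 1..3 of the result. */
#define MultiS16X16to32(longRes, intIn1, intIn2) \
asm volatile ( \
"clr r16 \n\t" \
"mul %A1, %A2 \n\t" \
"movw %A0, r0 \n\t" \
"muls %B1, %B2 \n\t" \
"movw %C0, r0 \n\t" \
"mulsu %B2, %A1 \n\t" \
"sbrc r1, 7 \n\t" \
"ser r16 \n\t" \
"add %B0, r0 \n\t" \
"adc %C0, r1 \n\t" \
"adc %D0, r16 \n\t" \
"clr r16 \n\t" \
"mulsu %B1, %A2 \n\t" \
"sbrc r1, 7 \n\t" \
"ser r16 \n\t" \
"add %B0, r0 \n\t" \
"adc %C0, r1 \n\t" \
"adc %D0, r16 \n\t" \
"clr r1 \n\t" \
: \
"=&r" (longRes) \
: \
"a" (intIn1), \
"a" (intIn2) \
: \
"r16" \
)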

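/* Unsigned 16x16 -> 32. r16 stays zero and only propagates the carry */
/* when each cross product is added into bytes 1..3 of the result. */
#define MultiU16X16to32(longRes, intIn1, intIn2) \
asm volatile ( \
"clr r16 \n\t" \
"mul %A1, %A2 \n\t" \
"movw %A0, r0 \n\t" \
"mul %B1, %B2 \n\t" \
"movw %C0, r0 \n\t" \
"mul %B2, %A1 \n\t" \
"add %B0, r0 \n\t" \
"adc %C0, r1 \n\t" \
"adc %D0, r16 \n\t" \
"mul %B1, %A2 \n\t" \
"add %B0, r0 \n\t" \
"adc %C0, r1 \n\t" \
"adc %D0, r16 \n\t" \
"clr r1 \n\t" \
: \
"=&r" (longRes) \
: \
"a" (intIn1), \
"a" (intIn2) \
: \
"r16" \
)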

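/* Signed 8x16 -> 24, sign extended to 32 bits. The low product is sign */
/* extended into bytes 2..3, then the high product is added into bytes 1..3. */
#define MultiS8X16to24(longRes, charIn1, intIn2) \
asm volatile ( \
"clr r16 \n\t" \
"mulsu %A1, %A2 \n\t" \
"sbrc r1, 7 \n\t" \
"ser r16 \n\t" \
"movw %A0, r0 \n\t" \
"mov %C0, r16 \n\t" \
"mov %D0, r16 \n\t" \
"clr r16 \n\t" \
"muls %A1, %B2 \n\t" \
"sbrc r1, 7 \n\t" \
"ser r16 \n\t" \
"add %B0, r0 \n\t" \
"adc %C0, r1 \n\t" \
"adc %D0, r16 \n\t" \
"clr r1 \n\t" \
: \
"=&r" (longRes) \
: \
"a" (charIn1), \
"a" (intIn2) \
: \
"r16" \
)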

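/* Unsigned 8x16 -> 24, zero extended to 32 bits. The high product is */
/* added into bytes 1..3 of the result; r16 stays zero. */
#define MultiU8X16to24(longRes, charIn1, intIn2) \
asm volatile ( \
"clr r16 \n\t" \
"mul %A1, %A2 \n\t" \
"movw %A0, r0 \n\t" \
"mov %C0, r16 \n\t" \
"mov %D0, r16 \n\t" \
"mul %A1, %B2 \n\t" \
"add %B0, r0 \n\t" \
"adc %C0, r1 \n\t" \
"adc %D0, r16 \n\t" \
"clr r1 \n\t" \
: \
"=&r" (longRes) \
: \
"a" (charIn1), \
"a" (intIn2) \
: \
"r16" \
)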
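
For anyone who wants to check the 8x16 macros above against known-good values, a plain C model of the same arithmetic might look like the following. The names `mul_u8x16_model` and `mul_s8x16_model` are invented for this sketch, and it assumes nothing about the AVR beyond the byte layout. It runs on any desktop compiler.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical reference model: unsigned 8x16 -> 24 bit multiply. */
static uint32_t mul_u8x16_model(uint8_t a, uint16_t b)
{
    uint8_t bl = (uint8_t)(b & 0xFFu);
    uint8_t bh = (uint8_t)(b >> 8);

    uint32_t low  = (uint32_t)a * bl;          /* a*bl lands in bytes 0..1 */
    uint32_t high = (uint32_t)a * bh;          /* a*bh lands in bytes 1..2 */
    uint32_t result = low + (high << 8);

    assert(result == (uint32_t)a * b);         /* agrees with a widened multiply */
    return result;
}

/* Hypothetical reference model: signed 8x16 -> 24 bit multiply. */
static int32_t mul_s8x16_model(int8_t a, int16_t b)
{
    uint8_t bl = (uint8_t)((uint16_t)b & 0xFFu);  /* low byte, treated as unsigned */
    int16_t bh = (int16_t)((b - bl) / 256);       /* high byte, signed, -128..127 */

    int32_t low  = (int32_t)a * bl;               /* signed x unsigned, like mulsu */
    int32_t high = (int32_t)a * bh;               /* signed x signed, like muls */
    int32_t result = low + high * 256;

    assert(result == (int32_t)a * b);
    return result;
}
```

Feeding the same inputs to a macro in the simulator and to the matching model on a PC gives a quick way to spot a missing carry or sign extension.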

Why are you sign extending? Use muls and mulsu to multiply signed numbers

Muls and mulsu both produce a signed 16 bit result. The middle operations (high byte of one * low byte of the other) require a signed 24 bit result. It cost me several cycles. Is there an easier/quicker way?
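
The partial products being discussed here can be written out in portable C as a sanity check. This is only a sketch: `mul_s16x16_model` is a hypothetical name, and the `int32_t` arithmetic stands in for the sign extension of each 16 bit cross product that the AVR code has to do by hand.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical reference model: signed 16x16 -> 32 multiply, split the way
 * the AVR hardware multiplier sees it (one mul, one muls, two mulsu). */
static int32_t mul_s16x16_model(int16_t a, int16_t b)
{
    uint8_t al = (uint8_t)((uint16_t)a & 0xFFu);  /* low bytes are unsigned */
    uint8_t bl = (uint8_t)((uint16_t)b & 0xFFu);
    int16_t ah = (int16_t)((a - al) / 256);       /* high bytes are signed, -128..127 */
    int16_t bh = (int16_t)((b - bl) / 256);

    int32_t result = (int32_t)al * bl;            /* mul:   unsigned x unsigned */
    result += ((int32_t)ah * bh) * 65536;         /* muls:  signed x signed */

    /* mulsu: each cross product is a signed 16 bit value. Because it is added
     * one byte up, its sign has to reach the top byte of the result. */
    int32_t cross_a = (int32_t)ah * bl;
    int32_t cross_b = (int32_t)bh * al;
    result += cross_a * 256;
    result += cross_b * 256;

    assert(result == (int32_t)a * b);             /* agrees with a widened multiply */
    return result;
}
```

Multiplication by 256 and 65536 is used instead of a left shift because shifting a negative value left is undefined in C.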

yes, don't use ser or clr or sbrc anywhere

Ok,

I am having difficulty getting gcc to emit code for 16x16 to 32 bit multiplication.

You would expect it to sign or zero extend and then call mulsi3. It does not. Instead it multiplies with a 16 bit result and then sign or zero extends that to 32 bits.

Maybe there is something special about the size of an int in C that allows for this?

In the 8 bit case it is different. Let me know if you can give me test code where gcc zero extends both operands then calls mulsi3. Thanks

The first post in the thread shows you how to do it.
Dave Raymond

No, that is different. It has to call mulsi in that case, the same as "(int)x*(int)y" calls muldi. For some reason:

```
uint16_t test(uint8_t x, uint8_t y)
{
    return x * y;
}
```

this code generates a single mul instruction, and makes use of the 8x8 to 16 bit pattern. If you try the same thing for 16 bits to 32 bits, it no longer hits this pattern.

Actually, my first post wasn't quite right. Both the constructs:
lc = (long) ia*ib;
and
lc = (long)ia * (long)ib;
will sign extend and then call __mulsi to get the full 32x32 multiply done.

The construct lc = (long) (ia*ib); multiplies out to only 16 bits and then sign extends that result.

I think there are a couple of C-isms going on here. If I remember right, the operands of an operation are promoted to the larger of their types, and then the operation is performed.

In the first construct "(long)ia" is 32 bits, so ib is promoted to 32 and then the multiply is performed. In the second construct I've explicitly promoted both ia and ib to 32 bits. In the 3rd construct, (ia*ib) is 16 bits so the multiply is done on that basis, then the result is promoted because of the cast to a long.
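
Here is a small sketch of the difference between the constructs, written so it behaves the same on a PC where int is 32 bits. The function names are invented for this example. `narrow_then_widen` models the wrap-around to 16 bits that was observed with avr-gcc. Formally, overflowing a 16 bit signed int is undefined behaviour in C, so that part is an assumption about what the target does.

```c
#include <assert.h>
#include <stdint.h>

/* Models lc = (long)ia * ib; and lc = (long)ia * (long)ib;
 * the operands are widened first, so all 32 bits of the product survive. */
static int32_t widen_then_multiply(int16_t ia, int16_t ib)
{
    return (int32_t)ia * ib;
}

/* Models lc = (long)(ia * ib); on a target with 16 bit int:
 * only the low 16 bits of the product are kept, then sign extended. */
static int32_t narrow_then_widen(int16_t ia, int16_t ib)
{
    uint32_t full = (uint32_t)((int32_t)ia * ib);
    uint16_t low16 = (uint16_t)(full & 0xFFFFu);

    /* Reinterpret the low 16 bits as a signed value. */
    return (low16 & 0x8000u) ? (int32_t)low16 - 65536 : (int32_t)low16;
}

/* Usage example: 300 * 300 = 90000 does not fit in 16 bits. */
static void check_constructs(void)
{
    assert(widen_then_multiply(300, 300) == 90000);
    assert(narrow_then_widen(300, 300) == 24464);   /* 90000 - 65536 */
    assert(narrow_then_widen(100, 100) == 10000);   /* small products agree */
}
```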

I think your 8-bit example does the full 8X8 to 16 bit multiply because C favors ints. So in (charA * charB), both chars get promoted to ints before the operation.

Any C experts out there? Do I have this right or am I all wet?

bldrcowboy wrote:

I think your 8-bit example does the full 8X8 to 16 bit multiply because C favors ints. So in (charA * charB), both chars get promoted to ints before the operation.

Any C experts out there? Do I have this right or am I all wet?

Exactly. I can make it do the optimal multiplication, but it won't do you any good because gcc will never pick that pattern. I need test code that triggers it, and I don't think you can write that in C.

On another note.. I got gcc to emit "mulsu" when multiplying two 8 bit values where one is signed and the other is not.
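
For reference, the 8 bit cases talked about here might look like this in source form. It is a sketch with made-up function names. It shows only the C semantics (both operands are promoted to int before the multiply), and whether a given avr-gcc version turns these into a single `mul` or `mulsu` has not been checked here.

```c
#include <assert.h>
#include <stdint.h>

/* unsigned 8 x unsigned 8 -> 16. One operand is cast to uint16_t so the
 * product cannot overflow a 16 bit signed int (255 * 255 = 65025). */
static uint16_t mul_u8u8(uint8_t x, uint8_t y)
{
    return (uint16_t)((uint16_t)x * y);
}

/* signed 8 x unsigned 8 -> signed 16. The product range is -32640..32385,
 * so it always fits in a 16 bit int after promotion. */
static int16_t mul_s8u8(int8_t x, uint8_t y)
{
    return (int16_t)(x * y);
}

/* Usage example with values that need more than 8 bits of result. */
static void check_promotions(void)
{
    assert(mul_u8u8(200, 200) == 40000u);
    assert(mul_s8u8(-1, 200) == -200);
}
```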

Fanx allot bldrcowboy for your MultiU8X16to24. Fitted into my code perfectly.

For instance, since I write my programs in assembly only, the following code (for ATmega8) is the subroutine that I usually use to get this 32-bit result:

```
; [ 16x16 Bit Unsigned Multiplication ]
; multiplicand: r17:r16
; multiplier  : r14:r13
; result out  : r21:r20:r19:r18
; 24 cycles
M_16x16:
MUL   r17, r14              ; ah * bh
MOVW  r21:r20, r1:r0
MUL   r16, r13              ; al * bl
MOVW  r19:r18, r1:r0
MUL   r17, r13              ; ah * bl
CLR   r13