32bit X 32bit multiply


I tried to code a 32-bit by 32-bit multiply. This would normally yield a 64-bit result, but in my case I can throw away the least significant 24 bits, since I am assuming that the multiplier is a fixed-point number of the form XXXXXXXX.FFFFFFFFFFFFFFFFFFFFFFFF and the multiplicand is a 32-bit integer. The answer need only contain the integer part, not the fraction.

So I wrote
uint32_t CalNumber(uint32_t F, uint32_t K)
{
uint64_t R;

R = F * K;
return R >> 24;
}

The resulting assembler is:
155 .global CalNumber
156 .type CalNumber, @function
157 CalNumber:
158 /* prologue: frame size=0 */
159 0100 8F92 push r8
160 0102 9F92 push r9
161 0104 AF92 push r10
162 0106 BF92 push r11
163 0108 CF92 push r12
164 010a DF92 push r13
165 010c EF92 push r14
166 010e FF92 push r15
167 0110 0F93 push r16
168 0112 1F93 push r17
169 /* prologue end (size=10) */
170 0114 DC01 movw r26,r24
171 0116 CB01 movw r24,r22
172 0118 BC01 movw r22,r24
173 011a CD01 movw r24,r26
174 011c 0E94 0000 call __mulsi3
175 0120 DC01 movw r26,r24
176 0122 CB01 movw r24,r22
177 0124 4C01 movw r8,r24
178 0126 5D01 movw r10,r26
179 0128 CC24 clr r12
180 012a 08E1 ldi r16,lo8(24)
181 012c 282F mov r18,r24
182 012e 392D mov r19,r9
183 0130 4A2D mov r20,r10
184 0132 5B2D mov r21,r11
185 0134 6C2D mov r22,r12
186 0136 7C2D mov r23,r12
187 0138 8C2D mov r24,r12
188 013a 9C2D mov r25,r12
189 013c 0E94 0000 call __lshrdi3
190 0140 A22E mov r10,r18
191 0142 B32E mov r11,r19
192 0144 C42E mov r12,r20
193 0146 D52E mov r13,r21
194 0148 C601 movw r24,r12
195 014a B501 movw r22,r10
196 /* epilogue: frame size=0 */
197 014c 1F91 pop r17
198 014e 0F91 pop r16
199 0150 FF90 pop r15
200 0152 EF90 pop r14
201 0154 DF90 pop r13
202 0156 CF90 pop r12
203 0158 BF90 pop r11
204 015a AF90 pop r10
205 015c 9F90 pop r9
206 015e 8F90 pop r8
207 0160 0895 ret
208 /* epilogue end (size=11) */
209 /* function CalNumber size 49 (28) */
210 .size CalNumber, .-CalNumber
211 .comm stable,132,1

Not having the source to __mulsi3 or __lshrdi3, I'm not sure this is what I wanted, as 24 bits of the return value seem to be copied. Maybe someone more familiar with AVR assembler can comment on this.


I think your problem is that the multiply is done on two 32-bit numbers, which yields a 32-bit result. Only after the multiply is done is the result extended to 64 bits. You would need:

R = (uint64_t)F * (uint64_t)K;

though only one of the explicit casts is actually necessary. Also keep in mind that since you are shifting by 24 bits, the result will be a 40-bit number. But you are only returning a 32-bit number, so the top 8 bits will be lost.
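
To see the difference on a PC before touching the AVR, here is a minimal sketch. The function names scale_truncated and scale_widened are made up for illustration, and the assert only encodes the assumption stated in this thread that the shifted result fits in 32 bits.

```c
#include <assert.h>
#include <stdint.h>

/* The multiply is done in 32 bits; the upper half of the product is
 * already gone before the value is widened for the assignment. */
uint32_t scale_truncated(uint32_t f, uint32_t k)
{
    uint64_t r = f * k;              /* 32-bit multiply, then widen */
    return (uint32_t)(r >> 24);
}

/* Casting one operand makes the multiply itself 64 bits wide. */
uint32_t scale_widened(uint32_t f, uint32_t k)
{
    uint64_t r = (uint64_t)f * k;    /* 64-bit multiply */
    assert((r >> 56) == 0);          /* assumed: result fits in 32 bits after >> 24 */
    return (uint32_t)(r >> 24);
}
```

With f = 0x01000000 and k = 0x02000000 (2.0 in 8.24 format) the first version loses the whole product, while the second returns f doubled.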

Regards,
Steve A.

The Board helps those that help themselves.

Last Edited: Thu. Dec 14, 2006 - 06:18 PM

In my case the range of numbers is such that the result should be less than 2^32 after discarding the least significant 24 bits.

I don't see why I need to cast the two numbers as uint64_t since they were declared that way in the argument list of the function. But I'll try it and see if it makes a difference.


Quote:
I don't see why I need to cast the two numbers as uint64_t since they were declared that way in the argument list of the function.

In the code in your first post they are declared as uint32_t.

Regards,
Steve A.

The Board helps those that help themselves.


After casting the two arguments as uint64_t in the multiply statement I get the following assembler...

155 .global CalNum
156 .type CalNum, @function
157 CalNum:
158 /* prologue: frame size=8 */
159 0100 2F92 push r2
160 0102 3F92 push r3
161 0104 4F92 push r4
162 0106 5F92 push r5
163 0108 6F92 push r6
164 010a 7F92 push r7
165 010c 8F92 push r8
166 010e 9F92 push r9
167 0110 AF92 push r10
168 0112 BF92 push r11
169 0114 CF92 push r12
170 0116 DF92 push r13
171 0118 EF92 push r14
172 011a FF92 push r15
173 011c 0F93 push r16
174 011e 1F93 push r17
175 0120 CF93 push r28
176 0122 DF93 push r29
177 0124 CDB7 in r28,__SP_L__
178 0126 DEB7 in r29,__SP_H__
179 0128 2897 sbiw r28,8
180 012a 0FB6 in __tmp_reg__,__SREG__
181 012c F894 cli
182 012e DEBF out __SP_H__,r29
183 0130 0FBE out __SREG__,__tmp_reg__
184 0132 CDBF out __SP_L__,r28
185 /* prologue end (size=26) */
186 0134 DC01 movw r26,r24
187 0136 CB01 movw r24,r22
188 0138 1C01 movw r2,r24
189 013a 2D01 movw r4,r26
190 013c 6624 clr r6
191 013e 662D mov r22,r6
192 0140 762D mov r23,r6
193 0142 862D mov r24,r6
194 0144 962D mov r25,r6
195 0146 A22E mov r10,r18
196 0148 B32E mov r11,r19
197 014a C42E mov r12,r20
198 014c D52E mov r13,r21
199 014e E62C mov r14,r6
200 0150 F62C mov r15,r6
201 0152 062D mov r16,r6
202 0154 162D mov r17,r6
203 0156 222D mov r18,r2
204 0158 332D mov r19,r3
205 015a 442D mov r20,r4
206 015c 552D mov r21,r5
207 015e 0E94 0000 call __muldi3
208 0162 08E1 ldi r16,lo8(24)
209 0164 0E94 0000 call __lshrdi3
210 0168 A22E mov r10,r18
211 016a B32E mov r11,r19
212 016c C42E mov r12,r20
213 016e D52E mov r13,r21
214 0170 C601 movw r24,r12
215 0172 B501 movw r22,r10
216 /* epilogue: frame size=8 */
217 0174 2896 adiw r28,8
218 0176 0FB6 in __tmp_reg__,__SREG__
219 0178 F894 cli
220 017a DEBF out __SP_H__,r29
221 017c 0FBE out __SREG__,__tmp_reg__
222 017e CDBF out __SP_L__,r28
223 0180 DF91 pop r29
224 0182 CF91 pop r28
225 0184 1F91 pop r17
226 0186 0F91 pop r16
227 0188 FF90 pop r15
228 018a EF90 pop r14
229 018c DF90 pop r13
230 018e CF90 pop r12
231 0190 BF90 pop r11
232 0192 AF90 pop r10
233 0194 9F90 pop r9
234 0196 8F90 pop r8
235 0198 7F90 pop r7
236 019a 6F90 pop r6
237 019c 5F90 pop r5
238 019e 4F90 pop r4
239 01a0 3F90 pop r3
240 01a2 2F90 pop r2
241 01a4 0895 ret
242 /* epilogue end (size=25) */
243 /* function CalNum size 83 (32) */
244 .size CalNum, .-CalNum

WOW. I'm not sure what's going on now. Why the access to the stack pointer? Is it passing arguments on the stack to the helper multiply and shift functions?
This is looking more and more like I should just write some inline assembler code and do this a LOT more efficiently.

//Enter with copy of Arg and Konstant provided.....
//Konstant is destroyed, Arg is not.
(pseudo code)
byte Product[5]; //result goes here
byte Konstant[4]; //fixed point constant 8.24
byte Arg[4]; // int argument
byte count;

Product[4..0] = 0;
for (count=0; count < 32 ; count++){
if(!(Konstant[0] & 1)) goto NoAdd;
Add Product[1],Arg[0];
Addc Product[2],Arg[1];
Addc Product[3],Arg[2];
Addc Product[4],Arg[3];
NoAdd: //product and konstant long shift register
Clr Carry;
RRC Product[4]; //rotate right with carry
RRC Product[3];
RRC Product[2];
RRC Product[1];
RRC Product[0];
RRC Konstant[3];
RRC Konstant[2];
RRC Konstant[1];
RRC Konstant[0];
} //end loop
return Product[3..0];
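
The loop above can be modelled in C on a PC to check it against a plain 64-bit multiply. This is only a sketch: shift_add_8_24 is a made-up name, and the 40-bit Product is held in the low bits of a uint64_t. One thing the model makes visible is that the "Clr Carry" step discards any carry out of the add into Product[4]; as far as I can tell that is harmless as long as Arg is below 2^31, which the assert states as an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Bit-serial model of the pseudo code: prod stands for Product[4..0]. */
uint32_t shift_add_8_24(uint32_t arg, uint32_t konst)
{
    assert(arg < 0x80000000u);               /* assumed: no carry out of Product[4] */

    uint64_t prod = 0;                       /* only the low 40 bits are used */
    for (int count = 0; count < 32; count++) {
        if (konst & 1u)
            prod += (uint64_t)arg << 8;      /* add starting at Product[1] */
        prod &= 0xFFFFFFFFFFull;             /* a carry out of Product[4] would be lost */
        prod >>= 1;                          /* the bit leaving Product[0] is not needed */
        konst >>= 1;                         /* next multiplier bit moves to bit 0 */
    }
    return (uint32_t)prod;                   /* Product[3..0] */
}
```

For inputs that meet the assumption, the return value should equal (uint32_t)(((uint64_t)arg * konst) >> 24).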


All the push's and pop's are there only to save and restore the registers that are being used in the multiply. If you wrote inline assembler, you would have to do the same (though you could probably reduce the number of registers some).

Regards,
Steve A.

The Board helps those that help themselves.


I think this sequence:

177 0124 CDB7 in r28,__SP_L__ 
178 0126 DEB7 in r29,__SP_H__ 
179 0128 2897 sbiw r28,8 
180 012a 0FB6 in __tmp_reg__,__SREG__ 
181 012c F894 cli 
182 012e DEBF out __SP_H__,r29 
183 0130 0FBE out __SREG__,__tmp_reg__ 
184 0132 CDBF out __SP_L__,r28 

is modifying the stack pointer. The "in __tmp_reg__,__SREG__" saves the interrupt status, then the cli disables interrupts while the stack pointer is modified. But the interrupt status is restored before the low byte of the stack pointer is written. IIRC, that's because the interrupt re-enable is delayed by one instruction and the compiler takes advantage of that fact.

Yeah, the routine uses a LOT of registers. If I wrote the code inline I could use memory locations for variable storage, but would still need registers for the operations (I don't think the AVR allows memory locations to be actual math operands, this being a RISC load/store machine). I'll have to read up on the instruction set. Still, it's better than the PIC with those damn page swaps all over creation!


It's directly modifying the stack because it thinks it needs to make space for additional local variables which are too big to fit in the GPRs.

Inspecting the generated assembly, it doesn't appear to strictly be necessary in this case. But perhaps the __muldi3 function needs to store intermediate information (beyond the simple function arguments) on the stack, and for some unknowable reason it's been left up to the calling function to allocate space on the stack.

(A quick scan of the "usual suspects" in libgcc.S and in the avr-libc source code didn't turn up any mention of a __muldi3 function, so maybe it's a built-in GCC function.)

The two 64-bit multiplicands are passed into __muldi3 in (r18 up to r25), and in (r10 up to r17). In accordance with the AVR ABI, the 64-bit result is passed back in (r18 up to r25).

The memory in ([SP+1] up to [SP+8]) is probably used by the __muldi3 function internally.

Then, the 64-bit result in (r18 up to r25) is shifted to the right 24 bits by the call to __lshrdi3. (It knows to shift by 24 bits because 24 is loaded into r16.)

I'd love to get a look at the disassembly for __muldi3 so we can figure out what it's doing with the stack space. You'd probably need to fully link the code snippet into a finished program in order to catch a glimpse of its actual implementation.


Since my need is a special case, I'd like to try writing my own assembler routine to do this.
I already posted some pseudo code.
Looks like I'd need 14 registers to do the job.
I'd be multiplying a 32-bit int by a fixed-point 32-bit number (8.24). The lower 24 bits of the result
(right side of the binary point) are toast. The most significant 8 bits are not needed, as the result is expected to fit in 32 bits; any overflow would be an error.
(I'm calculating the DDS word for an AD9951. I multiply a frequency number (max value 39E06) by a constant (2^32 / 400E06), the result of which MUST fit in 32 bits.) The method is to test the least significant bit of the constant, add the freq word into the product if the bit is set, shift the product and constant right by one, and repeat 31 more times. By having storage for 5 bytes of product and adding starting at the second byte position, after 32 shifts the lower 4 bytes of the product hold the answer!
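
For reference, the constant and the tuning word can be worked out on a PC like this. It is a sketch under stated assumptions: dds_konst_8_24 and dds_word are made-up names, the constant is truncated rather than rounded, and the reference clock has to be above 2^24 Hz so that the integer part fits in 8 bits.

```c
#include <assert.h>
#include <stdint.h>

/* 2^32 / ref_clock_hz as an 8.24 fixed-point number, i.e.
 * (2^32 * 2^24) / ref_clock_hz = 2^56 / ref_clock_hz, truncated. */
uint32_t dds_konst_8_24(uint32_t ref_clock_hz)
{
    assert(ref_clock_hz > (UINT32_C(1) << 24));   /* integer part must fit in 8 bits */
    return (uint32_t)((UINT64_C(1) << 56) / ref_clock_hz);
}

/* Integer part of freq_hz * konst, where konst is in 8.24 format. */
uint32_t dds_word(uint32_t freq_hz, uint32_t konst)
{
    uint64_t p = (uint64_t)freq_hz * konst;
    assert((p >> 56) == 0);                       /* result must fit in 32 bits */
    return (uint32_t)(p >> 24);
}
```

For a 400 MHz reference the constant comes out as 180143985. Because the constant is truncated, a frequency of 100 MHz gives 1073741823, one count below the ideal 2^30.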


Here is my assembler code for the routine.
Now I only need to figure out how to write it
using inline assembler macros and how to pass
the input and output arguments so I can call it
from C.

push r2     //save registers
push r3
 .....
push r15
push r16
lds r2,K     //load K
lds r3,K+1
lds r4,K+2
lds r5,K+3
clr r6       //zero P (ldi only works on r16..r31)
mov r7,r6
mov r8,r7
mov r9,r8
mov r10,r9
lds r11,F   //load F
lds r12,F+1
lds r13,F+2
lds r14,F+3
ldi r16,32  //count (ldi needs an upper register)
again:
sbrs r2,0   //skip the rjmp if bit 0 of K is set
rjmp noadd
add r7,r11
adc r8,r12
adc r9,r13
adc r10,r14
noadd:
clc
ror r10  //shift P and K right
ror r9
ror r8
ror r7
ror r6
ror r5
ror r4
ror r3
ror r2
dec r16
brne again
//result in R6..R9
//now only need to move result back to ram
//and pop stack.....

By the way, are you linking against libm.a - that is does your compiler invocation include -lm ??

(it will if you are using an Mfile generated Makefile but it won't if you are using AVR Studio as an IDE for GCC as it makes a silly decision not to link against libm.a by default)

Cliff


from my makerules and makefile....

%elf: $(OBJS)
$(CC) $(OBJS) $(LDFLAGS) $(LIBS) -o $@
LIBS = $(LIBDIR)/nutinit.o -lnutpro -lnutos -lnutdev -lnutcrt -lnutnet -lm

So yes I am.


Here is my final version, as calcddsword.S

#include 


        .section .text
        .global CalcDdsWord
        ;;
        ;; Called from "C" uint32_t CalcDdsWord(uint32_t Freq, uint32_t Konst)
        ;; where Freq is the desired osc frequency in hz
        ;; Konst is the value of (2^^32 / RefClock) as a binary fraction
        ;; expressed as 8.24
        ;; Freq is passed in r22-r25
        ;; Konst is passed in r18-r21
        ;; result is returned in r22-r25 overwriting Freq so we will
        ;; copy Freq to r14-r17 first
        ;; r26 is available and is used as an extension to the product
        ;; r27 is available and is used as the loop counter
CalcDdsWord:
        ;; first save the call-saved registers we borrow as temporaries
        push    r14
        push    r15
        push    r16
        push    r17
        mov     r14,r22         ; copy value of Freq to temp register
        mov     r15,r23
        mov     r16,r24
        mov     r17,r25
        ldi     r26,0           ;zero product accumulator
        mov     r25,r26
        mov     r24,r26
        mov     r23,r26
        mov     r22,r26
        ldi     r27,32          ; load loop counter with 32
1:      sbrs    r18,0           ; test bit zero of Konst, if set skip rjmp
        rjmp    2f              ; no add this iteration
        add     r23,r14         ; add Freq to Product
        adc     r24,r15         ; starting at second byte position
        adc     r25,r16         ; result will be right shifted 24 bits
        adc     r26,r17         ; when done
2:      clc                     ; clear carry, shift product & Konst right
        ror     r26             ; rotate product right
        ror     r25             ; through 40 bits
        ror     r24             ; 
        ror     r23
        ror     r22
        ror     r21             ; rotate Konstant
        ror     r20             ; through 32 bits
        ror     r19
        ror     r18
        dec     r27             ; decrement loop counter
        brne    1b              ; continue with another loop
        pop     r17             ; restore temp register
        pop     r16
        pop     r15
        pop     r14
        ret                     ; done!

        .end

So I can call the function as
ddsword = CalcDdsWord(freq, konst);
Looks like I am using the correct linkage registers for passing arguments. Only have to save 4 registers on the stack!


hum...

If you break up the layout of the duplicate copy of Freq so that it lies in (for example)
r16:r17:r30:r31

then you'll only need to save two registers on the stack. (r16, r17)


Quote:
If you break up the layout of the duplicate copy of Freq so that it lies in (for example)
r16:r17:r30:r31

then you'll only need to save two registers on the stack. (r16, r17)


Hmm, didn't see that. The temp registers don't have to be contiguous. The FAQ does state that r30-r31 are considered volatile. :-) Good idea!


One thing: why am I getting this warning?

Assembling: CalcDdsWord.S
avr-gcc -c -mmcu=atmega16 -I. -x assembler-with-cpp -Wa,-adhlns=CalcDdsWord.lst,-gstabs  CalcDdsWord.S -o CalcDdsWord.o
CalcDdsWord.S:56:9: warning: no newline at end of file

There IS a newline at the end of the file (at least I did hit Enter after the .end in xemacs)


Well if you really did cut/paste the entire file in the post shown above then that ".end" looks suspiciously like it's on the very last line. It's well documented (various threads here) why GCC wants a blank line at the end of files.

Cliff


Quote:
Well if you really did cut/paste the entire file in the post shown above then that ".end" looks suspiciously like it's on the very last line. It's well documented (various threads here) why GCC wants a blank line at the end of files.

No, there were two blank lines following, but the last line had a space in it and NO CR/LF after it. (Kind of hard to see unless you select edit hex!) That was it.
Well, it was only a WARNING, not fatal, but it did bug me! Thanks!


Here's my 32-bit x 32-bit = 64-bit multiply routine (92 cycles, but I waste 2 cycles)

http://www.vfx.hu/avr/download/m...

VFX.

mult32:
		clr	R16
		mul	R21,R2
		movw	R8,R0
		clr	R10
		clr	R11
		clr	R12
		clr	R13
		clr	R14
		clr	R15
		mul	R22,R2
		add	R9,R0

Looks interesting. It looks like the contents of r1 (msb after first multiply) are lost (not added into result). The rest of the function seems to follow the desired pattern though. Nice idea for a true 32x32->64 bit function. The register use is wrong for it to be callable from C though. :-(
Note that my function does NOT return a 64 bit result on purpose....I am not really doing a true 32x32 multiply because the multiplier is really a fixed point number, and I truncate the result to an integer.


kscharf wrote:

It looks like the contents of r1 (msb after first multiply) are lost (not added into result).

The "movw" instruction copies a register pair (register and register+1) from source to destination. In this code example the source is R1:R0 and the destination is R9:R8, so R1 is not lost.

This code works well :)

VFX.


My eyes must be going. Didn't see the "w".


I modified your code so it can be called from "C". Mostly had to re-assign registers, save some registers, and copy the input arguments to working registers as the compiler expects the result returned in the same registers as the arguments were passed in.

;***************************************************
;* Multiply 32x32 -> 64 bit
;;  
;; MODIFIED to be "C" callable from Winavr gnu c
;;  Now pass arguments as follows:      
;;  Arg 1 in R25-R24-r23-r22
;;  Arg 2 in R21-R20-r19-r18
;;  return in R25-r24-r23-r22-r21-r20-r19-r18 
;;
;;  Need to use temp registers R31-r30-r27-r26 and
;;; R5-r4-r3-r2 also r16  (need to stack these on entry)
;*
;*  121 cycles (mega16) 15us@8mhz
;;; uint64_t mult32(uint32_t x, uint32_t y)
mult32:
        push    r2              ; save working registers
        push    r3              ; 
        push    r4
        push    r5
        push    r16
        clr     r16             ; clear new zero register
        mov     r31,r25         ; copy arguments to working registers
        mov     r30,r24         ; so we can pass the result in
        mov     r27,r23         ; same registers that were used to
        mov     r26,r22         ; pass the arguments in.
        mov     r5,r21
        mov     r4,r20
        mov     r3,r19
        mov     r2,r18
        ;; mult by ls digit of multiplier
        mul     r26,r2          ; a0xb0
		movw	r18,r0          ; result in p1-p0
		clr     r20             ; clear p2-p7
		clr     r21
		clr     r22
		clr     r23
		clr     r24
		clr     r25
		mul     r27,r2          ; a1xb0
		add     r19,r0          ; add at p1
		adc     r20,r1          ; add in msp at p2 with carry
		mul     r30,r2          ; a2xb0
		add     r20,r0          ; add at p2
		adc     r21,r1          ; include msp and  carry in p3
		mul     r31,r2          ; a3xb0
		add     r21,r0          ; add at p3
		adc     r22,r1          ; include msp and carry in p4
        ;; mult by 2nd digit of multiplier
		mul     r26,r3          ; a0xb1
		add     r19,r0          ; add at p1
		adc     r20,r1          ; add msp and carry at p2
		adc     r21,r16         ; add carry to p3
		adc     r22,r16         ; add carry to p4
		adc     r23,r16         ; add carry to p5
		mul     r27,r3          ; a1xb1
		add     r20,r0          ; add at p2
		adc     r21,r1          ; add msp and carry at p3
		adc     r22,r16         ; add carry to p4
		adc     r23,r16         ; add carry to p5
		mul     r30,r3          ; a2xb1
		add     r21,r0          ; add at p3
		adc     r22,r1          ; add msp and carry at p4
		adc     r23,r16         ; add carry at p5
		mul     r31,r3          ; a3xb1
		add     r22,r0          ; add at p4
		adc     r23,r1          ; add msp and carry at p5
        ;; mult by 3rd digit of multiplier
		mul     r26,r4          ; a0xb2
		add     r20,r0          ; add at p2
		adc     r21,r1          ; add msp with carry at p3
		adc     r22,r16         ; add carry to p4
		adc     r23,r16         ; add carry to p5
		adc     r24,r16         ; add carry to p6
		mul     r27,r4          ; a1xb2
		add     r21,r0          ; add at p3
		adc     r22,r1          ; add msp with carry at p4
		adc     r23,r16         ; add carry to p5
		adc     r24,r16         ; add carry to p6
		mul     r30,r4          ; a2xb2
		add     r22,r0          ; add at p4
		adc     r23,r1          ; add msp with carry at p5
		adc     r24,r16         ; add carry to p6
		mul     r31,r4          ; a3xb2
		add     r23,r0          ; add at p5
		adc     r24,r1          ; add msp with carry at p6
        ;; mult by 4th digit of multiplier
		mul     r26,r5          ; a0xb3
		add     r21,r0          ; add at p3
		adc     r22,r1          ; add msp with carry at p4
		adc     r23,r16         ; add carry to p5
		adc     r24,r16         ; add carry to p6
		adc     r25,r16         ; add carry to p7
		mul     r27,r5          ; a1xb3
		add     r22,r0          ; add at p4
		adc     r23,r1          ; add msp with carry to p5
		adc     r24,r16         ; add carry to p6
		adc     r25,r16         ; add carry to p7
		mul     r30,r5          ; a2xb3
		add     r23,r0          ; add at p5
		adc     r24,r1          ; add msp and carry at p6
		adc     r25,r16         ; add carry to p7
		mul     r31,r5          ; a3xb3
		add     r24,r0          ; add at p6
		adc     r25,r1          ; add msp and carry at p7
        ;; restore registers
        pop     r16
        pop     r5
        pop     r4
        pop     r3
        pop     r2
        clr     r1              ; clear zero register
		ret
        
        .end
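
A byte-wise C model of the same scheme can be handy for checking the assembler against known values on a PC. This is a sketch, not a translation of the routine above: mult32_bytes is a made-up name, and the carry is rippled in a loop rather than with a fixed number of adc instructions per partial product.

```c
#include <assert.h>
#include <stdint.h>

/* 32x32 -> 64 built from sixteen 8x8 -> 16 products, which is the
 * operation the AVR mul instruction provides. */
uint64_t mult32_bytes(uint32_t x, uint32_t y)
{
    uint8_t p[8] = {0};                          /* p[0] is the least significant byte */

    for (int j = 0; j < 4; j++) {                /* byte j of the multiplier */
        uint8_t b = (uint8_t)(y >> (8 * j));
        for (int i = 0; i < 4; i++) {            /* byte i of the multiplicand */
            uint8_t a = (uint8_t)(x >> (8 * i));
            unsigned carry = (unsigned)a * b;    /* 16-bit partial product */
            for (int k = i + j; k < 8 && carry != 0; k++) {
                carry += p[k];                   /* add at p[i+j], ripple upward */
                p[k] = (uint8_t)carry;
                carry >>= 8;
            }
        }
    }

    uint64_t r = 0;
    for (int k = 7; k >= 0; k--)                 /* reassemble p7..p0 */
        r = (r << 8) | p[k];
    assert(r == (uint64_t)x * y);                /* cross-check against the compiler */
    return r;
}
```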

kscharf wrote:

mov r31,r25 ; copy arguments to working registers
mov r30,r24 ; so we can pass the result in
mov r27,r23 ; same registers that were used to
mov r26,r22 ; pass the arguments in.
mov r5,r21
mov r4,r20
mov r3,r19
mov r2,r18

We can save 4 cycles and 8 bytes:

movw R30,R24
movw R26,R22
movw R4,R20
movw R2,R18

and we save additional 2 cycles and 4 bytes:
change this:

clr r20
clr r21
clr r22
clr r23
clr r24
clr r25

to:
clr r20
clr r21
movw R22,R20
movw R24,R20

Sum: shorter by 6 cycles and 12 bytes

VFX.


Neat code bumming!
Now, since what I REALLY wanted was a function to multiply an 8.24 binary fraction by an unsigned long int, where the integer part of the result will never be larger than 32 bits, I did a little rearranging of the registers again (along with your code bumming). This function multiplies a 32-bit int by an 8.24 binary fraction and returns a 32-bit int. The 24-bit fractional part and the upper 8 bits of the result (impossible for the range of numbers I expect to feed this) are truncated. Result returned in R25-R22.

mult3232:
        push    r2              ; save working registers
        push    r3              ; 
        push    r4
        push    r5
        movw     r30,r24         ; copy arguments to working registers
        movw     r26,r22         ; 
        movw     r4,r20          ; 
        movw     r2,r18          ; 
;;;
;;;                   (r31 r30 r27 r26) a[0-3]
;;;               X   ( r5  r4  r3  r2) b[0-3]
;;; -----------------------------------
;;; (MSB r25 r24 r23 r22 . r20 r19 r18) p[0-6]
;;;
;;; MSB thrown out, return r25-r22, ignore fraction (r20-r18)
;;; 
        ;; mult by ls digit of multiplier
        mul     r26,r2          ; a0xb0
		movw	r18,r0          ; result in p1-p0
		clr     r20             ; clear p2-p6
		clr     r21             ; using r21 as zero register
        movw    r22,r20         ; clear r22,r23
        movw    r24,r20         ; clear r24,r25
		mul     r27,r2          ; a1xb0
		add     r19,r0          ; add at p1
		adc     r20,r1          ; add in msp at p2 with carry
		mul     r30,r2          ; a2xb0
		add     r20,r0          ; add at p2
		adc     r22,r1          ; include msp and  carry in p3
		mul     r31,r2          ; a3xb0
		add     r22,r0          ; add at p3
		adc     r23,r1          ; include msp and carry in p4
        ;; mult by 2nd digit of multiplier
		mul     r26,r3          ; a0xb1
		add     r19,r0          ; add at p1
		adc     r20,r1          ; add msp and carry at p2
		adc     r22,r21         ; add carry to p3
		adc     r23,r21         ; add carry to p4
		adc     r24,r21         ; add carry to p5
		mul     r27,r3          ; a1xb1
		add     r20,r0          ; add at p2
		adc     r22,r1          ; add msp and carry at p3
		adc     r23,r21         ; add carry to p4
		adc     r24,r21         ; add carry to p5
		mul     r30,r3          ; a2xb1
		add     r22,r0          ; add at p3
		adc     r23,r1          ; add msp and carry at p4
		adc     r24,r21         ; add carry at p5
		mul     r31,r3          ; a3xb1
		add     r23,r0          ; add at p4
		adc     r24,r1          ; add msp and carry at p5
        ;; mult by 3rd digit of multiplier
		mul     r26,r4          ; a0xb2
		add     r20,r0          ; add at p2
		adc     r22,r1          ; add msp with carry at p3
		adc     r23,r21         ; add carry to p4
		adc     r24,r21         ; add carry to p5
		adc     r25,r21         ; add carry to p6
		mul     r27,r4          ; a1xb2
		add     r22,r0          ; add at p3
		adc     r23,r1          ; add msp with carry at p4
		adc     r24,r21         ; add carry to p5
		adc     r25,r21         ; add carry to p6
		mul     r30,r4          ; a2xb2
		add     r23,r0          ; add at p4
		adc     r24,r1          ; add msp with carry at p5
		adc     r25,r21         ; add carry to p6
		mul     r31,r4          ; a3xb2
		add     r24,r0          ; add at p5
		adc     r25,r1          ; add msp with carry at p6
        ;; mult by 4th digit of multiplier
		mul     r26,r5          ; a0xb3
		add     r22,r0          ; add at p3
		adc     r23,r1          ; add msp with carry at p4
		adc     r24,r21         ; add carry to p5
		adc     r25,r21         ; add carry to p6
		mul     r27,r5          ; a1xb3
		add     r23,r0          ; add at p4
		adc     r24,r1          ; add msp with carry to p5
		adc     r25,r21         ; add carry to p6
		mul     r30,r5          ; a2xb3
		add     r24,r0          ; add at p5
		adc     r25,r1          ; add msp and carry at p6
		mul     r31,r5          ; a3xb3
		add     r25,r0          ; add at p6
        ;; restore registers
        pop     r5
        pop     r4
        pop     r3
        pop     r2
        clr     r1              ; clear zero register
		ret

Very old post, but it is worth adding a note about saving a few bytes and clock cycles in multi-byte multiplication.

 

When multiplying A:B:C:D by E:F:G:H to give J:K:L:M:N:O:P:Q, if you follow the traditional pencil-and-paper multiplication scheme, you need to propagate the carry after each MUL except the first one. You can save a few of those if you just do this:
 

MUL D, H
MOVW P:Q, R1:R0

MUL C, G
MOVW N:O, R1:R0

MUL B, F
MOVW L:M, R1:R0

MUL A, E
MOVW J:K, R1:R0 

Now you can start multiplying the crossed bytes with ADD, ADC, ADC, propagating the carry, etc.

I would love to see the AVR instruction set include INCC Rd (Increment on Carry Set) and DECC Rd instructions, updating the carry bit in SREG accordingly.
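
The idea can be checked with a small C model. This is a sketch with a made-up function name (mult32_preplaced); the 64-bit additions stand in for the ADD/ADC chains of the real code.

```c
#include <assert.h>
#include <stdint.h>

uint64_t mult32_preplaced(uint32_t x, uint32_t y)
{
    uint8_t a[4], b[4];                /* index 0 = least significant byte */
    for (int i = 0; i < 4; i++) {
        a[i] = (uint8_t)(x >> (8 * i));
        b[i] = (uint8_t)(y >> (8 * i));
    }

    /* D*H -> P:Q, C*G -> N:O, B*F -> L:M, A*E -> J:K.
     * Each 16-bit product gets its own byte pair, so these are plain
     * stores (the MOVW steps) with no carry to propagate. */
    uint64_t r = 0;
    for (int i = 0; i < 4; i++)
        r |= (uint64_t)((unsigned)a[i] * b[i]) << (16 * i);

    /* The 12 remaining cross products still need add-with-carry. */
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++)
            if (i != j)
                r += (uint64_t)((unsigned)a[i] * b[j]) << (8 * (i + j));

    assert(r == (uint64_t)x * y);      /* cross-check against the compiler */
    return r;
}
```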

 

Wagner Lipnharski
Orlando Florida USA


Yes, that is one choice, the most symmetrical. But there are others, any of the pre-aligned intermediate multiplies will do.

 

        A:B:C:D
    *   E:F:G:H
---------------
            H*D
          H*C
        H*B
      H*A
---------------
          G*D
        G*C
      G*B
    G*A
---------------
        F*D
      F*C
    F*B
  F*A
---------------
      E*D
    E*C
  E*B
E*A
---------------

 

So, for P:Q there is only one choice, H*D;

for N:O there are H*B, G*C and F*D;

for L:M there are G*A, F*B and E*C

and for J:K, only one choice, E*A.

 

It might save a couple of cycles to change only one register between muls, that is, a sequence like H*D, H*B, F*B and (now there is no option but to change 2 registers) E*A.
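
A quick way to convince yourself that any of these choices gives the same product is to make the seed selection a parameter. The sketch below uses made-up names (mult32_seeded, i_no, i_lm); bytes are indexed from the least significant end, so a[0] = D ... a[3] = A and b[0] = H ... b[3] = E.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* i_no picks the product stored into N:O (a[i_no] * b[2 - i_no]),
 * i_lm picks the product stored into L:M (a[i_lm] * b[4 - i_lm]). */
uint64_t mult32_seeded(uint32_t x, uint32_t y, int i_no, int i_lm)
{
    assert(i_no >= 0 && i_no <= 2);
    assert(i_lm >= 1 && i_lm <= 3);

    uint8_t a[4], b[4];
    bool seed[4][4] = {{false}};
    for (int i = 0; i < 4; i++) {
        a[i] = (uint8_t)(x >> (8 * i));
        b[i] = (uint8_t)(y >> (8 * i));
    }
    seed[0][0] = true;                 /* H*D -> P:Q, the only choice */
    seed[3][3] = true;                 /* E*A -> J:K, the only choice */
    seed[i_no][2 - i_no] = true;
    seed[i_lm][4 - i_lm] = true;

    uint64_t r = 0;
    for (int i = 0; i < 4; i++)        /* seeds: plain stores, no carries */
        for (int j = 0; j < 4; j++)
            if (seed[i][j])
                r |= (uint64_t)((unsigned)a[i] * b[j]) << (8 * (i + j));
    for (int i = 0; i < 4; i++)        /* everything else: add with carry */
        for (int j = 0; j < 4; j++)
            if (!seed[i][j])
                r += (uint64_t)((unsigned)a[i] * b[j]) << (8 * (i + j));
    return r;
}
```

Calling it with i_no = 2 and i_lm = 2 corresponds to the H*D, H*B, F*B, E*A sequence mentioned above.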

Last Edited: Tue. Mar 12, 2019 - 08:34 PM