Clock cycles for multipling 16 bits with atmega32u4?

16 posts / 0 new
Author
Message

Trying to find out how many clock cycles it takes for atmega32u4 to multiply two unsigned 16 bits integers to an 32 bits integer?
According to the atmega32u4 datasheet (http://www.atmel.com/Images/Atme...) it takes 2 cycles with the ALU, but still it is an 8 bit arm and I have hard time to believe it counts for 16 bits.

Last Edited: Sat. Sep 30, 2017 - 05:18 PM

goto666goto666 wrote:
it is an 8 bit arm

Eh???

Trying to find out how many clock cycles it takes for atmega32u4 to multiply two unsigned 16 bits integers to an 32 bits integer?

As you note, there is no 'instruction' for that.  Rather, code must work 8-bits at a time.

The general method for multiplying multi-byte operands is the same as for pencil-and-paper multiplication in decimal.

For two 2-byte unsigned operands:

• multiply the low bytes of each operand together, get a 16-bit result,
• move that into lower 16-bits of a 32-bit accumulator
• clear the higher 16-bits of the 32-bit accumulator
• multply the low byte of operand A with the high byte of operand B, get a 16-bit result
• shift that result left by 8 bits, add that to the 32-bit accumulator
• multply the high byte of operand A with the low byte of operand B, get a 16-bit result
• shift that result left by 8 bits, add that to the 32-bit accumulator
• multply the high byte of operand A with the high byte of operand B, get a 16-bit result
• shift that result left by 16 bits, add that to the 32-bit accumulator

A naive approach might be:

```; operand A in r17:16
; operand B in r19:18
; result in r23:20
; r2 used as zero

clr     r2          ; our zero          ; 1
mul     r16, r18    ; LSB A x LSB B     ; 2
movw    r20, r0     ; to accumulator    ; 1
mov     r22, r2     ; clr 2 MSB of acc  ; 1
mov     r23, r2     ; "                 ; 1
mul     r16, r19    ; LSB A x MSB B     ; 2
add     r21, r0     ; accumulate        ; 1
adc     r22, r1     ; "                 ; 1
adc     r23, r2     ; "                 ; 1
mul     r17, r18    ; MSB A x LSB B     ; 2
add     r21, r0     ; accumulate        ; 1
adc     r22, r1     ; "                 ; 1
adc     r23, r2     ; "                 ; 1
mul     r17, r19    ; MSB A x MSB B     ; 2
add     r22, r0     ; accumulate        ; 1
adc     r23, r1     ; "                 ; 1
; 20 total
```

20 cycles, but there is some opportunity to improve that.

The AVR lacks a multiply-and-accumulate instruction, so multiplying multi-byte operand incurs extra cycles for the addition of terms.

This has all been covered ad nauseam.  Search with Google or withing these fora.  Different toolchains will implement multiply in different ways.  Note that a 16x16=32 multiply is not supported by the C standard.  Binary operators will promote operands according to the rules of promotion, making them the same width and type, then perform the operation at the same width as the promoted operands.  So a 16x16 will always yield a 16.

Maybe you could tell us what your end goal is, and a bit more about your development environment (toolchain, device, etc.)

 "Experience is what enables you to recognise a mistake the second time you make it." "Good judgement comes from experience.  Experience comes from bad judgement." "When you hear hoofbeats, think horses, not unicorns." "Fast.  Cheap.  Good.  Pick two." "Read a lot.  Write a lot." "We see a lot of arses on handlebars around here." - [J Ekdahl]

Last Edited: Sat. Sep 30, 2017 - 06:26 PM

or examine the library source ...

Thank joeymorin for that detaiied answer!

I am implementing a lowpass filter function to make is sweep between values . It is for handling key slides in a synth.

```void glide(uint16_t in, uint16_t & out)
{
static uint16_t accumilator = 0;
else if(in==out) return;
out = ((uint32_t)accumilator * (1024 - factor)) >> 10;
accumilator = accumilator - out + in;
}```

The code does not work after I tested more thoroughly. for lower values on variable 'factor'  (0-1024). 16 bits multiply does not give 32 bits, I could know..

Need to find another solution where I can scale 'accumilator' linear.

Working in arduino 1.8.2, using additional manager url : https://adafruit.github.io/ardui... to get arduino leornardo board with midi.

Last Edited: Sat. Sep 30, 2017 - 08:27 PM

```void glide(uint16_t in, uint16_t & out)
{
static uint16_t accumilator = 0;
else if(in==out) return;
out = ((uint32_t)accumilator * (1024 - factor)) >> 10;
accumilator = accumilator - out + in;
}```

Huh?  Are those typos?  Please tell me you're not retyping code into a post.  Never do that.  You will get it wrong, and we will chase our tails.  Only ever copy/paste code from your IDE into a post.

OK, so you're maintaining a 16-bit accumulator (not accumilator), then you cast it to 32-bit before multiplying with a term which could be anywhere from 1 to 1024.  That's fine, it will fit into 32 bits no problem.  Then you divide that result by 1024, so that result will never be any larger than the original value of your accumulator, which is good because you're assigning that result to the 16-bit argument out.

However, you then adjust accumulator by subtracting the (new) value of out and adding the value of in.

Let's go over what happens when factor is small, say 0:  out will equal accumulator, and then accumulator will equal in.  Is that what you want?

Say what is happening.  Say what you want to happen.  You say you want to scale accumulator.  Looks like that's what you're doing.

So why are you concerned about cycles?

 "Experience is what enables you to recognise a mistake the second time you make it." "Good judgement comes from experience.  Experience comes from bad judgement." "When you hear hoofbeats, think horses, not unicorns." "Fast.  Cheap.  Good.  Pick two." "Read a lot.  Write a lot." "We see a lot of arses on handlebars around here." - [J Ekdahl]

Joey, I agree the orphaned else looks wrong but why is the & out reference parameter wrong? Surely the very fact it is "out" calls for either a C pointer or a C++ reference?

As I've said a few times in the past, I know enough C++ to get my face slapped at a party ;-)

 "Experience is what enables you to recognise a mistake the second time you make it." "Good judgement comes from experience.  Experience comes from bad judgement." "When you hear hoofbeats, think horses, not unicorns." "Fast.  Cheap.  Good.  Pick two." "Read a lot.  Write a lot." "We see a lot of arses on handlebars around here." - [J Ekdahl]

avr-gcc will issue a library call for this, hence there is extra penalty.  For ATmega32u4

```#include <stdint.h>

uint32_t mul (uint16_t a, uint16_t b)
{
return (uint32_t) a * b;
}
```

will turn into something like asm:

```mul:
movw r18,r24
movw r26,r22
call __umulhisi3
ret
```

disassembly:

```0000008c <mul>:
8c:	9c 01       	movw	r18, r24
8e:	db 01       	movw	r26, r22
90:	0c c0       	rjmp	.+24     	; 0xaa <__umulhisi3>

...

000000aa <__umulhisi3>:
aa:	a2 9f       	mul	r26, r18
ac:	b0 01       	movw	r22, r0
ae:	b3 9f       	mul	r27, r19
b0:	c0 01       	movw	r24, r0
b2:	a3 9f       	mul	r26, r19
b4:	70 0d       	add	r23, r0
b6:	81 1d       	adc	r24, r1
b8:	11 24       	eor	r1, r1
ba:	91 1d       	adc	r25, r1
bc:	b2 9f       	mul	r27, r18
be:	70 0d       	add	r23, r0
c0:	81 1d       	adc	r24, r1
c2:	11 24       	eor	r1, r1
c4:	91 1d       	adc	r25, r1
c6:	08 95       	ret
```

So you can estimate the worst case execution time (WCET) under the assumption that the [R]JUMP might be a [R]CALL.  If a register is multiplied with a constant, execution time might increase or decrease.  For example, mul with 0x8000 will lead to a slow shift loop even with -O2.  One ugly hack around this is to take away from gcc the knowledge about the value of the constant like so:

```uint32_t mul (uint16_t a)
{
uint16_t b = 0x8000;
__asm ("" : "+r" (b));
return (uint32_t) a * b;
}
```

avrfreaks does not support Opera. Profile inactive.

A quick way to figure out how many cycles it takes is to run the code with the simulator. The simulator in Atmel Studio is based on the RTL code and is cycle accurate.

Put breakpoints at the start and the end of the multiplication routine, run the simulation and notice the cycle counter at the start and the end of the multiplication (or reset the cycle counter at the start).

>> A quick way to figure out how many cycles it takes is to run the code with the simulator.

This will yield a lower bound, not an upper bound like WCET.

avrfreaks does not support Opera. Profile inactive.

A quick, empirical, way to figure out how long it takes (time, rather than cycles) is to toggle a pin at start & end of the process - then just measure the time.

I learned some assembler this past days with help of this forum.

Decided that 'factor' variable should be a midi control value eg max 2^7 -1.

Then I could reduce the calculation

`((uint32_t)accumilator * (1024 - factor)) >> 10;`

To this

```inline uint16_t mulu8x16div8_16(uint8_t uint8in, uint16_t uint16in)
{
uint16_t res;
asm volatile (
"clr %B0 \n\t"
"mul %A1, %A2 \n\t"
"mov %A0, __zero_reg__ \n\t"
"mul %A1, %B2 \n\t"
"clr r1 \n\t"
:
"=&r" (res)
:
"a" (uint8in),
"a" (uint16in)
);
return res;
}```

disassembly:

36a:       28 2f           mov     r18, r24
36c:       99 27           eor     r25, r25
36e:       26 9f           mul     r18, r22
370:       81 2d           mov     r24, r1
372:       27 9f           mul     r18, r23
374:       80 0d           add     r24, r0
376:       91 1d           adc     r25, r1
378:       11 24           eor     r1, r1

Instead of umulhisi, smaller assembler and constant execution time, a clear improvement.

A quick way to figure out how many cycles it takes is to run the code with the simulator. The simulator in Atmel Studio is based on the RTL code and is cycle accurate.

I am running Linux.