Trying to find out how many clock cycles it takes for atmega32u4 to multiply two unsigned 16 bits integers to an 32 bits integer?
According to the atmega32u4 datasheet (http://www.atmel.com/Images/Atme...) it takes 2 cycles with the ALU, but still it is an 8 bit arm and I have hard time to believe it counts for 16 bits.
Where can I get more in depth information about this matter.? Tried Google, no luck.
Clock cycles for multipling 16 bits with atmega32u4?
it is an 8 bit arm
Eh???
Trying to find out how many clock cycles it takes for atmega32u4 to multiply two unsigned 16 bits integers to an 32 bits integer?
As you note, there is no 'instruction' for that. Rather, code must work 8-bits at a time.
The general method for multiplying multi-byte operands is the same as for pencil-and-paper multiplication in decimal.
For two 2-byte unsigned operands:
- multiply the low bytes of each operand together, get a 16-bit result,
- move that into lower 16-bits of a 32-bit accumulator
- clear the higher 16-bits of the 32-bit accumulator
- multply the low byte of operand A with the high byte of operand B, get a 16-bit result
- shift that result left by 8 bits, add that to the 32-bit accumulator
- multply the high byte of operand A with the low byte of operand B, get a 16-bit result
- shift that result left by 8 bits, add that to the 32-bit accumulator
- multply the high byte of operand A with the high byte of operand B, get a 16-bit result
- shift that result left by 16 bits, add that to the 32-bit accumulator
A naive approach might be:
; operand A in r17:16 ; operand B in r19:18 ; result in r23:20 ; r2 used as zero clr r2 ; our zero ; 1 mul r16, r18 ; LSB A x LSB B ; 2 movw r20, r0 ; to accumulator ; 1 mov r22, r2 ; clr 2 MSB of acc ; 1 mov r23, r2 ; " ; 1 mul r16, r19 ; LSB A x MSB B ; 2 add r21, r0 ; accumulate ; 1 adc r22, r1 ; " ; 1 adc r23, r2 ; " ; 1 mul r17, r18 ; MSB A x LSB B ; 2 add r21, r0 ; accumulate ; 1 adc r22, r1 ; " ; 1 adc r23, r2 ; " ; 1 mul r17, r19 ; MSB A x MSB B ; 2 add r22, r0 ; accumulate ; 1 adc r23, r1 ; " ; 1 ; 20 total
20 cycles, but there is some opportunity to improve that.
The AVR lacks a multiply-and-accumulate instruction, so multiplying multi-byte operand incurs extra cycles for the addition of terms.
This has all been covered ad nauseam. Search with Google or withing these fora. Different toolchains will implement multiply in different ways. Note that a 16x16=32 multiply is not supported by the C standard. Binary operators will promote operands according to the rules of promotion, making them the same width and type, then perform the operation at the same width as the promoted operands. So a 16x16 will always yield a 16.
Maybe you could tell us what your end goal is, and a bit more about your development environment (toolchain, device, etc.)
or examine the library source ...
Thank joeymorin for that detaiied answer!
I am implementing a lowpass filter function to make is sweep between values . It is for handling key slides in a synth.
void glide(uint16_t in, uint16_t & out) { static uint16_t accumilator = 0; uint16_t factor = analogRead(KNOB_1_PIN); else if(in==out) return; out = ((uint32_t)accumilator * (1024 - factor)) >> 10; accumilator = accumilator - out + in; }
The code does not work after I tested more thoroughly. for lower values on variable 'factor' (0-1024). 16 bits multiply does not give 32 bits, I could know..
Need to find another solution where I can scale 'accumilator' linear.
Working in arduino 1.8.2, using additional manager url : https://adafruit.github.io/ardui... to get arduino leornardo board with midi.
Accidentally answered your questions on another comment. See my reply above.
void glide(uint16_t in, uint16_t & out) { static uint16_t accumilator = 0; uint16_t factor = analogRead(KNOB_1_PIN); else if(in==out) return; out = ((uint32_t)accumilator * (1024 - factor)) >> 10; accumilator = accumilator - out + in; }
Huh? Are those typos? Please tell me you're not retyping code into a post. Never do that. You will get it wrong, and we will chase our tails. Only ever copy/paste code from your IDE into a post.
OK, so you're maintaining a 16-bit accumulator (not accumilator), then you cast it to 32-bit before multiplying with a term which could be anywhere from 1 to 1024. That's fine, it will fit into 32 bits no problem. Then you divide that result by 1024, so that result will never be any larger than the original value of your accumulator, which is good because you're assigning that result to the 16-bit argument out.
However, you then adjust accumulator by subtracting the (new) value of out and adding the value of in.
Let's go over what happens when factor is small, say 0: out will equal accumulator, and then accumulator will equal in. Is that what you want?
Say what is happening. Say what you want to happen. You say you want to scale accumulator. Looks like that's what you're doing.
So why are you concerned about cycles?
Joey, I agree the orphaned else looks wrong but why is the & out reference parameter wrong? Surely the very fact it is "out" calls for either a C pointer or a C++ reference?
As I've said a few times in the past, I know enough C++ to get my face slapped at a party ;-)
avr-gcc will issue a library call for this, hence there is extra penalty. For ATmega32u4
#include <stdint.h> uint32_t mul (uint16_t a, uint16_t b) { return (uint32_t) a * b; }
will turn into something like asm:
mul: movw r18,r24 movw r26,r22 call __umulhisi3 ret
disassembly:
0000008c <mul>: 8c: 9c 01 movw r18, r24 8e: db 01 movw r26, r22 90: 0c c0 rjmp .+24 ; 0xaa <__umulhisi3> ... 000000aa <__umulhisi3>: aa: a2 9f mul r26, r18 ac: b0 01 movw r22, r0 ae: b3 9f mul r27, r19 b0: c0 01 movw r24, r0 b2: a3 9f mul r26, r19 b4: 70 0d add r23, r0 b6: 81 1d adc r24, r1 b8: 11 24 eor r1, r1 ba: 91 1d adc r25, r1 bc: b2 9f mul r27, r18 be: 70 0d add r23, r0 c0: 81 1d adc r24, r1 c2: 11 24 eor r1, r1 c4: 91 1d adc r25, r1 c6: 08 95 ret
So you can estimate the worst case execution time (WCET) under the assumption that the [R]JUMP might be a [R]CALL. If a register is multiplied with a constant, execution time might increase or decrease. For example, mul with 0x8000 will lead to a slow shift loop even with -O2. One ugly hack around this is to take away from gcc the knowledge about the value of the constant like so:
uint32_t mul (uint16_t a) { uint16_t b = 0x8000; __asm ("" : "+r" (b)); return (uint32_t) a * b; }
A quick way to figure out how many cycles it takes is to run the code with the simulator. The simulator in Atmel Studio is based on the RTL code and is cycle accurate.
Put breakpoints at the start and the end of the multiplication routine, run the simulation and notice the cycle counter at the start and the end of the multiplication (or reset the cycle counter at the start).
>> A quick way to figure out how many cycles it takes is to run the code with the simulator.
This will yield a lower bound, not an upper bound like WCET.
A quick, empirical, way to figure out how long it takes (time, rather than cycles) is to toggle a pin at start & end of the process - then just measure the time.
My bad!! my bad. Feels assamed.
I learned some assembler this past days with help of this forum.
Decided that 'factor' variable should be a midi control value eg max 2^7 -1.
Then I could reduce the calculation
((uint32_t)accumilator * (1024 - factor)) >> 10;
To this
inline uint16_t mulu8x16div8_16(uint8_t uint8in, uint16_t uint16in) { uint16_t res; asm volatile ( "clr %B0 \n\t" "mul %A1, %A2 \n\t" "mov %A0, __zero_reg__ \n\t" "mul %A1, %B2 \n\t" "add %A0, __tmp_reg__ \n\t" "adc %B0, __zero_reg__ \n\t" "clr r1 \n\t" : "=&r" (res) : "a" (uint8in), "a" (uint16in) ); return res; }
disassembly:
36a: 28 2f mov r18, r24
36c: 99 27 eor r25, r25
36e: 26 9f mul r18, r22
370: 81 2d mov r24, r1
372: 27 9f mul r18, r23
374: 80 0d add r24, r0
376: 91 1d adc r25, r1
378: 11 24 eor r1, r1
Instead of umulhisi, smaller assembler and constant execution time, a clear improvement.
A quick way to figure out how many cycles it takes is to run the code with the simulator. The simulator in Atmel Studio is based on the RTL code and is cycle accurate.
I am running Linux.
A quick, empirical, way to figure out how long it takes (time, rather than cycles) is to toggle a pin at start & end of the process - then just measure the time.
indeed I often do this with IRQ's, not only to see how long they take but also (if I have enough pins) to confirm that they are all actually firing.
for those who do not yet have a scope or logic analyser bitscope make a nice cheap unit that works with Linux (as well as mac & Window$)