Clock cycles for multipling 16 bits with atmega32u4?

Go To Last Post
16 posts / 0 new
Author
Message
#1
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Trying to find out how many clock cycles it takes for atmega32u4 to multiply two unsigned 16 bits integers to an 32 bits integer?
According to the atmega32u4 datasheet (http://www.atmel.com/Images/Atme...) it takes 2 cycles with the ALU, but still it is an 8 bit arm and I have hard time to believe it counts for 16 bits.
Where can I get more in depth information about this matter.? Tried Google, no luck.

Last Edited: Sat. Sep 30, 2017 - 05:18 PM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

goto666goto666 wrote:
it is an 8 bit arm

Eh???

 

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 1

Trying to find out how many clock cycles it takes for atmega32u4 to multiply two unsigned 16 bits integers to an 32 bits integer?

As you note, there is no 'instruction' for that.  Rather, code must work 8-bits at a time.

 

The general method for multiplying multi-byte operands is the same as for pencil-and-paper multiplication in decimal.

 

For two 2-byte unsigned operands:

  • multiply the low bytes of each operand together, get a 16-bit result,
  • move that into lower 16-bits of a 32-bit accumulator
  • clear the higher 16-bits of the 32-bit accumulator
  • multply the low byte of operand A with the high byte of operand B, get a 16-bit result
  • shift that result left by 8 bits, add that to the 32-bit accumulator
  • multply the high byte of operand A with the low byte of operand B, get a 16-bit result
  • shift that result left by 8 bits, add that to the 32-bit accumulator
  • multply the high byte of operand A with the high byte of operand B, get a 16-bit result
  • shift that result left by 16 bits, add that to the 32-bit accumulator

 

A naive approach might be:

; operand A in r17:16
; operand B in r19:18
; result in r23:20
; r2 used as zero

    clr     r2          ; our zero          ; 1
    mul     r16, r18    ; LSB A x LSB B     ; 2
    movw    r20, r0     ; to accumulator    ; 1
    mov     r22, r2     ; clr 2 MSB of acc  ; 1
    mov     r23, r2     ; "                 ; 1
    mul     r16, r19    ; LSB A x MSB B     ; 2
    add     r21, r0     ; accumulate        ; 1
    adc     r22, r1     ; "                 ; 1
    adc     r23, r2     ; "                 ; 1
    mul     r17, r18    ; MSB A x LSB B     ; 2
    add     r21, r0     ; accumulate        ; 1
    adc     r22, r1     ; "                 ; 1
    adc     r23, r2     ; "                 ; 1
    mul     r17, r19    ; MSB A x MSB B     ; 2
    add     r22, r0     ; accumulate        ; 1
    adc     r23, r1     ; "                 ; 1
                                            ; 20 total

20 cycles, but there is some opportunity to improve that.

 

The AVR lacks a multiply-and-accumulate instruction, so multiplying multi-byte operand incurs extra cycles for the addition of terms.

 

This has all been covered ad nauseam.  Search with Google or withing these fora.  Different toolchains will implement multiply in different ways.  Note that a 16x16=32 multiply is not supported by the C standard.  Binary operators will promote operands according to the rules of promotion, making them the same width and type, then perform the operation at the same width as the promoted operands.  So a 16x16 will always yield a 16.

 

Maybe you could tell us what your end goal is, and a bit more about your development environment (toolchain, device, etc.)

 

"Experience is what enables you to recognise a mistake the second time you make it."

"Good judgement comes from experience.  Experience comes from bad judgement."

"When you hear hoofbeats, think horses, not unicorns."

"Fast.  Cheap.  Good.  Pick two."

"Read a lot.  Write a lot."

"We see a lot of arses on handlebars around here." - [J Ekdahl]

 

Last Edited: Sat. Sep 30, 2017 - 06:26 PM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

or examine the library source ...

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Thank joeymorin for that detaiied answer!

I am implementing a lowpass filter function to make is sweep between values . It is for handling key slides in a synth.

 

void glide(uint16_t in, uint16_t & out)
{
  static uint16_t accumilator = 0;
  uint16_t factor = analogRead(KNOB_1_PIN);
  else if(in==out) return;
  out = ((uint32_t)accumilator * (1024 - factor)) >> 10;
  accumilator = accumilator - out + in;
}

The code does not work after I tested more thoroughly. for lower values on variable 'factor'  (0-1024). 16 bits multiply does not give 32 bits, I could know..

Need to find another solution where I can scale 'accumilator' linear.

 

Working in arduino 1.8.2, using additional manager url : https://adafruit.github.io/ardui... to get arduino leornardo board with midi.

Last Edited: Sat. Sep 30, 2017 - 08:27 PM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Accidentally answered your questions on another comment. See my reply above. 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0
void glide(uint16_t in, uint16_t & out)
{
  static uint16_t accumilator = 0;
  uint16_t factor = analogRead(KNOB_1_PIN);
  else if(in==out) return;
  out = ((uint32_t)accumilator * (1024 - factor)) >> 10;
  accumilator = accumilator - out + in;
}

Huh?  Are those typos?  Please tell me you're not retyping code into a post.  Never do that.  You will get it wrong, and we will chase our tails.  Only ever copy/paste code from your IDE into a post.

 

OK, so you're maintaining a 16-bit accumulator (not accumilator), then you cast it to 32-bit before multiplying with a term which could be anywhere from 1 to 1024.  That's fine, it will fit into 32 bits no problem.  Then you divide that result by 1024, so that result will never be any larger than the original value of your accumulator, which is good because you're assigning that result to the 16-bit argument out.

 

However, you then adjust accumulator by subtracting the (new) value of out and adding the value of in.

 

Let's go over what happens when factor is small, say 0:  out will equal accumulator, and then accumulator will equal in.  Is that what you want?

 

Say what is happening.  Say what you want to happen.  You say you want to scale accumulator.  Looks like that's what you're doing.

 

So why are you concerned about cycles?

"Experience is what enables you to recognise a mistake the second time you make it."

"Good judgement comes from experience.  Experience comes from bad judgement."

"When you hear hoofbeats, think horses, not unicorns."

"Fast.  Cheap.  Good.  Pick two."

"Read a lot.  Write a lot."

"We see a lot of arses on handlebars around here." - [J Ekdahl]

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Joey, I agree the orphaned else looks wrong but why is the & out reference parameter wrong? Surely the very fact it is "out" calls for either a C pointer or a C++ reference?
 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

As I've said a few times in the past, I know enough C++ to get my face slapped at a party ;-)

"Experience is what enables you to recognise a mistake the second time you make it."

"Good judgement comes from experience.  Experience comes from bad judgement."

"When you hear hoofbeats, think horses, not unicorns."

"Fast.  Cheap.  Good.  Pick two."

"Read a lot.  Write a lot."

"We see a lot of arses on handlebars around here." - [J Ekdahl]

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 1

avr-gcc will issue a library call for this, hence there is extra penalty.  For ATmega32u4

 

#include <stdint.h>

uint32_t mul (uint16_t a, uint16_t b)
{
    return (uint32_t) a * b;
}

will turn into something like asm:

mul:
	movw r18,r24
	movw r26,r22
	call __umulhisi3
	ret

disassembly:

0000008c <mul>:
  8c:	9c 01       	movw	r18, r24
  8e:	db 01       	movw	r26, r22
  90:	0c c0       	rjmp	.+24     	; 0xaa <__umulhisi3>

...

000000aa <__umulhisi3>:
  aa:	a2 9f       	mul	r26, r18
  ac:	b0 01       	movw	r22, r0
  ae:	b3 9f       	mul	r27, r19
  b0:	c0 01       	movw	r24, r0
  b2:	a3 9f       	mul	r26, r19
  b4:	70 0d       	add	r23, r0
  b6:	81 1d       	adc	r24, r1
  b8:	11 24       	eor	r1, r1
  ba:	91 1d       	adc	r25, r1
  bc:	b2 9f       	mul	r27, r18
  be:	70 0d       	add	r23, r0
  c0:	81 1d       	adc	r24, r1
  c2:	11 24       	eor	r1, r1
  c4:	91 1d       	adc	r25, r1
  c6:	08 95       	ret

So you can estimate the worst case execution time (WCET) under the assumption that the [R]JUMP might be a [R]CALL.  If a register is multiplied with a constant, execution time might increase or decrease.  For example, mul with 0x8000 will lead to a slow shift loop even with -O2.  One ugly hack around this is to take away from gcc the knowledge about the value of the constant like so:

uint32_t mul (uint16_t a)
{
    uint16_t b = 0x8000;
    __asm ("" : "+r" (b));
    return (uint32_t) a * b;
}

 

avrfreaks does not support Opera. Profile inactive.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

A quick way to figure out how many cycles it takes is to run the code with the simulator. The simulator in Atmel Studio is based on the RTL code and is cycle accurate.

Put breakpoints at the start and the end of the multiplication routine, run the simulation and notice the cycle counter at the start and the end of the multiplication (or reset the cycle counter at the start).

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 1

>> A quick way to figure out how many cycles it takes is to run the code with the simulator.

 

This will yield a lower bound, not an upper bound like WCET.

avrfreaks does not support Opera. Profile inactive.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 1

A quick, empirical, way to figure out how long it takes (time, rather than cycles) is to toggle a pin at start & end of the process - then just measure the time.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

My bad!! my bad. Feels assamed.

I learned some assembler this past days with help of this forum.

 

Decided that 'factor' variable should be a midi control value eg max 2^7 -1.

Then I could reduce the calculation 

((uint32_t)accumilator * (1024 - factor)) >> 10;

To this

 

inline uint16_t mulu8x16div8_16(uint8_t uint8in, uint16_t uint16in)
{
  uint16_t res;
  asm volatile (
  "clr %B0 \n\t"
  "mul %A1, %A2 \n\t"
  "mov %A0, __zero_reg__ \n\t"
  "mul %A1, %B2 \n\t"
  "add %A0, __tmp_reg__ \n\t"
  "adc %B0, __zero_reg__ \n\t"
  "clr r1 \n\t"
  :
  "=&r" (res)
  :
  "a" (uint8in),
  "a" (uint16in)
  );
  return res;
}

 

disassembly:

 

     36a:       28 2f           mov     r18, r24
     36c:       99 27           eor     r25, r25
     36e:       26 9f           mul     r18, r22
     370:       81 2d           mov     r24, r1
     372:       27 9f           mul     r18, r23
     374:       80 0d           add     r24, r0
     376:       91 1d           adc     r25, r1
     378:       11 24           eor     r1, r1

 

Instead of umulhisi, smaller assembler and constant execution time, a clear improvement.

 

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

A quick way to figure out how many cycles it takes is to run the code with the simulator. The simulator in Atmel Studio is based on the RTL code and is cycle accurate.

 

I am running Linux.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

awneil wrote:
A quick, empirical, way to figure out how long it takes (time, rather than cycles) is to toggle a pin at start & end of the process - then just measure the time.

indeed I often do this with IRQ's, not only to see how long they take but also (if I have enough pins) to confirm that they are all actually firing.

 

for those who do not yet have a scope or logic analyser  bitscope make a nice cheap unit that works with Linux (as well as mac & Window$)