## [ASM] modulus by 12 integer constant

18 posts / 0 new
Author
Message

I did a search here as I was looking for a way to use shifts and subtractions or reciprocal multiplication to divide a number by 12 for the modulus.

I have a formula for BCD that works for divide by 10.

There is only one place in the program where the mod 12 happens. Normally I would use the integer divide routines from the AVR tech notes.

Since this is only use in one place and the division is always by 12. The range of the dividend is at most 128 normally it will be 61 or 96. The quotient gets thrown away. (yes this is a music program.)

I did find some stuff her with a search on reciprocal multiplication as the target processor is a mega328 I have and use the mul instruction. I did try the 256/12 but there is not enough precision. Most examples are in C and deal with 32 bit integers. These examples are also for division rather than finding the modulus (I do not think that is called modulation though.)

Most of the examples go into theory. I am looking for more of a library template that I can drop in.

code size is not an issue but speed is.

Successive subtraction? Seems like a limited and relatively well defined case like yours would make that easy. Especially, since the dividend is only 8 bits.

Jim

Until Black Lives Matter, we do not have "All Lives Matter"!

Quote:
Successive subtraction?
I have tried in simul.

```mod12:
subi r16,12 ;number in r16
cpi  r16,12
brsh mod12
ret            ;remainder in r16```

Takes 47 cycles for number 128.

Shirley you just multiply by 22 and take the hi-byte.

```uint8_t n;
result = (n * 22) >> 8;   // 256/12 = 21.33, round up to 22
```

This is effectively LDI, MUL, MOV so it does not cost very much at all. If you keep 22 in a register, it only costs MUL, MOV

Obviously you need a Mega to have the MUL instruction.

Untested.

David.

Oops. You want the modulus and not the quotient. So I suppose you then need to subtract quotient*12 afterwards. This is probably an extra LDI, MUL, SUB. i.e. still not expensive.
Mind you, it is worth looking to see what the C compiler does for

```volatile uint8_t n;
uint8_t result = n % 12;```

Quote:
it is worth looking to see what the C compiler does
With:

```int main (void)
{
volatile uint8_t result = PINB % 12;
}```

The pertinent bits are:

```  50:	83 b1       	in	r24, 0x03	; 3
52:	6c e0       	ldi	r22, 0x0C	; 12
54:	07 d0       	rcall	.+14     	; 0x64 <__udivmodqi4>
56:	99 83       	std	Y+1, r25	; 0x01
```
```00000064 <__udivmodqi4>:
64:	99 1b       	sub	r25, r25
66:	79 e0       	ldi	r23, 0x09	; 9
68:	04 c0       	rjmp	.+8      	; 0x72 <__udivmodqi4_ep>

0000006a <__udivmodqi4_loop>:
6a:	99 1f       	adc	r25, r25
6c:	96 17       	cp	r25, r22
6e:	08 f0       	brcs	.+2      	; 0x72 <__udivmodqi4_ep>
70:	96 1b       	sub	r25, r22

00000072 <__udivmodqi4_ep>:
72:	88 1f       	adc	r24, r24
74:	7a 95       	dec	r23
76:	c9 f7       	brne	.-14     	; 0x6a <__udivmodqi4_loop>
78:	80 95       	com	r24
7a:	08 95       	ret
```

GCC uses r24 for the first parameter (PINB) and r22 for the second (12).

But as the OP says the input range is limied, code size is not an issue but speed is I wonder if a lookup table would be simplest?

multiply with 22 is to far off!
first error with 47.
mul with 171 and shift should work.

Quote:
Successive subtraction?

Binary search with successive subtraction?

```unsigned char r12a(unsigned char n)
{
if (n >= 120)
n -= 120;  //12 * 10
if (n >= 60)
n -= 60;  // 12 * 5
if (n >= 36)
n -= 36;
if (n >= 24)
n -= 24;
if (n >= 12)
n -= 12;
return n;
}```

15 cycles, isochronous, and works 0..255, I think? Compiles to:

```      cpi     r24, 0x78       ; 120
brcs    .+2
subi    r24, 0x78       ; 120
cpi     r24, 0x3C       ; 60
brcs    .+2
subi    r24, 0x3C       ; 60
cpi     r24, 0x24       ; 36
brcs    .+2
subi    r24, 0x24       ; 36
cpi     r24, 0x18       ; 24
brcs    .+2
subi    r24, 0x18       ; 24
cpi     r24, 0x0C       ; 12
brcs    .+2
subi    r24, 0x0C       ; 12
ret
```

I feel like you should be able to do something with bittests and subtracting powers of two, which would carve off some instructions. But I can't quite see it.

For a range of "only" 0..128, you could do a table lookup...

Just make table for modulo and division by 12 ? I guess 128 entries?

You can shave off another 4 cycles (only works up to dividend = 191)

```unsigned char r12b(unsigned char n)
{
if (n >= 96)
n -= 96;  //12 * 8
if (n >= 48)
n -= 48;  // 12 * 4
if (n >= 24)
n -= 24;
if (n >= 12)
n -= 12;
return n;
}
```

Yes, it takes a few cycles. You can either multiply by 171 and shift 3 or multiply by 22 and adjust for -ve.

```#include "portab_mcu.h"
#include

extern void initstdio(void);

uint8_t quotient12(uint8_t n)
{
#asm
mov r30, r26 ;n
ldi r27, 171 ;multiplier
mul r26, r27
mov r30, r1  ;quotient is in hi
lsr r30
lsr r30
lsr r30
#endasm
}

uint8_t modulo12(uint8_t n)
{
#asm
mov r30, r26 ;n
ldi r27, 171 ;multiplier 256 / 12 = 21.33 or 171 / 8
mul r26, r27
lsr r1       ;div8
lsr r1
lsr r1
ldi r27, 12
mul r1, r27  ;quotient *12
sub r30, r0  ;n -= quotient * 12
#endasm
}

uint8_t _modulo12(uint8_t n)
{
#asm
mov r30, r26 ;n
ldi r27, 22  ;multiplier 256 / 12 = 21.33 or 22 / 1
mul r26, r27
ldi r27, 12
mul r1, r27  ;quotient *12
sub r30, r0  ;n -= quotient * 12
subi r30, -12
is_plus:
#endasm
}

void main(void)
{
volatile uint8_t n;
uint8_t result, fast, quot, errors = 0;
initstdio();
printf("Hello modulo 12\r\n");
for (n = 0; n < 255; n++) {
result = n % 12;
quot = quotient12(n);
fast = modulo12(n);
if (result != fast || quot != (n / 12)) {
printf("%3d %3d\t%3d\t%3d\r\n", n, result, fast, quot);
errors++;
}
}
printf("errors = %2d\r\n", errors);
while (1);
}
```

Obviously, GCC uses different scratch registers. But the principle is much the same.

I would reckon that inlining two MULs will be a lot quicker than calling a subroutine or anything involving loops.

David.

Edit. I make modulo12() 11 cycles and _modulo12() 10 cycles.

The successive subtraction can get even faster by leaving out the compare inside the loop and just add 12 after getting negative.

```mod12:
subi r16,12      ; number in r16
bcc  mod12
ret            ;remainder in r16
```

As I see it, the loop is 3 cycles. So any more than 3 loops will be longer than using MUL.

I bet that Sparrow2 can come back with a really fast routine.

David.

If 128 is the max input value it's ok to mul with 43 (first error at 131), so Davids code can save two LSR.

But I have to say that I would go for the Binary search code because it only change one register.
And if real speed is needed use a lookup table. To avoid range check it should be 256 bytes. And to save an ADD placed on a page, so ZL is the input value.
Perhaps place (make) the table in RAM to save 1 clk, but more imported then X and Y also can be used as pointers, if that it is any help depends of the rest of the program.

Edit error in text

```LDI ZH,page
MOV ZL,r24 ; depends where input comes from perhaps not needed
LPM R24,Z
```

So 5 clk perhaps 4 if input already placed in ZL.

There is a lot of good stuff here to choose from. I like the solutions from westfw and Kleinstein for their simplicity. Most likely I will implement Kleinstein's successive subtraction loop first and measure the performance.

In practice a table lookup with Z index fits the solution the best. The processor is a mega328, I have to use the mega328 to get enough sram for the note state buffers.

There is one more interesting variation of the substraction version:
First substract something like 48 or 36 until going negative, and than add 12 until going positive again. This should be slightly faster with only one more word.

```mod12a:
subi r16,48      ; number in r16
bcc  mod12a
mod12b:
bcs  mod12b
ret            ;remainder in r16
```

Still the multiply and adjust version is likely still faster - though using more registers, which may need extra operations.

There is also a variation of westfw's way: Its faster to just check Bit instead of values like 96. So add something like

```  sbrc R16,7
subi R16,120
```

in front of the substraction code. Which is the ASM version of
IF n>=128 n-=120

Tips to speed it up:
place lookup on pages.
if you can't get index directly to ZL then try to place
it as a low byte of a register pair where the high byte is fixed to the page number (low registers if possible) so Z can be loaded with a movw instruction.
If Z has to be stored, store it with movw to a low register pair.

```uint_8 mod12[64] = {0,0,0,12,12,12,24,24,24,36,36,36,48,48,48,60,60,60,72,72,72,84,84,84,96,96,96,108,108,108,120,120,120,132,132,132,144,144,144,156,156,156,168,168,168,180,180,180,192,192,192,204,204,204,216,216,216,228,228,228,240,240,240,252};

//Demonstration of mod12 lut.
for (uint16_t i = 0; i < 256; i++) {
uint8_t value = i;
uint8_t result = value - mod12[value >> 2];
printf("mod(%d,12)=%d\n",value,result);
}
```

There's a general trick for fixed mod that we used back in the '80s.

Given the number (here it is 12) and assuming a maximum 8 bits, we can use a simple LUT.  The fastest (but most space hungry) is to use a 256 byte LUT (also necessary for odd number mods).

So, the result is n -= lut12[n];

Space saving can be done by right-shifting out the prime factors of 2.  12 = 2*2*3 which means we can shift n right 2 (the number of 2's in the prime factorisation), and that reduces the LUT to 256/4 = 64 bytes.

That gives us n -= lut12[n>>2];

Each value in the LUT is a multiple of 12, configured so that the LUT value at index 'i'  is always the next multiple up to the LUT index (or if right-shift optimised, the LUT index * 2^ shifts).

So lut12 would be [0,0,0,12,12,12,24,24,24,36,36,36,48,48,48,60,60,60,72,72,72,84,84,84,96,96,96,108,108,108,120,120,120,132,132,132,144,144,144,156,156,156,168,168,168,180,180,180,192,192,192,204,204,204,216,216,216,228,228,228,240,240,240,252]

For example, set n to 59. n >> 2 = 14. lut12[14]=48.  59-48 = 11.

Obviously if the mod constant is a power of 2, one doesn't need a LUT, one just ANDS out the significant bits. (e.g. for n % 32, we do n &= 0x1F; //0x1F = 32-1.)

The LUT method works for all microprocessors - and generally equates to 2-4 instructions. Because there are no jumps or conditionals, the clock time is constant, which is often essential when driving hardware.

Last Edited: Wed. Jan 27, 2016 - 12:57 PM