[ASM] modulus by 12 integer constant

Go To Last Post
18 posts / 0 new
Author
Message
#1
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I did a search here as I was looking for a way to use shifts and subtractions or reciprocal multiplication to divide a number by 12 for the modulus.

I have a formula for BCD that works for divide by 10.

There is only one place in the program where the mod 12 happens. Normally I would use the integer divide routines from the AVR tech notes.

Since this is only use in one place and the division is always by 12. The range of the dividend is at most 128 normally it will be 61 or 96. The quotient gets thrown away. (yes this is a music program.)

I did find some stuff her with a search on reciprocal multiplication as the target processor is a mega328 I have and use the mul instruction. I did try the 256/12 but there is not enough precision. Most examples are in C and deal with 32 bit integers. These examples are also for division rather than finding the modulus (I do not think that is called modulation though.)

Most of the examples go into theory. I am looking for more of a library template that I can drop in.

code size is not an issue but speed is.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Successive subtraction? Seems like a limited and relatively well defined case like yours would make that easy. Especially, since the dividend is only 8 bits.

Jim

 

Until Black Lives Matter, we do not have "All Lives Matter"!

 

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Quote:
Successive subtraction?
I have tried in simul.

mod12:
   subi r16,12 ;number in r16
   cpi  r16,12
   brsh mod12 
ret            ;remainder in r16

Takes 47 cycles for number 128.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Shirley you just multiply by 22 and take the hi-byte.

uint8_t n;
result = (n * 22) >> 8;   // 256/12 = 21.33, round up to 22

This is effectively LDI, MUL, MOV so it does not cost very much at all. If you keep 22 in a register, it only costs MUL, MOV

Obviously you need a Mega to have the MUL instruction.

Untested.

David.

Oops. You want the modulus and not the quotient. So I suppose you then need to subtract quotient*12 afterwards. This is probably an extra LDI, MUL, SUB. i.e. still not expensive.
Mind you, it is worth looking to see what the C compiler does for

volatile uint8_t n;
uint8_t result = n % 12;
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Quote:
it is worth looking to see what the C compiler does
With:

int main (void)
{
    volatile uint8_t result = PINB % 12;
}

The pertinent bits are:

  50:	83 b1       	in	r24, 0x03	; 3
  52:	6c e0       	ldi	r22, 0x0C	; 12
  54:	07 d0       	rcall	.+14     	; 0x64 <__udivmodqi4>
  56:	99 83       	std	Y+1, r25	; 0x01
00000064 <__udivmodqi4>:
  64:	99 1b       	sub	r25, r25
  66:	79 e0       	ldi	r23, 0x09	; 9
  68:	04 c0       	rjmp	.+8      	; 0x72 <__udivmodqi4_ep>

0000006a <__udivmodqi4_loop>:
  6a:	99 1f       	adc	r25, r25
  6c:	96 17       	cp	r25, r22
  6e:	08 f0       	brcs	.+2      	; 0x72 <__udivmodqi4_ep>
  70:	96 1b       	sub	r25, r22

00000072 <__udivmodqi4_ep>:
  72:	88 1f       	adc	r24, r24
  74:	7a 95       	dec	r23
  76:	c9 f7       	brne	.-14     	; 0x6a <__udivmodqi4_loop>
  78:	80 95       	com	r24
  7a:	08 95       	ret

GCC uses r24 for the first parameter (PINB) and r22 for the second (12).

But as the OP says the input range is limied, code size is not an issue but speed is I wonder if a lookup table would be simplest?

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

multiply with 22 is to far off!
first error with 47.
mul with 171 and shift should work.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Quote:
Successive subtraction?

Binary search with successive subtraction?

unsigned char r12a(unsigned char n)
{
    if (n >= 120)
         n -= 120;  //12 * 10
    if (n >= 60)
         n -= 60;  // 12 * 5
    if (n >= 36)
        n -= 36;
    if (n >= 24)
        n -= 24;
    if (n >= 12)
        n -= 12;
    return n;
}

15 cycles, isochronous, and works 0..255, I think? Compiles to:

      cpi     r24, 0x78       ; 120 
      brcs    .+2
      subi    r24, 0x78       ; 120
      cpi     r24, 0x3C       ; 60
      brcs    .+2
      subi    r24, 0x3C       ; 60
      cpi     r24, 0x24       ; 36
      brcs    .+2
      subi    r24, 0x24       ; 36
      cpi     r24, 0x18       ; 24
      brcs    .+2
      subi    r24, 0x18       ; 24
      cpi     r24, 0x0C       ; 12
      brcs    .+2
      subi    r24, 0x0C       ; 12
      ret

I feel like you should be able to do something with bittests and subtracting powers of two, which would carve off some instructions. But I can't quite see it.

For a range of "only" 0..128, you could do a table lookup...

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Just make table for modulo and division by 12 ? I guess 128 entries?

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

You can shave off another 4 cycles (only works up to dividend = 191)

unsigned char r12b(unsigned char n)
{
    if (n >= 96)
	n -= 96;  //12 * 8
    if (n >= 48)
	n -= 48;  // 12 * 4
    if (n >= 24)
	n -= 24;
    if (n >= 12)
	n -= 12;
    return n;
}
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Yes, it takes a few cycles. You can either multiply by 171 and shift 3 or multiply by 22 and adjust for -ve.

#include "portab_mcu.h"
#include 

extern void initstdio(void);

uint8_t quotient12(uint8_t n)
{
#asm
    mov r30, r26 ;n 
    ldi r27, 171 ;multiplier 
    mul r26, r27 
    mov r30, r1  ;quotient is in hi 
    lsr r30 
    lsr r30 
    lsr r30
#endasm
}

uint8_t modulo12(uint8_t n)
{
#asm
    mov r30, r26 ;n 
    ldi r27, 171 ;multiplier 256 / 12 = 21.33 or 171 / 8 
    mul r26, r27 
    lsr r1       ;div8 
    lsr r1 
    lsr r1 
    ldi r27, 12 
    mul r1, r27  ;quotient *12 
    sub r30, r0  ;n -= quotient * 12
#endasm
}

uint8_t _modulo12(uint8_t n)
{
#asm
    mov r30, r26 ;n 
    ldi r27, 22  ;multiplier 256 / 12 = 21.33 or 22 / 1 
    mul r26, r27 
    ldi r27, 12 
    mul r1, r27  ;quotient *12 
    sub r30, r0  ;n -= quotient * 12 
    brpl is_plus ;adjust for -ve 
    subi r30, -12 
is_plus:
#endasm
} 

void main(void)
{
    volatile uint8_t n;
    uint8_t result, fast, quot, errors = 0;
    initstdio();
    printf("Hello modulo 12\r\n");
    for (n = 0; n < 255; n++) {
        result = n % 12;
        quot = quotient12(n);
        fast = modulo12(n);
        if (result != fast || quot != (n / 12)) {
             printf("%3d %3d\t%3d\t%3d\r\n", n, result, fast, quot);
             errors++;
        }
    }
    printf("errors = %2d\r\n", errors);
    while (1);
}

Obviously, GCC uses different scratch registers. But the principle is much the same.

I would reckon that inlining two MULs will be a lot quicker than calling a subroutine or anything involving loops.

David.

Edit. I make modulo12() 11 cycles and _modulo12() 10 cycles.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

The successive subtraction can get even faster by leaving out the compare inside the loop and just add 12 after getting negative.

mod12: 
    subi r16,12      ; number in r16 
    bcc  mod12         
    subi r16,256-12  ; add 12
 ret            ;remainder in r16
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

As I see it, the loop is 3 cycles. So any more than 3 loops will be longer than using MUL.

I bet that Sparrow2 can come back with a really fast routine.

David.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

If 128 is the max input value it's ok to mul with 43 (first error at 131), so Davids code can save two LSR.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

But I have to say that I would go for the Binary search code because it only change one register.
And if real speed is needed use a lookup table. To avoid range check it should be 256 bytes. And to save an ADD placed on a page, so ZL is the input value.
Perhaps place (make) the table in RAM to save 1 clk, but more imported then X and Y also can be used as pointers, if that it is any help depends of the rest of the program.

Edit error in text
Add lookup: (placed on page)

LDI ZH,page
MOV ZL,r24 ; depends where input comes from perhaps not needed
LPM R24,Z

So 5 clk perhaps 4 if input already placed in ZL.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

There is a lot of good stuff here to choose from. I like the solutions from westfw and Kleinstein for their simplicity. Most likely I will implement Kleinstein's successive subtraction loop first and measure the performance.

In practice a table lookup with Z index fits the solution the best. The processor is a mega328, I have to use the mega328 to get enough sram for the note state buffers.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

There is one more interesting variation of the substraction version:
First substract something like 48 or 36 until going negative, and than add 12 until going positive again. This should be slightly faster with only one more word.

mod12a: 
     subi r16,48      ; number in r16 
     bcc  mod12a          
mod12b: 
     subi r16,256-12  ; add 12 
     bcs  mod12b          
  ret            ;remainder in r16

Still the multiply and adjust version is likely still faster - though using more registers, which may need extra operations.

There is also a variation of westfw's way: Its faster to just check Bit instead of values like 96. So add something like

  sbrc R16,7
  subi R16,120

in front of the substraction code. Which is the ASM version of
IF n>=128 n-=120

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Tips to speed it up:
place lookup on pages.
if you can't get index directly to ZL then try to place
it as a low byte of a register pair where the high byte is fixed to the page number (low registers if possible) so Z can be loaded with a movw instruction.
If Z has to be stored, store it with movw to a low register pair.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0
uint_8 mod12[64] = {0,0,0,12,12,12,24,24,24,36,36,36,48,48,48,60,60,60,72,72,72,84,84,84,96,96,96,108,108,108,120,120,120,132,132,132,144,144,144,156,156,156,168,168,168,180,180,180,192,192,192,204,204,204,216,216,216,228,228,228,240,240,240,252};

//Demonstration of mod12 lut.
for (uint16_t i = 0; i < 256; i++) {
	uint8_t value = i;
	uint8_t result = value - mod12[value >> 2];
	printf("mod(%d,12)=%d\n",value,result);
}

There's a general trick for fixed mod that we used back in the '80s.

Given the number (here it is 12) and assuming a maximum 8 bits, we can use a simple LUT.  The fastest (but most space hungry) is to use a 256 byte LUT (also necessary for odd number mods).

So, the result is n -= lut12[n];   

Space saving can be done by right-shifting out the prime factors of 2.  12 = 2*2*3 which means we can shift n right 2 (the number of 2's in the prime factorisation), and that reduces the LUT to 256/4 = 64 bytes.

That gives us n -= lut12[n>>2]; 

Each value in the LUT is a multiple of 12, configured so that the LUT value at index 'i'  is always the next multiple up to the LUT index (or if right-shift optimised, the LUT index * 2^ shifts).

So lut12 would be [0,0,0,12,12,12,24,24,24,36,36,36,48,48,48,60,60,60,72,72,72,84,84,84,96,96,96,108,108,108,120,120,120,132,132,132,144,144,144,156,156,156,168,168,168,180,180,180,192,192,192,204,204,204,216,216,216,228,228,228,240,240,240,252]

For example, set n to 59. n >> 2 = 14. lut12[14]=48.  59-48 = 11.

Obviously if the mod constant is a power of 2, one doesn't need a LUT, one just ANDS out the significant bits. (e.g. for n % 32, we do n &= 0x1F; //0x1F = 32-1.)

The LUT method works for all microprocessors - and generally equates to 2-4 instructions. Because there are no jumps or conditionals, the clock time is constant, which is often essential when driving hardware. 

 

 

 

Last Edited: Wed. Jan 27, 2016 - 12:57 PM