## Fastest 16-bit x 16-bit unsigned integer division algorithm for ATMEGA1284?

31 posts / 0 new
Author
Message

I am trying to create an optimized 16-bit division algorithm for the AVR ATMEGA1284.  The goal is to reduce the number of clock cycles as much as possible.

AVR INSTRUCTION SET MANUAL:
https://ww1.microchip.com/downlo...

AVR200: Multiply and Divide Routines:
https://ww1.microchip.com/downlo...

A standard shift and subtract type division algorithm suggested by Atmel/Microchip takes between 173 clock cycles and 243 clock cycles, depending on if you unroll the loop or not.

What I have so far takes a maximum of 68 clock cycles.  I ran an exhaustive test 16-hour test proving that the algorithm returns the correct result for all 2^32 combinations of inputs and outputs.  So I am not looking for any validation that the algorithm returns the correct results.  I am looking for ways to reduce either the code size, lookup table size, or number of clock cycles.

The constraints are as follows.

- The dividend is an unsigned 16-bit number passed into the algorithm in a pair of 8-bit registers.
- The divisor is an unsigned 16-bit number passed into the algorithm in a pair of 8-bit registers.
- The algorithm returns the quotient in a pair of 8-bit registers.
- The algorithm also returns the remainder in a pair of 8-bit registers.
- The algorithm code (plus any lookup tables) must consume less than 5K bytes of flash memory.
- The algorithm may return any values for division by 0.

Here is what I have so far.

```    .align 256
;Recipricol table #1, high byte.
;R1H_TBL[x] = min( high(2^16/x) / 256 , 255)
R1H_TBL:
.db 0xFF, 0xFF, 0x80, 0x55, 0x40, 0x33, 0x2A, 0x24, 0x20, 0x1C, 0x19, 0x17, 0x15, 0x13, 0x12, 0x11
.db 0x10, 0x0F, 0x0E, 0x0D, 0x0C, 0x0C, 0x0B, 0x0B, 0x0A, 0x0A, 0x09, 0x09, 0x09, 0x08, 0x08, 0x08
.db 0x08, 0x07, 0x07, 0x07, 0x07, 0x06, 0x06, 0x06, 0x06, 0x06, 0x06, 0x05, 0x05, 0x05, 0x05, 0x05
.db 0x05, 0x05, 0x05, 0x05, 0x04, 0x04, 0x04, 0x04, 0x04, 0x04, 0x04, 0x04, 0x04, 0x04, 0x04, 0x04
.db 0x04, 0x03, 0x03, 0x03, 0x03, 0x03, 0x03, 0x03, 0x03, 0x03, 0x03, 0x03, 0x03, 0x03, 0x03, 0x03
.db 0x03, 0x03, 0x03, 0x03, 0x03, 0x03, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02
.db 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02
.db 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02
.db 0x02, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01
.db 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01
.db 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01
.db 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01
.db 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01
.db 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01
.db 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01
.db 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01
;Recipricol table #1, low byte.
;R1L_TBL[x] = min( low(2^16/x) mod 256 , 255)
R1L_TBL:
.db 0xFF, 0xFF, 0x00, 0x55, 0x00, 0x33, 0xAA, 0x92, 0x00, 0x71, 0x99, 0x45, 0x55, 0xB1, 0x49, 0x11
.db 0x00, 0x0F, 0x38, 0x79, 0xCC, 0x30, 0xA2, 0x21, 0xAA, 0x3D, 0xD8, 0x7B, 0x24, 0xD3, 0x88, 0x42
.db 0x00, 0xC1, 0x87, 0x50, 0x1C, 0xEB, 0xBC, 0x90, 0x66, 0x3E, 0x18, 0xF4, 0xD1, 0xB0, 0x90, 0x72
.db 0x55, 0x39, 0x1E, 0x05, 0xEC, 0xD4, 0xBD, 0xA7, 0x92, 0x7D, 0x69, 0x56, 0x44, 0x32, 0x21, 0x10
.db 0x00, 0xF0, 0xE0, 0xD2, 0xC3, 0xB5, 0xA8, 0x9B, 0x8E, 0x81, 0x75, 0x69, 0x5E, 0x53, 0x48, 0x3D
.db 0x33, 0x29, 0x1F, 0x15, 0x0C, 0x03, 0xFA, 0xF1, 0xE8, 0xE0, 0xD8, 0xD0, 0xC8, 0xC0, 0xB9, 0xB1
.db 0xAA, 0xA3, 0x9C, 0x95, 0x8F, 0x88, 0x82, 0x7C, 0x76, 0x70, 0x6A, 0x64, 0x5E, 0x59, 0x53, 0x4E
.db 0x49, 0x43, 0x3E, 0x39, 0x34, 0x30, 0x2B, 0x26, 0x22, 0x1D, 0x19, 0x14, 0x10, 0x0C, 0x08, 0x04
.db 0x00, 0xFC, 0xF8, 0xF4, 0xF0, 0xEC, 0xE9, 0xE5, 0xE1, 0xDE, 0xDA, 0xD7, 0xD4, 0xD0, 0xCD, 0xCA
.db 0xC7, 0xC3, 0xC0, 0xBD, 0xBA, 0xB7, 0xB4, 0xB2, 0xAF, 0xAC, 0xA9, 0xA6, 0xA4, 0xA1, 0x9E, 0x9C
.db 0x99, 0x97, 0x94, 0x92, 0x8F, 0x8D, 0x8A, 0x88, 0x86, 0x83, 0x81, 0x7F, 0x7D, 0x7A, 0x78, 0x76
.db 0x74, 0x72, 0x70, 0x6E, 0x6C, 0x6A, 0x68, 0x66, 0x64, 0x62, 0x60, 0x5E, 0x5C, 0x5A, 0x58, 0x57
.db 0x55, 0x53, 0x51, 0x50, 0x4E, 0x4C, 0x4A, 0x49, 0x47, 0x46, 0x44, 0x42, 0x41, 0x3F, 0x3E, 0x3C
.db 0x3B, 0x39, 0x38, 0x36, 0x35, 0x33, 0x32, 0x30, 0x2F, 0x2E, 0x2C, 0x2B, 0x29, 0x28, 0x27, 0x25
.db 0x24, 0x23, 0x21, 0x20, 0x1F, 0x1E, 0x1C, 0x1B, 0x1A, 0x19, 0x18, 0x16, 0x15, 0x14, 0x13, 0x12
.db 0x11, 0x0F, 0x0E, 0x0D, 0x0C, 0x0B, 0x0A, 0x09, 0x08, 0x07, 0x06, 0x05, 0x04, 0x03, 0x02, 0x01
;Recipricol table #2
;R2_TBL[x] = min( 2^16/(x+256), 255)
R2_TBL:
.db 0xFF, 0xFF, 0xFE, 0xFD, 0xFC, 0xFB, 0xFA, 0xF9, 0xF8, 0xF7, 0xF6, 0xF5, 0xF4, 0xF3, 0xF2, 0xF1
.db 0xF0, 0xF0, 0xEF, 0xEE, 0xED, 0xEC, 0xEB, 0xEA, 0xEA, 0xE9, 0xE8, 0xE7, 0xE6, 0xE5, 0xE5, 0xE4
.db 0xE3, 0xE2, 0xE1, 0xE1, 0xE0, 0xDF, 0xDE, 0xDE, 0xDD, 0xDC, 0xDB, 0xDB, 0xDA, 0xD9, 0xD9, 0xD8
.db 0xD7, 0xD6, 0xD6, 0xD5, 0xD4, 0xD4, 0xD3, 0xD2, 0xD2, 0xD1, 0xD0, 0xD0, 0xCF, 0xCE, 0xCE, 0xCD
.db 0xCC, 0xCC, 0xCB, 0xCA, 0xCA, 0xC9, 0xC9, 0xC8, 0xC7, 0xC7, 0xC6, 0xC5, 0xC5, 0xC4, 0xC4, 0xC3
.db 0xC3, 0xC2, 0xC1, 0xC1, 0xC0, 0xC0, 0xBF, 0xBF, 0xBE, 0xBD, 0xBD, 0xBC, 0xBC, 0xBB, 0xBB, 0xBA
.db 0xBA, 0xB9, 0xB9, 0xB8, 0xB8, 0xB7, 0xB7, 0xB6, 0xB6, 0xB5, 0xB5, 0xB4, 0xB4, 0xB3, 0xB3, 0xB2
.db 0xB2, 0xB1, 0xB1, 0xB0, 0xB0, 0xAF, 0xAF, 0xAE, 0xAE, 0xAD, 0xAD, 0xAC, 0xAC, 0xAC, 0xAB, 0xAB
.db 0xAA, 0xAA, 0xA9, 0xA9, 0xA8, 0xA8, 0xA8, 0xA7, 0xA7, 0xA6, 0xA6, 0xA5, 0xA5, 0xA5, 0xA4, 0xA4
.db 0xA3, 0xA3, 0xA3, 0xA2, 0xA2, 0xA1, 0xA1, 0xA1, 0xA0, 0xA0, 0x9F, 0x9F, 0x9F, 0x9E, 0x9E, 0x9D
.db 0x9D, 0x9D, 0x9C, 0x9C, 0x9C, 0x9B, 0x9B, 0x9A, 0x9A, 0x9A, 0x99, 0x99, 0x99, 0x98, 0x98, 0x98
.db 0x97, 0x97, 0x97, 0x96, 0x96, 0x95, 0x95, 0x95, 0x94, 0x94, 0x94, 0x93, 0x93, 0x93, 0x92, 0x92
.db 0x92, 0x91, 0x91, 0x91, 0x90, 0x90, 0x90, 0x90, 0x8F, 0x8F, 0x8F, 0x8E, 0x8E, 0x8E, 0x8D, 0x8D
.db 0x8D, 0x8C, 0x8C, 0x8C, 0x8C, 0x8B, 0x8B, 0x8B, 0x8A, 0x8A, 0x8A, 0x89, 0x89, 0x89, 0x89, 0x88
.db 0x88, 0x88, 0x87, 0x87, 0x87, 0x87, 0x86, 0x86, 0x86, 0x86, 0x85, 0x85, 0x85, 0x84, 0x84, 0x84
.db 0x84, 0x83, 0x83, 0x83, 0x83, 0x82, 0x82, 0x82, 0x82, 0x81, 0x81, 0x81, 0x81, 0x80, 0x80, 0x80
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;ARGUMENTS:  r16, r17, r18, r19
;  r16:r17 = N (numerator)
;  r18:r19 = D (divisor)
;RETURNS:    r20, r21
;  r20:r21 (quotient)
;  r22:r23 (remainder)
;
;DESCRIPTION:  divides an unsigned 16 bit number N by unsigned 16 bit divisor D
;  Typical run time is 57 to 68 clock cycles.
;
;ALGORITHM OVERVIEW
;
;RZERO = 0;
;if(D < 256){
;  N2 = (N * ((R1H_TBL[D] << 8) + R1L_TBL[D])) >> 16;
;  P  = N2 * D
;}else{
;  D1 = (R1H_TBL[D] * D) >> 8
;  N1 = (R1H_TBL[D] * N) >> 8
;  if(D1 < 256){
;    N2 = N1 >> 8;
;  }else{
;    N2 = N1 * R2_TBL[D1 & 0xFF];
;  }
;  P = N2 * D;
;  if(P > 65535){
;    N2 = N2 - 1    ;//Decrement quotient
;    N1 = N2 - P + D;//Calculate remainder
;    return;//return quotient in N2, remainder in N1
;  }
;}
;N1 = N - P;
;if(P > N){
;  N2 = N2 - 1;//decrease quotient
;  N1 = N1 + D;//increase reamainder
;  return;//return quotient in N2, remainder in N1
;}
;if(N1 > D){
;  N2 = N2 + 1;
;  N1 = N1 - D;
;  return;//return quotient in N2, remainder in N1
;}
;
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
.def NH    = r16 .def NL    = r17
.def DH    = r18 .def DL    = r19
.def N2H   = r20 .def N2L   = r21
.def N1H   = r22 .def N1L   = r23
.def PRODL = r0  .def PRODH = r1
.def PH    = r2  .def PL    = r3
.def D1H   = r4  .def D1L   = r5
.def RZERO = r6
.def Rx    = r7

idivu_16x16:
clr RZERO                 ;1
;Check that DH is not zero
tst DH                    ;1
brne idivu_16x16_dhne   ;2
;code for D < 256
idivu_16x8:
;lookup low byte of recipricol into P.
;where P = min(2^16 / D,2^16-1)
mov zl, DL               ;1
ldi zh, high(R1L_TBL*2)  ;1
lpm PL, Z                ;3
ldi zh, high(R1H_TBL*2)  ;1
lpm PH, Z                ;3
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;calculate N2 = (P * N) >> 16
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;     NH:NL
;  X  RH:RL
;------------------------------------------
;   N2H    |   N2L    |  N1H     | dropped
;----------+----------+----------+---------
;          |          | H(PL*NL) | L(PL*NL)
;          | H(PL*NH) | L(PL*NH) |
;          | H(PH*NL) | L(PH*NL) |
; H(PH*NH) | L(PH*NH) |          |
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
mul NL , PL     ;2  PL*NL
mov N1H, PRODH  ;1  N1H <= H(PL*NL)
mul NH , PH     ;2  PH*NH
mov N2H, PRODH  ;1  N2H <= H(PH*NH)
mov N2L, PRODL  ;1  N2L <= L(PH*NH)
mul NH , PL        ;2  PL*NH
add N1H, PRODL    ;1  N1H <= H(PL*NL) + L(PL*NH)
adc N2L, PRODH    ;1  N2L <= L(PH*NH) + H(PL*NH)
adc N2H, RZERO  ;1  propagate carry to N2H
mul NL , PH     ;2  PH*NL
add N1H, PRODL  ;1  N1H <= H(PL*NL) + L(PL*NH) + L(PH*NL)
adc N2L, PRODH  ;1  N2L <= H(PH*NL) + L(PH*NH) + H(PL*NH)
adc N2H, RZERO  ;1  propagate carry to N2H
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;calculate P = N2 * DL ,note DH=0
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
mul N2L, DL              ;2
mov PL, PRODL             ;1
mov PH, PRODH            ;1
mul N2H, DL                 ;2
;code for D >= 256
idivu_16x16_dhne:
;Lookup Rx = min(256 / DH, 255)
mov zl, DH               ;1
ldi zh, high(R1H_TBL*2)  ;1
lpm Rx, Z                ;3
;D1 = (D * Rx) >> 8
mul Rx , DH              ;2
mov D1L, PRODL           ;1
mov D1H, PRODH           ;1
mul Rx , DL              ;2
;N1 = (D * Rx) >> 8
mul Rx , NH              ;2
mov N1L, PRODL           ;1
mov N1H, PRODH           ;1
mul Rx , NL              ;2
;if D1H = 0 then use Rx = 256, otherwise use table
tst D1H                  ;1
brne idivu_16x16_dxhne ;2
mov N2L, N1H             ;1
clr N2H                  ;1
rjmp idivu_16x16_checkn;2
idivu_16x16_dxhne:
;Lookup Rx = (2 ^ 16) \ (256 + D1H)
mov zl, D1L              ;1
ldi zh, high(R2_TBL*2)   ;1
lpm Rx, Z                ;3
;N2 = (N1 * R2) >> 16
mul Rx  , N1H            ;2
mov PL  , PRODL          ;1
mov N2L , PRODH          ;1
mul Rx , N1L             ;2
clr N2H                  ;1
idivu_16x16_checkn:
;Check result (it may be off by +/- 1)
;P = N2 * D
;NOTE:  N2 <=255 so NxH = 0, also P < 2^16 so we can discard upper byte of DH * NxL
mul DL , N2L             ;2
mov PL, PRODL            ;1
mov PH, PRODH            ;1
mul DH , N2L             ;2
;if multiply overflowed then...
;decrement quotient
;calculate remainder as N - P + D
subi N2L, 0x01           ;1
sbci N2H, 0x00           ;1
mov N1L, NL              ;1
mov N1H, NH              ;1
sub N1L, PL              ;1
sbc N1H, PH              ;1
rjmp idivu_16x16_end   ;2
;Adjust result up or down by 1 if needed.
;Add -P to N, with result in P
mov N1L, NL              ;1
mov N1H, NH              ;1
sub N1L, PL              ;1
sbc N1H, PH              ;1
brsh idivu_16x16_pltn  ;2
idivu_16x16_decn2:
;if P > N then decrement quotient, add to remainder
subi N2L, 0x01           ;1
sbci N2H, 0x00           ;1
rjmp idivu_16x16_end   ;2
idivu_16x16_pltn:
;test remainder to D
cp  N1L, DL              ;1
cpc N1H, DH              ;1
;if remainder < D then goto end
brlo idivu_16x16_end   ;2
;if remainder >= D then increment quotient, reduce remainder
subi N2L, 0xFF           ;1
sbci N2H, 0xFF           ;1
sub N1L, DL              ;1
sbc N1H, DH              ;1
idivu_16x16_end:
ret
.undef NH    .undef NL
.undef DH    .undef DL
.undef N2H   .undef N2L
.undef N1H   .undef N1L
.undef PRODL .undef PRODH
.undef PH    .undef PL
.undef D1H   .undef D1L
.undef RZERO
.undef Rx```

NOTE:  In my application I am using the divide algorithm to find the intersection points of lines as part of a graphics drawing algorithm.  Actually drawing the lines doesn't require division, but quickly determining their intersection points does.  The lines are generated based on data that comes in from outside the device.  Therefore I don't know the divisors ahead of time and can't just multiply by a constant reciprocal or something like that.  I really do need a general purpose divide algorithm.

If you look for worst case, then don't have code like this :

```        tst DH                    ;1
brne idivu_16x16_dhne   ;2
```

That help AVG speed, but not max time

The last 7 rows of R1H_TBL are a single value. That's a candidate for replacing with a conditional check. You save 112 bytes on the lookup table, but add ~16 bytes of code and ~6 cycles for each lookup.

What would you suggest I do instead?

I am using a 16 bit reciprocal table and a modified 8 bit reciprocal table.

For a number less than 256 I can simply use the number itself as an index into a 16-bit reciprocal table and then multiply to get within +/-1 of the correct answer.

For numbers 256 or greater I can't do that.  To directly index a 16-bit number I would need a table with 64K entries.  Secondly, the table would need to have more than 16-bits per entry to have enough resolution.  The combination of those two things would likely make the table take over 128K of flash (probably like 256K), which is way too big.

So for numbers > 255 the solution I came up with was to just use the upper byte as the index to the first reciprocal table.  The result is a number in the range 0b1_0000_0000 to 0b1_1111_1111.  Since I didn't use the lower byte as part of the indexing its not going to be the correct answer, but at least I didn't need a 65K entry table.  Having that partial result I can then multiply by a second modified reciprocal table just based on the lower byte of the normalized result.  This gets me within +/-1 of the correct answer.

But since I have to treat the two cases separately I need to test the upper byte of the divisor for 0 to find out which case I am in.

If you have any better way of doing it I am open to suggestions.

Last Edited: Thu. Feb 11, 2021 - 11:43 PM

That is an interesting point.

One thing to note is that the tables are aligned on a 256 byte boundary.  That allows me to directly load the table address into ZH and the index into ZL.  If they were not aligned I would have to do an add with carry to create the pointer address, and hat would cost an extra 3 cycles for each lookup (so 9 total).

The way the tables are positioned now, eliminating those rows doesn't buy me any space because the next table must be 256 byte aligned (so there would just be a 112 gap).  I might be able to find something to put there, but probably not.

I could make that work if I moved R1H_TBL to be the third table, then the algorithm code will naturally fill into the unused 112 bytes.

112 bytes for 6 clock cycles is an interesting trade.  But at this point I would rather keep the 112 bytes rather than add the cycles.

```N2 = (N * ((R1H_TBL[D] << 8) + R1L_TBL[D])) >> 16;
```

becomes

```if(D>=144) {
N2 = (N * ((1 << 8) + R1L_TBL[D])) >> 16;
} else {
N2 = (N * ((R1H_TBL[D] << 8) + R1L_TBL[D])) >> 16;
}
```

and

```D1 = (R1H_TBL[D] * D) >> 8
N1 = (R1H_TBL[D] * N) >> 8
```

becomes

```if(D>=144) {
D1 = D >> 8
N1 = N >> 8
} else {
D1 = (R1H_TBL[D] * D) >> 8
N1 = (R1H_TBL[D] * N) >> 8
}
```

can't you define up your registers to do some word operations?

``` mov N1L, PRODL           ;1
mov N1H, PRODH           ;1```

like   movw  N1H:N1L, PRODH:PRODL    ;1 cycle to move 16 bits   prod-->N1

there's also ADIW , and SBIW    ...maybe some of these could help save a step here & there,...I think your chip has these opcodes

When in the dark remember-the future looks brighter than ever.   I look forward to being able to predict the future!

Last Edited: Fri. Feb 12, 2021 - 12:13 AM

According to page 22 of the instruction set reference, ADIW and SBIW use one op code but take 2 cycles.  So i could possibly save 2 bytes of op-code space, but not any clock cycles.

There is also the restriction that I can only use these on registers R24 to R30 (so only the index registers).  If I map the N and D arguments into X and Y that could work.

I previously looked it , since it is not always the same everywhere (simply looked at atmega1284) , and it showed 1 cycle...now maybe there is a difference in all the crazy suffixex? none/p/pa/etc)..whhat sheet you use?

There's also errata & mistakes...so then you have to wonder  if any of them are really correct, like the wrong table inserted?--maybe worth an actual test  (try 10 million divides &  get out the stopwatch).

There is also the restriction that I can only use these on registers R24 to R30 (so only the index registers).  If I map the N and D arguments into X and Y that could work.

Generically, at least , it works over a wide range...somewhere they may say otherwise (or lie) about your chip

When in the dark remember-the future looks brighter than ever.   I look forward to being able to predict the future!

Last Edited: Fri. Feb 12, 2021 - 12:38 AM

I had seen that it was 2 cycles in the "AVR Instruction Set Manual".  AVR Instruction Set Manual (microchip.com).  And so that's what I believed.

Now that you have pointed it out, I am inclined to believe the datasheet over the more generic instruction set manual.  I will have to try it in the simulator.

Both the datasheet and Instruction set manual seem to agree that MOVW can be used on any register pair.  Its just that the word sized add/subtract instructions can only be used on X, Y, Z according to the Instruction Set Manual.

Its just that the word sized add/subtract instructions can only be used on X, Y, Z according to the Instruction Set Manual.

Yep, there is a discontinuity between movw & the others--I am often reminded of that when I hit F7 (build).

There are prob a few varieties of manuals, since for many years it was very clear that movw took only 1 cycle  (I pulled mine from studio 4)...what starts out uniform becomes blurrrry over time.

Actually it shows  SBIW & ADIW always took 2 cycles...so only  MOVW can be  a real time savings, the others can save a smidge of code space.

see, you just need to look at this one:

http://jasongu.org/3204/midterm-...

When in the dark remember-the future looks brighter than ever.   I look forward to being able to predict the future!

Last Edited: Fri. Feb 12, 2021 - 01:30 AM

in #2 I was thinking of using CPSE (you just clear a register so you for sure have a zero reg).

(It would have been more use full if it had a CPSNE but it don't.)

About using the MOVW instruction it is very use full but have the limitation that it have to have registers at the same word place, so for multi byte multiplications it's at hit and miss where to place the result (first byte in a even or odd register).

I have seen your way of doing this div, in a databook for a old DSP perhaps there is some hints there. (I think it was for a ADSP2100 but not sure perhaps TI 310xx or 320xx)

I saw 5 places/pairs where you could use movw...just make sure the high bytes are the odd bytes of all pairs.  It doesn't appear autoincrementing is used, so their locations (other than r1:r0)  were arbitrary in the first place

mov D1L, PRODL           ;1
mov D1H, PRODH           ;1

mov N1L, PRODL           ;1
mov N1H, PRODH           ;1

mov PL, PRODL            ;1
mov PH, PRODH            ;1

mov N1L, NL              ;1
mov N1H, NH              ;1

mov N1L, NL              ;1
mov N1H, NH              ;1

also, you have a ton of ram space---why not move your tables there & use LD instead of LPM...you will still have a lot of ram left over (more left over than many AVR's have in the first place--- 1 or 2 k)  Maybe your graphics use it  all up?

When in the dark remember-the future looks brighter than ever.   I look forward to being able to predict the future!

Last Edited: Fri. Feb 12, 2021 - 09:24 AM

I should add that if you want to use movw then you probably should flip your high and low bytes. they should be in same order as r1:r0 .

Those are good suggestions.  Between using movw and using a RAM table I can probably shave off at least 6 clock cycles.  I will have to code it and see.

Also, if you could interweave (combine) your tables to use auto increment, you might have a savings..HOWEVER, I could not see a way to do this, without taking more cycles  ...but just wanted to seed your mind.

For example, here you switch tables, but if the + mode could be used that would be a savings...but would need a different (an prob more lengthy lookup calc)...if you could find a just-as-short lookup calc with interweave, you might eeek out a savings.

```  mov zl, DL               ;1
ldi zh, high(R1L_TBL*2)  ;1
lpm PL, Z                ;3
ldi zh, high(R1H_TBL*2)  ;1
lpm PH, Z                ;3```

lpm PL Z+   ...points to next item to get, no need to change ZH, if data is adjacent in same table

=================

also  you do

setup a mul#1--> prodh:l into hi:lo bytes

then another mul#2 & add prodh into low byte & also the carry byte (adc xxxx , rzero)

if you do the sequence swapped:

then you do not need  to  adc zero into the high byte ...this would work (be possible/beneficial) if you can assure the original high byte is zero, or has already been cleared in some step.

You do test for zero as part of your looping, but it is not the needed byte ...maybe look into this idea some more.

When in the dark remember-the future looks brighter than ever.   I look forward to being able to predict the future!

Last Edited: Sat. Feb 13, 2021 - 06:59 PM

I had originally had the tables with the high and low bytes interleaved in the same table, but ultimately it was faster to separate the high and low bytes into different tables.

Looking at it this way, any kind of calculation on the index can't take less than 1 clock cycles.

If the high and low bytes are in separate tables, then I can directly use the index with no calculation.  To get the second byte all I need to do is load ZH with the other table address (which is a constant).  This only takes one cycle.  So that method is just as good as any other method which relies on an address calculation.

In reality the address calculation is to (one way or another) multiply the index by 2 and add it to the table base address.  If done using a left shift I would need to shift the lower byte and then add the carry flag to ZH (so 2 cycles total).  If I did it using the multiply instructions I would need to load a constant, do the multiply, and then do a 16 bit add of the result to the table base address.  That way is likely to take at least 4 cycles.

So it appears what I already have is most efficient in that respect.

yes just keep the tables as you have them, it's the faster way.

(The first AVR's didn't have a HW mul so I made a fast LUT version where the LUT hold the sqr of 256 numbers and I placed them the same way.)

Note that if/when you switch from  LPM , to RAM (LD) lookup, you have another advantage..lpm can only use Z, so you are always swapping around the high table pointer for different table access.

LD can be based on xh:xl, yh:yl, or zh:zl.....so you may be able to reserve one for each table, to do less swapping (at least the high byte)--the low byte used to say exactly where you are in any table.

Also,  about movw...when you initially set up DH:DL  (in the  caller) , you can possibly use movw there as well. Every step saved counts.

When in the dark remember-the future looks brighter than ever.   I look forward to being able to predict the future!

This is correct but since OP do this:

`.align 256`

The pointers come easy.

The busy place is r1:r0 and nothing can be done about that.

Last Edited: Sun. Feb 14, 2021 - 08:47 AM

I was able to shave off 5 cycles by copying the tables to RAM and using movw in a few places (so I guess we are down to 63 cycles so far).

~10% is nothing to sneeze at...post your latest, maybe we will see new squeezes.

Did you use the fact that LD allows you to use x,y,or z...thus constantly maintain some settings (such as the high or low pointer byte) & not having to keep setting them each time.  The 3 high bytes of the 3 table addresses should stay steady (assuming you  line up  the table locations properly, 0xnn00 to 0xnnFF)-- so you can set  XH nn, YH mm , ZH kk once and forget about those.

If you put a blank line after each rjmp, your posting will be MUCH easier to read

also, dedicate r25:r24 as N2H:N2l

then you can use  & sbiW  adiw ...won't save any speed , but save 3 bytes of program:

subi N2L, 0x01           ;1
sbci N2H, 0x00           ;1
subi N2L, 0x01           ;1
sbci N2H, 0x00           ;1

subi N2L, 0xFF           ;1
sbci N2H, 0xFF           ;1

When in the dark remember-the future looks brighter than ever.   I look forward to being able to predict the future!

Last Edited: Mon. Feb 15, 2021 - 06:31 AM

One of the biggest benefits of LUT based algorithms is that the code itself is relative small (compared to unrolled versions) so you can inline the code itself so you save a call and ret that is 7 clk.

I just got an idea, not checked at all.

Perhaps there could be a way to solve the problem (with LUTs) by first find the distance between the two numbers first 1. (then perhaps the rest perhaps could be 8bit, with split(s) between big and small quotients )

Here are the updates after adding the RAM based tables and using MOVW where appropriate.

The high/low positions of any multi-register arguments were swapped to facilitate use of MOVW.

The max run time is now 62 clock cycles.

A preprocessor symbol (RAM_DIVIDE_TABLE) was added to select use of tables in RAM or in ROM.

There is an init_math routine that gets called once at startup to copy the tables to RAM.

I re-ran my exhaustive 16-hour test (on the actual chip) to make sure every one of the 2^32 combinations of inputs yielded the correct output when compared to a standard shift and subtract type divide routine.  There were no failures.

The tables now look like this...

```#define RAM_DIVIDE_TABLE
#ifdef RAM_DIVIDE_TABLE
.dseg
.align 256
R1H_TBL: .byte 256
R1L_TBL: .byte 256
R2_TBL:  .byte 256
.cseg
init_math:
;Copy ROM divide tables to RAM
clr r1;counter
ldi zl, low(R1H_TBL_ROM*2)  ;1
ldi zh, high(R1H_TBL_ROM*2) ;1
ldi yh, low(R1H_TBL)
ldi yh, high(R1H_TBL)
init_math_loop_1:
lpm r0, Z+                  ;3
st Y+, r0                  ;2
inc r1                      ;1
brne init_math_loop_1       ;2
init_math_loop_2:
lpm r0, Z+                  ;3
st Y+, r0                  ;2
inc r1                      ;1
brne init_math_loop_2       ;2
init_math_loop_3:
lpm r0, Z+                  ;3
st Y+, r0                  ;2
inc r1                      ;1
brne init_math_loop_3       ;2
ret
#else
.cseg
.align 256
#endif
;Recipricol table #1, high byte.
;R1H_TBL[x] = min( high(2^16/x) / 256 , 255)
R1H_TBL_ROM:
.db 0xFF, 0xFF, 0x80, 0x55, 0x40, 0x33, 0x2A, 0x24, 0x20, 0x1C, 0x19, 0x17, 0x15, 0x13, 0x12, 0x11
.db 0x10, 0x0F, 0x0E, 0x0D, 0x0C, 0x0C, 0x0B, 0x0B, 0x0A, 0x0A, 0x09, 0x09, 0x09, 0x08, 0x08, 0x08
.db 0x08, 0x07, 0x07, 0x07, 0x07, 0x06, 0x06, 0x06, 0x06, 0x06, 0x06, 0x05, 0x05, 0x05, 0x05, 0x05
.db 0x05, 0x05, 0x05, 0x05, 0x04, 0x04, 0x04, 0x04, 0x04, 0x04, 0x04, 0x04, 0x04, 0x04, 0x04, 0x04
.db 0x04, 0x03, 0x03, 0x03, 0x03, 0x03, 0x03, 0x03, 0x03, 0x03, 0x03, 0x03, 0x03, 0x03, 0x03, 0x03
.db 0x03, 0x03, 0x03, 0x03, 0x03, 0x03, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02
.db 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02
.db 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02
.db 0x02, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01
.db 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01
.db 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01
.db 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01
.db 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01
.db 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01
.db 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01
.db 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01, 0x01
;Recipricol table #1, low byte.
;R1L_TBL[x] = min( low(2^16/x) mod 256 , 255)
R1L_TBL_ROM:
.db 0xFF, 0xFF, 0x00, 0x55, 0x00, 0x33, 0xAA, 0x92, 0x00, 0x71, 0x99, 0x45, 0x55, 0xB1, 0x49, 0x11
.db 0x00, 0x0F, 0x38, 0x79, 0xCC, 0x30, 0xA2, 0x21, 0xAA, 0x3D, 0xD8, 0x7B, 0x24, 0xD3, 0x88, 0x42
.db 0x00, 0xC1, 0x87, 0x50, 0x1C, 0xEB, 0xBC, 0x90, 0x66, 0x3E, 0x18, 0xF4, 0xD1, 0xB0, 0x90, 0x72
.db 0x55, 0x39, 0x1E, 0x05, 0xEC, 0xD4, 0xBD, 0xA7, 0x92, 0x7D, 0x69, 0x56, 0x44, 0x32, 0x21, 0x10
.db 0x00, 0xF0, 0xE0, 0xD2, 0xC3, 0xB5, 0xA8, 0x9B, 0x8E, 0x81, 0x75, 0x69, 0x5E, 0x53, 0x48, 0x3D
.db 0x33, 0x29, 0x1F, 0x15, 0x0C, 0x03, 0xFA, 0xF1, 0xE8, 0xE0, 0xD8, 0xD0, 0xC8, 0xC0, 0xB9, 0xB1
.db 0xAA, 0xA3, 0x9C, 0x95, 0x8F, 0x88, 0x82, 0x7C, 0x76, 0x70, 0x6A, 0x64, 0x5E, 0x59, 0x53, 0x4E
.db 0x49, 0x43, 0x3E, 0x39, 0x34, 0x30, 0x2B, 0x26, 0x22, 0x1D, 0x19, 0x14, 0x10, 0x0C, 0x08, 0x04
.db 0x00, 0xFC, 0xF8, 0xF4, 0xF0, 0xEC, 0xE9, 0xE5, 0xE1, 0xDE, 0xDA, 0xD7, 0xD4, 0xD0, 0xCD, 0xCA
.db 0xC7, 0xC3, 0xC0, 0xBD, 0xBA, 0xB7, 0xB4, 0xB2, 0xAF, 0xAC, 0xA9, 0xA6, 0xA4, 0xA1, 0x9E, 0x9C
.db 0x99, 0x97, 0x94, 0x92, 0x8F, 0x8D, 0x8A, 0x88, 0x86, 0x83, 0x81, 0x7F, 0x7D, 0x7A, 0x78, 0x76
.db 0x74, 0x72, 0x70, 0x6E, 0x6C, 0x6A, 0x68, 0x66, 0x64, 0x62, 0x60, 0x5E, 0x5C, 0x5A, 0x58, 0x57
.db 0x55, 0x53, 0x51, 0x50, 0x4E, 0x4C, 0x4A, 0x49, 0x47, 0x46, 0x44, 0x42, 0x41, 0x3F, 0x3E, 0x3C
.db 0x3B, 0x39, 0x38, 0x36, 0x35, 0x33, 0x32, 0x30, 0x2F, 0x2E, 0x2C, 0x2B, 0x29, 0x28, 0x27, 0x25
.db 0x24, 0x23, 0x21, 0x20, 0x1F, 0x1E, 0x1C, 0x1B, 0x1A, 0x19, 0x18, 0x16, 0x15, 0x14, 0x13, 0x12
.db 0x11, 0x0F, 0x0E, 0x0D, 0x0C, 0x0B, 0x0A, 0x09, 0x08, 0x07, 0x06, 0x05, 0x04, 0x03, 0x02, 0x01
;Recipricol table #2
;R2_TBL[x] = min( 2^16/(x+256), 255)
R2_TBL_ROM:
.db 0xFF, 0xFF, 0xFE, 0xFD, 0xFC, 0xFB, 0xFA, 0xF9, 0xF8, 0xF7, 0xF6, 0xF5, 0xF4, 0xF3, 0xF2, 0xF1
.db 0xF0, 0xF0, 0xEF, 0xEE, 0xED, 0xEC, 0xEB, 0xEA, 0xEA, 0xE9, 0xE8, 0xE7, 0xE6, 0xE5, 0xE5, 0xE4
.db 0xE3, 0xE2, 0xE1, 0xE1, 0xE0, 0xDF, 0xDE, 0xDE, 0xDD, 0xDC, 0xDB, 0xDB, 0xDA, 0xD9, 0xD9, 0xD8
.db 0xD7, 0xD6, 0xD6, 0xD5, 0xD4, 0xD4, 0xD3, 0xD2, 0xD2, 0xD1, 0xD0, 0xD0, 0xCF, 0xCE, 0xCE, 0xCD
.db 0xCC, 0xCC, 0xCB, 0xCA, 0xCA, 0xC9, 0xC9, 0xC8, 0xC7, 0xC7, 0xC6, 0xC5, 0xC5, 0xC4, 0xC4, 0xC3
.db 0xC3, 0xC2, 0xC1, 0xC1, 0xC0, 0xC0, 0xBF, 0xBF, 0xBE, 0xBD, 0xBD, 0xBC, 0xBC, 0xBB, 0xBB, 0xBA
.db 0xBA, 0xB9, 0xB9, 0xB8, 0xB8, 0xB7, 0xB7, 0xB6, 0xB6, 0xB5, 0xB5, 0xB4, 0xB4, 0xB3, 0xB3, 0xB2
.db 0xB2, 0xB1, 0xB1, 0xB0, 0xB0, 0xAF, 0xAF, 0xAE, 0xAE, 0xAD, 0xAD, 0xAC, 0xAC, 0xAC, 0xAB, 0xAB
.db 0xAA, 0xAA, 0xA9, 0xA9, 0xA8, 0xA8, 0xA8, 0xA7, 0xA7, 0xA6, 0xA6, 0xA5, 0xA5, 0xA5, 0xA4, 0xA4
.db 0xA3, 0xA3, 0xA3, 0xA2, 0xA2, 0xA1, 0xA1, 0xA1, 0xA0, 0xA0, 0x9F, 0x9F, 0x9F, 0x9E, 0x9E, 0x9D
.db 0x9D, 0x9D, 0x9C, 0x9C, 0x9C, 0x9B, 0x9B, 0x9A, 0x9A, 0x9A, 0x99, 0x99, 0x99, 0x98, 0x98, 0x98
.db 0x97, 0x97, 0x97, 0x96, 0x96, 0x95, 0x95, 0x95, 0x94, 0x94, 0x94, 0x93, 0x93, 0x93, 0x92, 0x92
.db 0x92, 0x91, 0x91, 0x91, 0x90, 0x90, 0x90, 0x90, 0x8F, 0x8F, 0x8F, 0x8E, 0x8E, 0x8E, 0x8D, 0x8D
.db 0x8D, 0x8C, 0x8C, 0x8C, 0x8C, 0x8B, 0x8B, 0x8B, 0x8A, 0x8A, 0x8A, 0x89, 0x89, 0x89, 0x89, 0x88
.db 0x88, 0x88, 0x87, 0x87, 0x87, 0x87, 0x86, 0x86, 0x86, 0x86, 0x85, 0x85, 0x85, 0x84, 0x84, 0x84
.db 0x84, 0x83, 0x83, 0x83, 0x83, 0x82, 0x82, 0x82, 0x82, 0x81, 0x81, 0x81, 0x81, 0x80, 0x80, 0x80```

The divide routine looks like this...

```;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;ARGUMENTS:  r16, r17, r18, r19
;  r17:r16 = N (numerator)
;  r19:r18 = D (divisor)
;RETURNS:    r20, r21
;  r21:r20 (quotient)
;  r23:r22 (remainder)
;
;DESCRIPTION:  divides an unsigned 16 bit number N by unsigned 16 bit divisor D
;  Max run time is 62 clock cycles.
;
;ALGORITHM OVERVIEW
;
;RZERO = 0;
;if(D < 256){
;  N2 = (N * ((R1H_TBL[D] << 8) + R1L_TBL[D])) >> 16;
;  P  = N2 * D
;}else{
;  D1 = (R1H_TBL[D] * D) >> 8
;  N1 = (R1H_TBL[D] * N) >> 8
;  if(D1 < 256){
;    N2 = N1 >> 8;
;  }else{
;    N2 = N2 * R2_TBL[D1 & 0xFF];
;  }
;  P = N2 * D;
;  if(P > 65535){
;    N2 = N2 - 1    ;//Decrement quotient
;    N1 = N2 - P + D;//Calculate remainder
;    return;//return quotient in N2, remainder in N1
;  }
;}
;N1 = N - P;
;if(P > N){
;  N2 = N2 - 1;//decrease quotient
;  N1 = N1 + D;//increase reamainder
;  return;//return quotient in N2, remainder in N1
;}
;if(N1 > D){
;  N2 = N2 + 1;
;  N1 = N1 - D;
;  return;//return quotient in N2, remainder in N1
;}
;
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
.def NL    = r16 .def NH    = r17 ;numerator
.def DL    = r18 .def DH    = r19 ;divisor
.def N2L   = r20 .def N2H   = r21 ;temp variables, becomes quotient.
.def N1L   = r22 .def N1H   = r23 ;temp variables, becomes remainder.
.def PRODL = r0  .def PRODH = r1  ;hardware multiply product
.def PL    = r2  .def PH    = r3  ;product
.def D1L   = r4  .def D1H   = r5
.def RZERO = r6                   ;zero value
.def Rx    = r7

idivu_16x16:
clr RZERO                 ;1
;Check that DH is not zero
tst DH                    ;1
brne idivu_16x16_dhne   ;2
;code for D < 256
idivu_16x8:
;lookup low byte of recipricol into P.
;where P = min(2^16 / D,2^16-1)
mov zl, DL               ;1
#ifdef RAM_DIVIDE_TABLE
ldi zh, high(R1L_TBL)	 ;1
ld PL, Z	             ;2
ldi zh, high(R1H_TBL)    ;1
ld PH, Z                 ;2
#else
ldi zh, high(R1L_TBL*2)  ;1
lpm PL, Z                ;3
ldi zh, high(R1H_TBL*2)  ;1
lpm PH, Z                ;3
#endif
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;calculate N2 = (P * N) >> 16
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;     NH:NL
;  X  RH:RL
;------------------------------------------
;   N2H    |   N2L    |  N1H     | dropped
;----------+----------+----------+---------
;          |          | H(PL*NL) | L(PL*NL)
;          | H(PL*NH) | L(PL*NH) |
;          | H(PH*NL) | L(PH*NL) |
; H(PH*NH) | L(PH*NH) |          |
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

mul NL , PL     ;1  PL*NL
mov N1H, PRODH  ;1  N1H <= H(PL*NL)
mul NH , PH     ;1  PH*NH
movw N2L, PRODL
mul NH , PL		;1  PL*NH
add N1H, PRODL	;1  N1H <= H(PL*NL) + L(PL*NH)
adc N2L, PRODH	;1  N2L <= L(PH*NH) + H(PL*NH)
adc N2H, RZERO  ;1  propagate carry to N2H
mul NL , PH     ;1  PH*NL
add N1H, PRODL  ;1  N1H <= H(PL*NL) + L(PL*NH) + L(PH*NL)
adc N2L, PRODH  ;1  N2L <= H(PH*NL) + L(PH*NH) + H(PL*NH)
adc N2H, RZERO  ;1  propagate carry to N2H
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;calculate P = N2 * DL ,note DH=0
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
mul N2L, DL              ;1
movw PL, PRODL           ;1
mul N2H, DL		         ;1
;code for D >= 256
idivu_16x16_dhne:
;Lookup Rx = min(256 / DH, 255)
mov zl, DH               ;1
#ifdef RAM_DIVIDE_TABLE
ldi zh, high(R1H_TBL)    ;1
ld Rx, Z                 ;2
#else
ldi zh, high(R1H_TBL*2)  ;1
lpm Rx, Z                ;3
#endif
;D1 = (D * Rx) >> 8
mul Rx , DH              ;1
movw D1L, PRODL          ;1
mul Rx , DL              ;1
;N1 = (D * Rx) >> 8
mul Rx , NH              ;1
movw N1L, PRODL          ;1
mul Rx , NL              ;1
;if D1H = 0 then use Rx = 256, otherwise use table
tst D1H                  ;1
brne idivu_16x16_dxhne ;2

mov N2L, N1H             ;1
clr N2H                  ;1
rjmp idivu_16x16_checkn;2

idivu_16x16_dxhne:
;Lookup Rx = (2 ^ 16) \ (256 + D1H)
mov zl, D1L              ;1
#ifdef RAM_DIVIDE_TABLE
ldi zh, high(R2_TBL)     ;1
ld Rx, Z                 ;2
#else
ldi zh, high(R2_TBL*2)   ;1
lpm Rx, Z                ;3
#endif
;N2 = (N1 * R2) >> 16
mul Rx  , N1H            ;1
mov PL  , PRODL          ;1
mov N2L , PRODH          ;1
mul Rx , N1L             ;1
clr N2H                  ;1

idivu_16x16_checkn:
;Check result (it may be off by +/- 1)
;P = N2 * D
;NOTE:  N2 <=255 so NxH = 0, also P < 2^16 so we can discard upper byte of DH * NxL
mul DL , N2L             ;1
movw PL, PRODL           ;1
mul DH , N2L             ;1

;if multiply overflowed then...
;decrement quotient
;calculate remainder as N - P + D
subi N2L, 0x01           ;1
sbci N2H, 0x00           ;1
movw N1L, NL             ;1
sub N1L, PL              ;1
sbc N1H, PH              ;1
rjmp idivu_16x16_end   ;2

;Adjust result up or down by 1 if needed.
;Add -P to N, with result in P
movw N1L, NL             ;1
sub N1L, PL              ;1
sbc N1H, PH              ;1
brsh idivu_16x16_pltn  ;2

idivu_16x16_decn2:
;if P > N then decrement quotient, add to remainder
subi N2L, 0x01           ;1
sbci N2H, 0x00           ;1
rjmp idivu_16x16_end   ;2

idivu_16x16_pltn:
;test remainder to D
cp  N1L, DL              ;1
cpc N1H, DH              ;1
;if remainder < D then goto end
brlo idivu_16x16_end   ;2

;if remainder >= D then increment quotient, reduce remainder
subi N2L, 0xFF           ;1
sbci N2H, 0xFF           ;1
sub N1L, DL              ;1
sbc N1H, DH              ;1
idivu_16x16_end:
ret
.undef NH    .undef NL
.undef DH    .undef DL
.undef N2H   .undef N2L
.undef N1H   .undef N1L
.undef PRODL .undef PRODH
.undef PH    .undef PL
.undef D1H   .undef D1L
.undef RZERO
.undef Rx  ```

One optimization that could have a big impact on the code run time would be to see if I can eliminate overflow error checking by rounding the table values up or down by one count.  Right now some combinations of table values and inputs can result in overflow, and I need to check for that.  I might try running a variant of this algorithm on a PC and find those cases and see if rounding the table value up or down by one count fixes the problem without breaking anything else.

In any case I am not aware of any 16-bit x 16-bit divide algorithms that will run on this processor in less than 62 cycles, so I think we are doing pretty good so far.

Excellent!!

Take a look here...maybe use X , Y, Z.....the hi bytes can use immediate load & hold through all cycling.

So now additional YH:YL & ZH:ZL get stomped on (some extra cycles to perhaps restore them), but even then, maybe a net savings, since ZH,YH,XH high bytes can be loaded once & held steady over multiple cycles:

original codes   (lpm's in a few places)
mov zl, DL
ldi zh, high(R1L_TBL*2)  ;1
lpm PL, Z                ;3
ldi zh, high(R1H_TBL*2)  ;1
lpm PH, Z                ;3
==============================
mov zl, DH               ;1
ldi zh, high(R1H_TBL*2)  ;1
lpm Rx, Z                ;3
======================================
mov zl, D1L              ;1
ldi zh, high(R2_TBL*2)   ;1
lpm Rx, Z                ;3

ldi zh, high(R1H_TBL*2)  ;1   PRIOR TO ALL LOOPS  & hold throughout
ldi yh, high(R1L_TBL*2)  CHANGED to Y;1  PRIOR TO ALL LOOPS
ldi Zh, high(R2_TBL*2)   CHANGED to x;1  PRIOR TO ALL LOOPS
--------------
mov yl, DL  CHANGED to Y
lpm PL, Y  SAVINGS NO HI BYTE    ;3

lpm PH, Z  SAVINGS ON ZH  ;3
==============================
mov zl, DH                ;1
lpm Rx, Z  SAVINGS ON ZH    ;3
======================================
mov xl, D1L              ;1
lpm Rx, X  changed to X, savings on hi byte  ;3

By the way, do you know what the min & AVG # of cycles is over all billion(s) combos?

When in the dark remember-the future looks brighter than ever.   I look forward to being able to predict the future!

Last Edited: Mon. Feb 22, 2021 - 08:19 AM

Nice work:

Perhaps build the tables with a slow normal algorithm as a part of the init so you don't need to have them in flash at all. (I did that with my MUL for the mega103 but that was also an easy task).

I also have to say that it's not that often you need a good div routine so there not so many that has looked into this. (And if you need it, speed don't normally matter).

I don't think min really matter (I have not counted it) but the use of the small table should faster.

For AVG I guess it would take about 6 hours to find out   ;)  (now when we know that results are correct)

Last Edited: Mon. Feb 22, 2021 - 12:23 PM

I don't think min really matter

Actually it would be really interesting (maybe) to see a histogram...there are only around 70 max possibilities (bins) to count up, and less than that since 2 or 5, etc cycles never happen

How many times did it take 64 cycles, how about 62 cycles?  Let the computer tell us.

AA has a longest possible time of 200 cycles & BB has a longest possible time of 220 cycles...but BB may 99.8% of the time be 10 cycles faster than AA...so maybe BB is "better"...a tough question of statistics.

When in the dark remember-the future looks brighter than ever.   I look forward to being able to predict the future!

And I make regulators etc. I can't use that everything is fine 99.999% of the time if it crash once a month because 3 worst cases come on top of each other.

I was mainly concerned with worst case run time.  I didn't capture min and average data, but I could do another run an post that.

It may work in some other cases, but in my case I am processing data in a buffer, so having to save/restore a pointer would eat all my time savings in the actual div routine.

I like the idea of building the tables using the slow version of the routine.

The tables themselves are 768 bytes, and a normal divide routine is much smaller than that.  So there is definitely the possibility of a lot of savings.

I will probably keep the flash version of the tables also so that they can be used when the RAM_DIVIDE_TABLE option is turned off.

The table values can be generated from a normal divide routine pretty easily.  The only possible exception is a few cases where I might modify the table to avoid some cases that make me do overflow checking.  I still need to see if that's even possible.  But if it is and its only a few cases then I can just put those values in after the first pass of table generation.

It would probably take on the order of 20K cycles to generate the tables.  I don't think I will notice the extra 1ms at power-up in my application.