[DIS] [ASM] Dirty Math Tricks: Adventures in Division by Ten

Go To Last Post
62 posts / 0 new

Pages

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Just for fun, I have castrated :D the OP's div_10 code even further:

;----------------------------------------
; 8-bit division by 10 routine.
;
; Input: R0 = dividend.
; Output: R1 = result = dividend/10.
;
; Registers R0 and R16 are clobbered.
; 5 words/10 clocks including RET.
; 
div10:
	ldi	r16,205
	mul	r0,r16
	ldi	r16,32
	mul	r1,r16
	ret 
;----------------------------------------

Warning: Grumpy Old Chuff. Reading this post may severely damage your mental health.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

The OP's div_10 code was buggy at values like 19, 29, 39, ... or at 21, 31, 41, ... depending on whether the correction (INC) was done or not. Here's a fixed version, with remainder calculation (remainder lines marked, strip them if not needed, saves 5 cycles):

// input: A, output/value: R1, output/remainder: A
// total cycles: 13
PUSH A		//2 Me (for remainder)
INC A		//1 Me
CPI A, 0	//1 Me
BRNE x		//1,2 Me
DEC A		//1 Me
x:			//0 Me
LDI B, 51	//1 RetroDan
MUL A, B	//2 RetroDan
LSR R1		//1 RetroDan
POP A		//2 Me (for remainder)
SUB A, R1	//1 Me (for remainder)

Hope this is useful for someone out there.

My approach:

I made a list with values, RetroDan and RetroDan with INC. It looked somehow like this:
8 - 0 0
9 - 0 1
10 - 0 1
11 - 1 1

So one was too low, the other too high around exact matches (i.e. dividable by 10 with no remainder). So I took the one without INC (saves one cycle), INCreased the value before giving it to the RetroDan stuff, so all results were accurate - but "255" which becomes 0.

A quick check for the overflow after increasing (i.e. value=0), decrease it again (make it 255 again, which gives the correct result) and done.

P.S: I know it's an old thread, but I think it's an important one that's read by a lot of people for finding a good divide-by-10 routine.

Markus

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

But I would ask at first, whats the purpose to do a division by 10?

Mostly it was the decimal output conversion.
The fastest way to do so, was the subtraction method, since no divison nor multiplication was needed.
A small example for 16 bit ASCII output can be seen on:

http://www.avrfreaks.net/index.p...

Peter

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Quote:
The fastest way to do so, was the subtraction method, since no divison nor multiplication was needed.
But repeated subtraction that that routine uses is division. And the multiplies when done with the hardware multiply are very fast.

Regards,
Steve A.

The Board helps those that help themselves.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Sorry for resurrecting such and old thread again but I just found it and feel I have a slightly different approach that may work better in the general case.

 

I have just finished (almost, still compile time to go!) doing a AC waveform generator. I went to great pains to avoid a single floating point number, and then bent over backwards to avoid any divides so that the needed library functions wouldn't be added in. Probably I will have acres of flash left that the math libraries could easily fit into, but it still feels like doing a better or at least 'artsier' design to use all fixed point arithmetic to do the variable frequency work. I went all out and 32bit index to step through a 512 entry array by right shifting by 23 first, and also did ramp function that further changed the step size by (0-127) then >>7 on the full number to keep maximum fractional accuracy. With 23 bits below the radix point the sine wave will be as step free as the PWM can make it.

 

   But then it came time to display the current, actual frequency based on the ramped step size and I had to divide 32 fixed point integer step size by some insane number like 549756. And I didn't want to use any divides, and I was concerned that doing a multiply by 1/549756 would either overflow or lose too much accuracy. So I calculated the whole thing using only right shifts of the original number. Since this thread was supposed to at least partially tutorial, here is the general, iterative approach I used.

 

Take the bizarre number you want to divide by and get the 1/X. For the above example that is 0.100000000, for mine it is 0.00000181898878775.  Start off with an approximation, you can even start with zero. The subtract the approximation from the desired floating point value, y = 0.1000000 - 0, then take 1/y = 10 and pick the nearest value of 2^n, in this case 8=2^3 makes a good start. So now the approximation is 0.0 + 1/2^3 = .125000000. Now we repeat: y = 0.10000000 - 0.1250000 = 0.0250000; 1/y = -40 because our first approximation was too high. For the next level subtracting 2^5 = 32 looks good. So now we have our approximation 0.0 + .1250000 - 0.0312500 = 0.09375. Already we are within 10%. Now repeat again: y = 0.1000000 - 0.0937500 = 0.00625; 1/y = 160, so 2^7=128 is a close value. Now the approximation is 0 + 0.125 - 0.03125 + 0.0078125 = 0.1015625 and now we are about 1% off. Again: 0.1000 - 0.1015625 = -0.0015625=y; 1/y = -640; try -2^9 gives the approximation 0.099609375. For this particular example it turns out that the series 1/2^3 - 1/2^5 + 1/2^7 - 1/2^9 . . . with alternating signs every 2 powers of two. You can pick the precision you want by how many shifts. Depending on how fast the multiply operation is the trade off between the shift and add/subtract may eventually take longer, but on pretty much any CPU without floating point, a single right shift is going to be faster than multiply and often several right shifts will still be faster. 

 

By the way, for the beginners, the AVR code for the above divide by ten would be

uint8_t  a,b;
uint16_t c,d;
uint32_t e,f;

b = (a>>3) - (a>>5) + (a>>7);  // b = a/10
d = (c>>3) - (c>>5) + (c>>7) - (c>>9) + (c>>11) - (c>>13) + (c>>15);   // d = c/10
f =  (e>>3) - (e>>5) + (e>>7) - (e>>9) + (e>>11) - (e>>13) + (e>>15)  - (e>>17) + (e>>19) - (e>>21) + (e>>23) - (e>>25) + (e>>27)  - (e>>29) + (e>>31);   // f = e/10

Mike

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

mbushroe wrote:
By the way, for the beginners, the AVR code for the above divide by ten would be

Silly question but have you actually counted the cycles for that "sea of shifts" compared to just doing the div ? Certainly for the last one (f=) I bet a single call to the library divide code will cost less than 21/23/25/27.. shifts!

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Have you done any tests?

 

100/10 =10

 

you get 

12-3+0=9

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0
uint8_t a;
uint16_t b;
uint32_t c;

uint16_t temp1;
temp1 = a * 26;
a = high_byte(temp1)    // a = (a * 26)/256

uint32_t temp2
temp2 = b * 6554;
b = high_word(temp2)    // b = (b * 6554)/65536

uint64_t temp3
temp3 = c * 429496730;
a = high_dword(temp3)    // c = (c * 429496730)/4294967296

When dividing by a constant, there is often a way to multiply by a value and then use upper bits for result...

(The above should be done in assembly for optimum efficiency)

David (aka frog_jr)

Last Edited: Tue. May 24, 2016 - 03:15 PM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

clawson,

   You have a very valid point. I went back and looked at the App Note 200 which gave some information about multiply and divide subroutines and the code space and execution speed needed for each. I assume that these are code wise a single function that will handle all multiplies or divides of the correct sizes. I did not, however, see a single 32bit by 32 bit divide, so none of the offered routines comes close to reaching the large numbers and/or fine resolution fixed point arithmetic that I was using in my code. And the code as presented above might not be the best way to write it so that it compiles well. And I must also admit that although I have assembly coded 6502s, 68000s, 286s, and machine coded an earlier 16 bit machine I forget the name of, I have not assembly coded the ATMega or other embedded, RISC, memory mapped register CPUs before, so my register use may be a bit off. 

 

If we scale back to the most difficult case handled in AppNote 200, the 16/16 with 16 result then we are looking at the middle case with _only_ 7 shifts right. Rewriting to a more confusing but closer to assembly code version we used nested short shift rights and add/subtract the next

d = (c>>3) - (c>>5) + (c>>7) - (c>>9) + (c>>11) - (c>>13) + (c>>15);   // d = c/10

d = (c- (c- (c- (c- (c- (c- (c>>2) )>>2  )>>2  )>>2  )>>2  )>>2  )>>3; 

using 2 16 bit combined registers, and assuming that the instruction times listed apply for pair registers as well as single

 

                 clocks
	xor cd,cd   1   // clear results rgister pair
	ld  ab,_C   2   // load 16 number to be divided by 10
	lsr ab,2    1   // do innermost shift, with result in >>15
	movw  cd,ab 1   // since result is empty can straight load for first step
	lsr ab,2    1   //  do next outer, result >>13
	sub cd,ab   1   //  alternating signs in original expression mean always sub here
        lsr ab,2    1   //  >>11
	sub cd,ab   1
	lsr ab,2    1   // >>9
	sub cd,ab   1
	lsr ab,2    1   // >>7
	sub cd,ab   1
	lsr ab,2    1   // >>5
	sub cd,ab   1
	lsr ab,3    1   // first one was >>3
	sub cd,ab   1   // _C/10 now in cd
	ld  _d,cd   2   // save result in memory
	   clocks  19

This means that a specialized, one divisor only 16bit by 16bit divide would complete in 19 clock cycles, which is much faster then the execute optimized divide subroutine of 117 clock cycles. The advantage of using the routine is that it works for any divisor. But if your program needs only one dedicated division by a fixed number, you can essentially do the long division first by hand, and only load in the actual math steps that get used, with a nearly factor of 10 improvement in speed. Other divisors may not need as many shift terms to get adequate accuracy. And I don't know how much slower a 32bit by 32bit general purpose integer divide would be, accept that I would expect a minimum of 4 times the 16x16.

 

Mike

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

OOPS!

 

yes, you have me there. This method is still an approximation. The problem is that dividing by the smaller terms loses the fractional portion. An alternative would be to recast the 8 bit value 100 into a 16bit value where it is in the high byte, do exactly the same steps, then take only the high byte of the result. Then the lost fractional parts are nolonger lost and the answer is more correct (probably never perfect across all numbers). But having to use a 16x8 divide operation on an originally 8 bit value does seem a wast.

 

I guess that in order to get the exact answer you have to do the long division.

 

EDIT: Further testing show that small number sometimes read 1 high, large numbers sometimes read 1 low. This is due to having numbers small enough to only get non-zero responses from 1 or 2 terms, which are not as accurate as all 3 for the byte operation. You almost have to run the routine to approximate 1/10th, then multiply the result by ten and check for being over or under and then correcting by one. No longer sounds quiet so efficient.

 

Mike

Last Edited: Tue. May 24, 2016 - 07:56 PM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

unless you really need speed mul with 26 is a bad extimate (first error at 69 that give 7).

 

For best result with mul use the biggest 8 bit multiplicand in this case 205 and then do the shifts, that is correct for all 8 bit numbers. 

Pages