[DIS] [ASM] Dirty Math Tricks: Adventures in Division by Ten

Go To Last Post
62 posts / 0 new

Pages

Author
Message
#1
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

TITLE: DIRTY DAN's DIRTY MATH TRICKS

PURPOSE: On Going Tutorial on Math Tricks in AVR Native Machine Language for Beginners.

PREAMBLE: While trying to help someone in another FORUM with a Division-by-60 routine to converting seconds to minutes, someone made the remark:

Quote:

What RetroDan has is a quick and dirty approximation!

I think it was meant as an insult, but it gave me a great name for the type of binary routines I love: Dirty Math!

Quick & Dirty Routines that are small and super fast.

CAVEAT LECTOR: If you don't like taking chances and experimenting and don't really care about how big your programs are, or how fast they excecute then this is not a place for you. May I recommend the Atmel Appnotes #200, 201 and 204 that are chock full great standard math routines for the AVR microcontrollers.

Last Edited: Fri. Mar 31, 2006 - 09:40 AM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

TITLE: DIRTY DAN's MATH TIPS

INTRODUCTION:

Here are a few quick tips before we start:


DIRTY DAN'S TIP #1: ALWAYS USE A POWER OF TWO 

If you are writing any program or routine, even if the savings are not obvious at first, pick numbers that are powers-of-two. Somewhere down the line you'll probably be glad you did. Make it a habit and your microcontroller will thank you too.

For example: if you need to take an average, don't take 10 readings and divide by 10, take 8 or 16. If you're setting up a table, don't make entries 10 bytes long, make them 8 or 16. It will greatly simplify things later. There's nothing magical about powers-of-ten to a microprocessor. They're set up for powers-of-two and the number of fingers we have means absolutely nothing to them.


DIRTY DAN'S TIP #2: AVOID DIVISION AT ALL COSTS! 

While Addition, Subtraction and Multiplication can be done in just a few clock cycles, most division involves looping and uses a lot of processor time. Division is normally done by repeated subtractions, the "standard" 16x16 division routine from the Atmel Appnotes averages over 250 clocks cycles compare that to just 9 for a 16x16 multiply and just 2 for a 16x16 Addition.

One problem with high-level languages is that this dirty little fact is kept hidden from the programer. So while you're happily writing code that does Bilinear Interpolation or Cubic Splines, your processor is swearing at you under his breath everytime you ask for a division. You're wondering why things are grinding to a halt with smoke drifting up from the PCB.


DIRTY DAN'S TIP #3: IF YOU MUST DIVIDE, DO IT BY POWERS OF TWO! 

Division by a power of two can be easily and quick accoplished with just a few right-shifts. Each time you shift your bits to the right you are essentially dividing your number in half. So to divide by four you just shift-right twice.


DIRTY DAN'S TIP #4: IF YOU MUST DIVIDE, DO IT ONLY ONCE! 

If your equation has multiple divisions involved, try and re-arrange your equation so you only divide once.

For example: if you need to divide by 10 and later divide by 12, perhaps you can re-arrange things and just divide once by 120 and be done with it.

Last Edited: Thu. Apr 13, 2006 - 08:11 AM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

TITLE: DIRTY DAN"s MATH TRICKS: DIVISION-BY-TEN

PREAMBLE:

I've been working on my own driver routines for the LCD Display on the Butterfly (same as STK502) that are super small and fast. When it came time to write a little routine that displays a number in decimal, rather than Hex, I ran into my old nemesis DIVISION!

To be more precises, Division-by-Ten. Since we humans are so fond of Division-by-Ten, I thought it would make an excellent starting point for our examination of my Dirty Math Tricks.

For this routine, all we want is to take a single byte value and display it on the LCD Display in Decimal Number Form. So we need to take a number with a maximum value of $FF and break it into three: 2 / 5 / 5.

The way most of us humans would do this is to first divide by it by 100 and then by 10 and then keep the remainder;

255 / 100 = 2     (remainder of 55)
55  / 10  = 5
remainder = 5

Oh no, that means we need to have two division routines! One for division by 100 and another by ten. Mybe we should just use a general division routine and change the divisor from 100 to 10 rather than have two routines. So I go and check the Atmel Appnote 200 looking for a general 8x8 unsigned division routine and here is what I found:

;***************************************************************************
;* "div8u" - 8/8 Bit Unsigned Division
;*
;* This subroutine divides the two register variables "dd8u" (dividend) and 
;* "dv8u" (divisor). The result is placed in "dres8u" and the remainder in
;* "drem8u".
;*  
;* Number of words	:14
;* Number of cycles	:97
;* Low registers used	:1 (drem8u)
;* High registers used  :3 (dres8u/dd8u,dv8u,dcnt8u)
;***************************************************************************

.def	drem8u	=r15		;remainder
.def	dres8u	=r16		;result
.def	dd8u	=r16		;dividend
.def	dv8u	=r17		;divisor
.def	dcnt8u	=r18		;loop counter

div8u:	sub	drem8u,drem8u	;clear remainder and carry
	ldi	dcnt8u,9	;init loop counter
d8u_1:	rol	dd8u		;shift left dividend
	dec	dcnt8u		;decrement counter
	brne	d8u_2		;if done
	ret			;    return
d8u_2:	rol	drem8u		;shift dividend into remainder
	sub	drem8u,dv8u	;remainder = remainder - divisor
	brcc	d8u_3		;if result negative
	add	drem8u,dv8u	;    restore remainder
	clc			;    clear carry to be shifted into result
	rjmp	d8u_1		;else
d8u_3:	sec			;    set carry to be shifted into result
	rjmp	d8u_1

Only 14 program steps is not bad. Almost 100 clock cycles per call and we need to call it twice fort a single byte conversion. That's about 200 clock cycles each. I'm sure we can do better than that...

Last Edited: Thu. Apr 13, 2006 - 08:13 AM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

TITLE: DIRTY DAN"s MATH TRICKS: DIVISION-BY-TEN (Continued)

Previously we divided our 1 byte number by 100 then by 10 and when it comes time to expand our routine to handle 16 bit numbers we'll need to divide by 10,000, 1,000, 100 and finally by 10. At approx. 100 cycles per call that's 400 clock cycles per number.

Maybe we should look at the problem again from the "other side" the right-hand-side instead of left.

255 / 10 = 25   Remainder = 5
25  / 10 = 2    Remainder = 5
2   / 10 = 0    Remainder = 2

Well we've managed to extract the 255 again, but now we need Three calls to the division routine and for a 16 byte number we'll need Five. The only thing we have going for us using this method is that if we can find a "special" division routine for Division-by-Ten that is smaller and/or faster than the Atmel routine, we might have something we can work with...

Last Edited: Fri. Mar 31, 2006 - 05:50 AM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

TITLE: DIRTY DAN"s MATH TRICKS: DIVISION-BY-TEN (Continued)

I remember someone posting a file of Division-by-a-Constant Routines, so why don't we see if we can get luck there.

Here is what I found, exactly what we're looking for, a specialized routine for Division-by-Ten:

;***************************************************************************
;* Function "Div16_10"
;* Divides an unsigned 16 bit word (XH:XL) by 10
;* Returns quotient in YH:YL and remainder in XL
;*
;* Author: Andreas Lenze (andreas.lenze@t-online.de)
;* Equations by D: W. Jones:
;*
;*	Reciprocal mul w. extra precision:
;*	unsigned int A;
;*	unsigned int scaled_reciprocal = 0xCCCD;
;*	unsigned int Q; /* the quotient */
;*
;*	Q = ((A * 0xCCCD) >> 19)
;*	
;* Uses: high regs: 7 (r17, r18, r19, X, Y)
;*	 low regs:  3 (r0, r1, r2)
;*
;*	 words:     36 (w. push/pop = 8 words)
;*	 cycles:    46 (w. push/pop = 16 cycles)
;*
;* Note: Hardware multiplier required ("mul" instruction)
;*
;***************************************************************************

Div16_10:
	push	r2
	push	r19
	push	r18
	push	r17

	ldi	YH,0xCC		; scaled reciprocal for /10
	ldi	YL,0xCD

	; Q = A * 0xCCCD
	; (r19:r18:r17[:rXX] = XH:XL * YH:YL)
        clr     r2
        mul     XH, YH		; ah * bh
        movw    r19:r18, r1:r0
        mul     XL, YL		; al * bl
	mov	r17,r1		; r0 to [rXX] is superfluous
        mul     XH, YL		; ah * bl
        add     r17, r0
        adc     r18, r1
        adc     r19, r2
        mul     YH, XL		; bh * al
        add     r17, r0
        adc     r18, r1
        adc     r19, r2

	; Q = Q >> 16: use r19:r18 as word
	; Q = Q >> 3
	lsr	r19		; do the last 3 shifts
	ror	r18
	lsr	r19
	ror	r18
	lsr	r19
	ror	r18

	; r19:r18 now "Q" (= result >> 19)
	; R = A - 10*Q;
	ldi	r17,10		; multiply r19:r18 by 10
	mul     r18, r17	; al * bl
	sub	XL,r0
	clr	XH
	movw	YL,r18		; make copy of Q

	; XL holds "R"
	; YH:YL holds "Q"
	pop	r17
	pop	r18
	pop	r19
	pop	r2

	ret

Wow, I looks huge! Let's see 46 program steps, that's a little on the heavy side, and 10 registers used ouch!. However, it is a 16 bit Divide-by-Ten Routine and it excecutes in about 1/2 the time that the Atmel 8 bit routine did, so we're moving in the right direction. Two or Three calls to this would be 100 to 150 cycles versus the 200 to 300 using the Atmel Routine.

If we can't come up with something on our own, perhaps we could strip this down to an 8 bit routine, or perhaps just bite the bullet because sooner-or-later we're gonna' need to convert 16 bit values anyway...

Last Edited: Thu. Apr 13, 2006 - 08:18 AM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

TITLE: DIRTY DAN"s MATH TRICKS: DIVISION-BY-TEN (Continued)

A while back, while literally sitting-on-the-throne, I happened to realized that 255 is evenly divisable by 5 and that it is only one away from 256 which is a Power-of-Two. After a while, it becomes a habit to always be on-the-lookout for Powers-of-Two. Either that or I'm slowly drifting into the world of padded-rooms and nice doctors in white smocks. Of all the things I've lost, I miss my mind the most.

To divide by Ten we might be able to first Divide-by-Five then Right-Shift to Divide-by-Two and the result would be a Division-by-Ten. I stored that fact away and wondered if the 255-256 link could be put to use sometime.

Pehaps today is a good day to drag it out of the bottom drawer and dust it off...

Let's see: 255 divided by 5 is 51 and 255 is awlful close to 256 and furthermore division by 256 is super-super easy and requires ZERO program steps. The "Dirty Trick" to that is to simply ignore the least significant Byte and we've essentially Divided our number by 256.

Since 255 is almost equal to 256, then 255/256 is almost equal to 1

Since multiplication by one does not change a value
Then 1/5 times 255/256 shoud be pretty close to 1/5

Since division by 256 can be accomplished with the Dirty Math Trick of ignoring the lowest byte, we'll leave that to last and our equation becomes:

1/5 time 255 and later we'll truncate the lower byte.

1/5 x 255 = 255 / 5 = 51

So mathematically speaking if we multiply our number by 51 and later
ignore the lower byte, we should get a pretty good approximation of
Division-by-Five. That's where our first routine will start...

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

TITLE: DIRTY DAN"s MATH TRICKS: DIVISION-BY-TEN (Continued)

Let's start by writing a routine that muliplies by 51 and we'll look at the high byte and see if our mathematical assumptions were correct.

Here is the code:

DIV5: LDI B,51
      MUL A,B  ;ANSWER LEFT IN R1
      RET 

Great it's only 3 lines long!!! Let's test it's accuracy.
We're interested in a single byte so values can go from zero to 255
so lets check a few numbers in that range and see how it performs:

 10 =>  1 (off by -1)
 50 =>  9 (off by -1)
100 => 19 (off by -1)
150 => 29 (off by -1)
200 => 39 (off by -1)
250 => 49 (off by -1)

Hmm, we're alway off by minus one, so if we add one, we should be right on the money!

So the new Division-by-Five Routine becomes:

DIV5: LDI B,51
      MUL A,B
      INC R1 ; ANSWER IN R1
      RET 

So we're half-way-home, all we need to do now is divide our answer by two and we should have a "Quick and Dirty" Divide-by-Ten Routine...

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

TITLE: DIRTY DAN"s MATH TRICKS: DIVISION-BY-TEN (Continued)

There's two ways we can use our Division-by-Five Routine for a Division-by-Ten Routine.

We can modify the DIV5 routine. Earlier I stated that a "Quick and Dirty" way to Divide-by-Two is to just Shift your bits to the right. This is how we modify our DIV5 Routine to get a DIV10:

DIV10: LDI B,51 
       MUL A,B 
       INC R1   ;R1=A/5
       LSR A    ;R1=(A/5)/2 = A/10 
       RET 

The other way we could have accomplished the same thing would be to have the DIV10 routine call the DIV5 routine and that way we "kill-two-birds-with-one-stone" because we get two usable routines from it:

DIV10: RCALL DIV5  ;R1=A / 5
       LSR R1      ;R1=(A/5) / 2 = A/10
       RET

DIV5:  LDI B,51 
       MUL A,B 
       INC R1      ;ANSWER IN R1 
       RET

Just to be on-the-safe-side let's double check the accuracy of these routines:

 10 =>  1 Correct! 
 50 =>  5 Correct! 
100 => 10 Correct! 
150 => 15 Correct! 
200 => 20 Correct! 
250 => 25 Correct! 

So we did it. We have our first "Dirty Math" Routine that is 100% accurate over the range we need.
Stay tuned however, there's still more to this story...

Last Edited: Fri. Mar 31, 2006 - 07:26 AM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

TITLE: DIRTY DAN"s MATH TRICKS: DIVISION-BY-TEN (Continued)

PRE-SUMMARY:

Not bad for a days work eh? We took a rather large routine that we needed and reduced it to a "Quick and Dirty" little routine using a few "Dirty Math Tricks".

Let's compare and see how well we did.
First let's compare to the Atmel 8x8 Unsigned Division Routine:

                 DIV10a   DIV10b   ATMEL 
REGISTERS:         2        2        4
 DIFFERENCE:      -2       -2        -
  PERCENT:       -50%     -50%       -

PROGRAM WORDS:    5         7       15(incl RET)
 DIFFERENCE:    -10        -8        - 
  PERCENT:      -66%      -53%       -

SPEED/CYCLES:     5         8       97
 DIFFERENCE:    -92       -89
  PERCENT:      -95%      -92%

So we've used 1/2 the register space, about 60% less program space and we're running at about 20 times the speed that the Atmel Routine does.

Let compare to the Lenze/Jones Routine:

                 DIV10a   DIV10b   LENZE 
REGISTERS:         2        2       10
 DIFFERENCE:      -8       -8        -
  PERCENT:       -80%     -80%       -

PROGRAM WORDS:    5         7       36
 DIFFERENCE:    -31        -29       - 
  PERCENT:      -86%       -80%      -

SPEED/CYCLES:     5         8       46
 DIFFERENCE:    -41       -38        -
  PERCENT:      -89%      -83%       -

Against the Lenze/Jones routine we're using 80% less Register Space, about 85% less Program Space and w'e're running at between 6 to 10 times the speed.

Many might think that this is a great accomplishment and that I should stop at this point. Especially those that probably would have just used the larger routines and not had a second thought about it.

Well, obviously some of you don't know me very well.
I think I can pull a few more "Dirty Tricks" from out of my hat and turn this Dirty Littlle Routine into something totally scandalous. I'm not happy until the routine is so small that it pulls numbers from thin air, like magick...

Last Edited: Thu. Apr 13, 2006 - 08:24 AM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

TITLE: DIRTY DAN"s MATH TRICKS: DIVISION-BY-TEN (Continued)

Let's re-examine what it is we're doing mathematically with our Dirty Little DIV10 Routine:

We're multiplying by 51, adding one, then dividing by two, and then truncating to get a final division of 256:

A / 5 ~= [ [ ( A x 51 ) + 1 ] / 2 ] / 256

Let's ignore that final division by 256 because it's accomplished with ZERO program steps by ignoring the lower byte and there's no way we can improve on that unless there are opcodes that can move us backwards in time.

So we want to look closely at the right side of this equation:

256 x [A / 5] ~= ( 51A + 1) / 2
256 x [A / 5] ~= 51A/2 + 1/2

Well 1/2 doesn't really mean much to a microprocessor, you either have a bit that is on or its off. We'll ignore it for now and if need be, we'll increment something in the end like we did with our DIV5 routine.

So now we have:

256 x [A / 5] ~= 51A/2
256 x [A / 5] ~= 25.5 x A
256 x [A / 5] ~= 25A

Since we dropped a 1/2 earlier and now we're stuck with another 0.5 perhaps we should round the 25.5 upto 26 rather than truncate it down to 25.

256 x [A / 5] ~= 25.5 x A
256 x [A / 5] ~= 26A

So to divide by 10 all we need to do is multiply by 26 and ignore lower byte:

MUL10: LDI B,26
       MUL A,B   ;ANSWER IN R1

It sounds incredable, but let's check the results and see what our error ratio is:

 10 =>  1  Correct!
 50 =>  5  Correct!
100 => 10  Correct!
150 => 15  Correct!
200 => 20  Correct!
250 => 25  Correct!

So it's 100% accurate over the range we need and it's only two program lines.

We started with rather large routines requiring 50 to 100 clock cycles per call and have something that runs in just three.

Can we reduce it even more?

Not really, not until Atmel comes out with a MULI command that allows us to multiply a register with an immedaite value rather than loading it into another register first.

Maybe next year...

It would probably look like this:

DIV10: MULI A,26  ;ANSWER IN R1
Last Edited: Fri. Mar 31, 2006 - 09:14 AM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

TITLE: DIRTY DAN"s MATH TRICKS: DIVISION-BY-TEN (Conclusion)

IN CONCLUSION:

By systematically applying a few Dirty Tricks and following our instincts combined with a little high-school math, we took some rather large routines and reduced them down into just a couple program steps. Something that many would say is impossible., but never say never my friend. If I can do it, so can you!

REQUEST FOR FEED BACK:

I hope you found this posting entertaining as well as educational.

If you would like to see more like it, please be sure to let the moderators know.

Thank you for your time and consideration.

Last Edited: Thu. Apr 13, 2006 - 08:27 AM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0
MUL10: LDI B,26
       MUL A,B   ;ANSWER IN R1

Quote:

So it's 100% accurate over the range we need and it's only two program lines.

This is not true. For a 'div by 10' this is an approximation which will not yield correct results for all integers between 1 and 255 decimal (try for example 169, 159 ... for 'A' to see it fail). The 'scaled reciprocal' for a correct 'division by 10' of an 8 bit integer would have to be at least 9 bits to reach the required precision.
Please check out the mathematical basics for this technique at http://www.cs.uiowa.edu/~jones/bcd/divide.html

The claims in your comparison

                 DIV10a   DIV10b   LENZE
REGISTERS:         2        2       10
 DIFFERENCE:      -8       -8        -
  PERCENT:       -80%     -80%       -

PROGRAM WORDS:    5         7       36
 DIFFERENCE:    -31        -29       -
  PERCENT:      -86%       -80%      -

SPEED/CYCLES:     5         8       46
 DIFFERENCE:    -41       -38        -
  PERCENT:      -89%      -83%       -

are also incorrect: the '46 cycles' for 'Lenze' is including register saves (push/pop), the naked division routine w/o reg saves and getting the remainder is:

	ldi	YH,0xCC		; scaled reciprocal for /10
	ldi	YL,0xCD

	; Q = A * 0xCCCD
	; (r19:r18:r17[:rXX] = XH:XL * YH:YL)
        clr     r2
        mul     XH, YH		; ah * bh
        movw    r19:r18, r1:r0
        mul     XL, YL		; al * bl
	mov	r17,r1		; r0 to [rXX] is superfluous
        mul     XH, YL		; ah * bl
        add     r17, r0
        adc     r18, r1
        adc     r19, r2
        mul     YH, XL		; bh * al
        add     r17, r0
        adc     r18, r1
        adc     r19, r2

	; Q = Q >> 16: use r19:r18 as word
	; Q = Q >> 3
	lsr	r19		; do the last 3 shifts
	ror	r18
	lsr	r19
	ror	r18
	lsr	r19
	ror	r18

	; r19:r18 now "Q" (= result >> 19)

That's 25 cycles for a 'divide 16 bit number by 10' (result BTW is always correct over the full range 1 - 65535).

Andreas

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Well now, Andreas... The argument could be made that RetroDan's solution is *more correct* than the traditional integer approach.

169/10 = 16 if we follow traditional definitions.

Of course, the "true" result is 16.9. But we cannot express 16.9 in an integer. So we round. The traditional approach for computer integer arithmetic is to always round down.

Mathemeticians would tell us that we should round anything from 0.0 up to 0.49999999... as 0, and anything from 0.500000000...01 up to 0.9999999999999999... as 1. There is a gray area for 0.5 exactly. I was taught a convention that in that case, you round to the even-numbered integer.

From that prespective, RetroDan's solution comes much closer to the "mathematically" correct solution. 169/10 = 17.

As for the amount of pushing and popping required... well, that's relative to the register usage conventions at hand. In the worst case (with all destoryed registers saved and restored), RetroDan's solution is still smaller and faster than the best-case (with no destroyed registers saved and restored) "Lenze" solution.

Last Edited: Fri. Mar 31, 2006 - 03:18 PM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Yup, actually I thought the *26 then ditch 8 bit thing was really clever. Obviously it works because 26/256 is 0.1015625 which for small numbers is OK as a way of multiplying by roughly 0.1. If you tried a larger value like 800 then I guess it'd be 800 * 26 = 20,800 and divided by 256 this is then 81 - so the extra 0.0015625 has come into play to create the error. But on small numbers it is a clever technique for an "integer divide"

In fact just checked and it's good for all values up to 630 in fact, the first one where it goes wrong is 631

Cliff

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Quote:

From that prespective, RetroDan's solution comes much closer to the "mathematically" correct solution. 169/10 = 17.

Ah well, Luke - for '169', yes. But even if we assume an 'implicit rounding' scheme, this is not handled correctly, for ex. in the range 0-9 we get '0' as result up to dividend 9, ditto e.g. in the 50-59 range ('5' up to 59).

Mathematically correct - as to the mathematicans - would be the 'increment result' from '5' or '55' upwards. OP claims '100% correct' for the algorithm which it isn't.
It would be correct if it either were returning the remainder together with the result or providing the mathematically correct rounding you described.

Quote:
In the worst case (with all destroyed registers saved and restored), RetroDan's solution is still smaller and faster than the best-case (with no destroyed registers saved and restored) "Lenze" solution.

True, except for the fact that the OP is manipulatively comparing a 'sort of 8 bit integers division' (nice as long as you don't need correct results) with a 16 bit integers division (add another 6 cycles to get the remainder - blows it up to 31 ... geeeze ...) which, thanks to Mr Jones, does return correct results.

BTW: As this is the 'tutorial' forum (contents aimed at beginners) I honestly think that it would be more helpful if an approximation technique were explecitely labeled as such instead of trying to sell it off as real binary mathematics.

Andreas

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

alenze wrote:
BTW: As this is the 'tutorial' forum (contents aimed at beginners) I honestly think that it would be more helpful if an approximation technique were explecitely labeled as such instead of trying to sell it off as real binary mathematics.

I apologoize Alenze, as the comparison was not meant to in any way diminsh the excellent routines you have in your Div_xx Package. It is full of "tried and true" routine, and are backed by the work of Jones. If anyone is looking for "correct" routine I recommend them highly

However, I don't think you read the CAVEAT LECTOR warning in my first post. These postings are to discuss "Dirty Math Tricks" and hopefully have some fun at the same time. If anyone is uneasy about what I'm doing, I suggest they stick to the proven routines like the ones you and Atmel provide and forget my threads even exist.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

"Closer to mathematically correct". I wasn't claiming that it was a miracle cure-all solution. In fact, it conveniently works out that way by pure coincidence of the range of numbers involved..

Anyway... The "traditional" approach to integer division (ignoring remainders for the moment), and Retro's "radical approach" to integer division are both wrong when compared to result using real numbers.

When compared to the "real number" result, the traditional results will always be less than or equal to the true result, with a maximum absolute error error of -1 LSB, and this maximum error is repeated consistently with a periodicity of 10 for all possible inputs.

RetroDan's results may differ from the "real number" result as well. The result may be greater than or less than the "real number" result. But it, too, will have a maximum absolute error of +/-1 LSB for the range of inputs we're interested in.

Note that the original intent of this thread was only to do 8-bit division by 10. And we actually only approach an absolute error of -1 LSB at the lower extreme with inputs of around 0, and +1 LSB at the upper extreme with inputs of around 630. In the middle of the range, the aboslute error will be limited to a band of +/- 0.5 LSB

In light of these results, I don't think it's fair to label his as "more wrong" than the traditional results.

Of course, if you want to have access to the remainders so that your application can fine-tune its interpretation of the result, then that's a strong argument in favour of some other approach.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Finding remainders:
- Multiply the result by 10. Label as 'x'.
- Subtract 'x' from the original input.
- This result is equal the the "Remainder". Using RetroDan's code, the remainder can either "positive" or "negative", but it will always have a magnitude of 10 or less. If negative, then feel free to subtract 1 from the quotient and add 10 to the remainder. And presto! You have a result that is totally identical to the "Lenze" result, for all possible 8-bit inputs.

; RetroDan's original code
DIV10: LDI B,51 
       MUL A,B 
       INC R1   ;R1=A/5 
       LSR R1   ;R1=(A/5)/2 = A/10 
; Luke's insertion
       MOV C, R1
       LDI B, 10
       MUL C, B
       SUB A, R0
       ; 'A' contains the (signed) remainder.
       ; sign of the remainder indicates whether it is an:
       ; - under-estimate (positive remainder), or
       ; - over-esitmate (negative remainder)
       ; 'C' contains the quotient
       ; Now a 100% _mathematically_ correct result for all 8-bit inputs.
       ; (Destroys the original input operand 'A'.
       ;  one could return the remainder in register
       ;  'B' at the cost of one extra MOV instruction.)
       RET
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Regarding the division by 5 routine you said:

Quote:
Hmm, we're alway off by minus one, so if we add one, we should be right on the money!

But the only numbers that you checked were ones that were actually divisible by five. In fact, these are the only numbers where this routine returns the correct answer. If someone wants to use this, then some correction would definitely be necessary. What I would suggest is to not add the one, then find the remainder as lfmorrison points out. In this case the remainder will always be in the range of [0, 5]. If the remainder is 5, add one to the quotient and make the remainder 0. Otherwise, the quotient and the remainder are correct.

Applying this to the divide by 10 (the '* 51' version, not the '* 26' version), again if you drop the 'add one', the remainder is always in the range of [0, 10]. So again the only adjustment is for a remainder of 10.

Keep in mind that these remainder ranges are only valid if the original number is 8 bit. I quickly goes awry for larger numbers. Even 261 produces an out of range remainder. You could subtract 10 (or 5 for the div by 5) from the remainder instead of just making it 0 to extend the range. To be more specific, it will be correct as long as the answer fit in 8 bits.

Regards,
Steve A.

The Board helps those that help themselves.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

One way to do this (sorry, lfmorrison's signed remainder isn't always mathematically correct - "we should round anything from 0.0 up to 0.49999999... as 0, and anything from 0.500000000...01 up to 0.9999999999999999... as 1":

; Division by 10, useing reciprocal multiplication
;
; Call with:
;	8 bit dividend in r16
;
; Returns:
;	Result in r16
;	Remainder in r18

	mov	r18, r16	; save original dividend
	ldi	r17,26		; reciprocal, scaled *256, off a bit
	mul	r16,r17

; result of multiplication now in r1:r0
; use only r1, thereby effectively divide by 256
	dec	r1		; if imprecise scaling value influences result,
				; result will be '+1'. Decrement to avoid neg. 
				; value in later subtraction 
	ldi	r17,10		; re-use r17 for divisor
	mov	r16,r1		; save result/256
	mul	r16,r17		; find remainder by multiplication of result by 10d

; result again in r1:r0
; get remainder and correct result of 'by 10'
	sub	r18,r0
	cpi	r18,10
        brlo	done
	subi	r18,10
	inc	r16

done:
; result in r16, remainder in r18 

;optional rounding:
	cpi	r18,5
	brlo	done_r
	inc	r16
done_r:

(Quick hack, up to two or three cycles could be shaved off)

Andreas

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I'm going to start with an apology here, because I might be thinking differently, and appear to be stiffling a very simulating and valuable discussion. However as a tutorial, I feel this thread is misplaced. It should be in the main forum.

My thoughts of a tutorial are as an introduction, a quickstart guide, a pointer to get going and excited. It is a place for Newbies to look and gain confidence, and make things. Get involved and learn. Good examples to simply use (and with luck considered later), just standard routines.

If my manager came up and said, "Eric, we stock a 4k device, the next part up is $5 and that will take us under", then I would be fantastically interested in this thread. But If I had just found this site, looking for information, I might be scared away by the academic discourse of the moment. I'm looking for gentle introductions to do this, do that. If the device cannot divide by 10, and I cannot comprehend the details that experienced people are debating, then I'm off. Another Newbie scared away.

Engineering is about compromises."Gerald!, if you can do it in a day just get on with it.". Versus, "A new part in our inventory, can't you make it smaller". Such pragmatic stuff is key to success, but we have to be careful about introducing it here (apart from sowing thoughts).

I hope my opinion is understood in a positive sense. It is only one view among so many good people.
Cheers All.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Frankly, I don't buy that "negative remainder isn't correct" argument.

It's a non-standard representation. So are improper fractions. But just like improper fractions, the result is numerically dead-on.

AFAIC, 25 divided by 4 can be perfectly legally expressed as 6 remainder 1. And that is exactly identical to 5 remainder 5, or 7 remainder -3.

The only possible result from RetroDan's expression would still be guaranteed to be within 1 LSB of the true result. And the "traditional approach" cannot offer any better than getting within 1 LSB either.

But if it really bothers you, then you could always test the sign bit of the remainder... if it is set, then subtract 1 from the quotient and add 10 to the remainder.

DIV10: LDI B,51 
   MUL A,B 
   INC R1   ;R1=A/5 
   LSR R1   ;R1=(A/5)/2 = A/10 
; Luke's insertion 
   MOV C, R1 
   LDI B, 10 
   MUL C, B 
   SUB A, R0 
   ; 'A' contains the (signed) remainder. 
   ; sign of the remainder indicates whether it is an: 
   ; - under-estimate (positive remainder), or 
   ; - over-esitmate (negative remainder) 
   ; 'C' contains the quotient 
   ; We can return here with a totally valid and numerically correct result.
   ; But for the sticklers among us, let's convert it
   ; to a "cannonical form".
   SBRS A, 7 ; test the sign of result.
   RJMP PositiveRemainder
; We have a negative remainder; subtract 1 from
; the quotient and add 10 to the remainder.
   DEC C
   ADD A, B
PositiveRemainder
; The result is now in perfect "cannonical" form.
   RET

And given the limited range of values we're concerned with here, that result (after a single check of the sign on the remainder, and possibly a single subtraction of 1 / addition of 10) will be in the "cannonical" form. (Ie. a guaranteed underesitmate in the result, and a remainder that is guaranteed to be positive, and bound between 0 and 9.)

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Broxbourne: I agree 100%. The give and take on this subject should be hashed out in another forum. Then and ONLY then should it become a tutorial. After all, what is a beginner to think when seeing the differing opinions. Most beginners are in no position to judge who is right or wrong on the details. I would like to see it here once that is done.

Rick

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Broxbourne wrote:
I'm going to start with an apology here, because I might be thinking differently, and appear to be stiffling a very simulating and valuable discussion. However as a tutorial, I feel this thread is misplaced. It should be in the main forum.

I hope my opinion is understood in a positive sense. It is only one view among so many good people.
Cheers All.

Thanks for the feed-back.

You may not know this, but actual tutorials are labeled [TUT] by moderator to distiguish them from other threads. The moderator has labeled this thread [DIS] for discussion. Another label is [CODE] for code segments that are not tutorials. I think this information can be found in the "Sticky-the-Bear" threads at the top of the list.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Dan, Thank you for the information. I did not realize the moderator was adjusting the titles of the thread according to the intended audience. I just thought this was the place to find tutorials and good examples of code and AVR applications.
Thanks for clarifying that. I'll make sure I only read [TUT] postings from now on to avoid getting confused by the more heady academic stuff, although I am still surprised to see such specialised discussion in a beginner forum.
Practically Yours.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Hi Luke,

Quote:
sorry, lfmorrison's signed remainder isn't always mathematically correct - "we should round anything from 0.0 up to 0.49999999... as 0, and anything from 0.500000000...01 up to 0.9999999999999999... as 1"

definitely didn't come through quite as intended (sorry about that):

Quote:
Frankly, I don't buy that "negative remainder isn't correct" argument

You are right and I phrased it badly: of course the signed remainder is correctly calculated (cleaner than in my example, too - 3 or four cycles saved).

It's the 'inbuild rounding' whose threshold is off by one, mathematically speaking ( :) ):

The quotient 'n.5' (6.5 for example if dviding 65/10) should already be rounded up to n+1 ('7'), here it is still left at '6'.
66/10 (6.6) is correctly rounded to '7' as 64/10 (6.4) becomes a rounded '6'. It's just the '.5' rounding which doesn't quite work (aren't those mathematicians a fascinating breed?).

Sorry again - apologies to Broxbourne, too, guess we got carried away a little here ...

Andreas

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Thank you all for the wonderful feed-back. The information was invaluable and probably saved me a lot of head-scratching as you'll soon find out.

I was pleased-as-punch that my "quirky" little division by 10 routine worked. I was even more delighted to find out that it did a pretty good job of rounding to boot. However, when I got around to using it for my original purpose, I realized that rounding was the last thing I wanted.

The original use for the routine was to break-down an unsigned single byte for display on the Butterfly LCD. Well I was getting results like: 74, 75, 76, 77, 80, 80, 80, 81, 82, 83. Obviously the little DIV10 routine was working over-time and rounding, so a few fine adjustments were needed to "UN-ROUND" the results.

I know that you guys have already worked out your own versions, here is mine:

;-----------------------------------;
; RETRO (SYNTHETIC) DIVISION BY 10  ;
; ANSWER IN R1, R0=REM, A:PRESERVED ;
;-----------------------------------;
DIV10:  PUSH B
        LDI  B,26   ;MUL BY 26
        MUL  A,B    ;R1=A/10
        PUSH R1     ;BRUTE-FORCE CALC OF REMAINDER      
        LDI  B,10   ;CALC REM
        MUL  R1,B   ;R0=10xR1(QUOT)
        POP  R1     ;RESTORE QUOT
        SUB  R0,A   ;SUBTRACT REMx10
NODJST: NEG  R0     ;MAKE POSITIVE
         BRPL NONEG ;STILL NEG?
        ADD  R0,B   ;OOPS MAKE 
        DEC  R1     ;ADJUSTMENTS
NONEG:   RET

If you're wondering why I calculate the remainder "backwards" then negate it, it's to preserve the value of A. If there's no need to preserve A then the remainder could be calculated the other-way-around and eliminate the need for the NEG R0 line. However, you should be forewarned that "letting go" a NEG R0 without very good cause could lead to a charge of racism! Other than removing that one statement I don't know if it can be optimized any further, but it sure would be nice.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

this topic got me thinking a while ago.

I really liked the super simple idea of a divide by ten routine, but didn't like the error it has. So i was thinking a little further about the routine with the multiplier.

I was thinking the error came because of the difference between 26 and 25.6, and started thinking about a way to make this error smaller, so i came up with the idea of 65536/10 which would make 6553.6 or 6554 which has less error.

I wrote a test program on the ti83 to verify my idea, and found that it is correct up to 14 bits.
listing:

:ClrHome
:0->G
:0->F
:6554->B
:For(A,1,20000)
:If (iPart(A/10)=iPart((A*B)/65536))
:Then
:G+1->G
:Else
:F+1->F
:If F<4
:Output(F+2,1,A
:End
:Output(1,1,G
:Output(2,1,F
:End

This told me the first error would occur at 16389.

So up to 14 bit accurate!
Enough for a 8 bit number

I wrote me some asm and came up with:

div10:
; put your value in temp
;	ldi temp, 11					 ; example value is in temp
ldi mul_constL, low(6554)
ldi mul_constH, high(6554)			; load multiply value
mul temp, mul_constL				; mul lower
movw result_M2:result_L2, R1:R0		; store
mul temp, mul_constH				; mul higher
clr	result_H2						
add result_M2, R0					; add value
adc result_H2, R1
ret
; result is now in result_H2

The code above is hogging up to much vars. So i optimized it a little and came up with:

div10:
; input value in temp
ldi tempL, low(6554)
ldi tempH, high(6554)				; load multiply value
mul temp, tempL						; mul lower
mov tempL, R1						; store
mul temp, tempH						; mul higher
clr tempH						
add tempL, R0						; add value
adc tempH, R1
ret
; result is now in tempH

this is better. :D

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I tried running your 8 bit division code:

Quote:
;***************************************************************************
;* "div8u" - 8/8 Bit Unsigned Division
;*
;* This subroutine divides the two register variables "dd8u" (dividend) and
;* "dv8u" (divisor). The result is placed in "dres8u" and the remainder in
;* "drem8u".
;*
;* Number of words :14
;* Number of cycles :97
;* Low registers used :1 (drem8u)
;* High registers used :3 (dres8u/dd8u,dv8u,dcnt8u)
;***************************************************************************

.def drem8u =r15 ;remainder
.def dres8u =r16 ;result
.def dd8u =r16 ;dividend
.def dv8u =r17 ;divisor
.def dcnt8u =r18 ;loop counter

div8u: sub drem8u,drem8u ;clear remainder and carry
ldi dcnt8u,9 ;init loop counter
d8u_1: rol dd8u ;shift left dividend
dec dcnt8u ;decrement counter
brne d8u_2 ;if done
ret ; return
d8u_2: rol drem8u ;shift dividend into remainder
sub drem8u,dv8u ;remainder = remainder - divisor
brcc d8u_3 ;if result negative
add drem8u,dv8u ; restore remainder
clc ; clear carry to be shifted into result
rjmp d8u_1 ;else
d8u_3: sec ; set carry to be shifted into result
rjmp d8u_1

When I build this code, it gives me an error: "Register already defined by the .def directive". The above code refers to R16 with two different labels which should be okay. Is there a reason why it gives me this error?

Thanks,
John

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I've gotten this error recently as well. I think that it used to work, but it has changed in recent versions of AVR Studio. Not sure if this is a bug or a deliberate change.

Regards,
Steve A.

The Board helps those that help themselves.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

First, let me say that I have no clue about AVR coding. An Arduino board is in the mail, but I haven't gotten it yet. I love digital math, though. So here's a little different take on the OP's trick.

Dividing is the same thing as multiplying by the inverse, so dividing by 10 is the same as multiplying by 0.1. The problem is that there isn't a neat binary representation of 0.1. But it turns out that you can use a good enough 16-bit representation of 0.1 to get accurate results (meaning the same number as division by 10, rounded down as expected). With some simple fixed-point math you can then do your multiply, discard the mantissa (the bit after the . (or , for Europeans)) and there you go.

What fixed-point format to use? It'll be obvious in a minute. Let's just assume 8 bits of mantissa for now and start finding the binary representation of 0.1d (decimal). Really what you're looking for is which powers of -2 you need to add up to get as close to 0.1 as possible. If you do that, you find:
1/10 =~ 0/2 + 0/4 + 0/8 + 1/16 + 1/32 + 0/64 + 0/128 + 1/256
the binary representation for this is:
00000000.00011001

So that's the 16-bit number you should multiply the original number by to divide by 10. But you'll notice we can use extra mantissa bits, as long as the top byte is 0. (If the top byte isn't 0 we might get overflow when we multiply.) So let's use 11 mantissa bits:
1/10 =~ 0/2 + 0/4 + 0/8 + 1/16 + 1/32 + 0/64 + 0/128 + 1/256 + 1/512 + 0/1024 + 1/2048
or 00000.00011001101

Here it is all nicely coded up in C. I've tested it for 0-255 and it gives the same result as C's /10 operation for every one. gcc did some awful things to the assembly, but I'll leave that part up to you guys because you're much better at it than I am.

uint8_t div10_fp(uint8_t number)
{
    // 1/10th with 11 mantissa bits
    // 1/10th = 0/2 + 0/4 + 0/8 + 1/16 + 1/32 + 0/64 + 0/128 + 1/256 + 1/512 +
    // 0/1024 + 1/2048
    // 0.00011001101
    uint16_t tenth = 0xcd;
    // result with 11 mantissa bits
    uint16_t result = number * tenth;
  9a:   99 27           eor     r25, r25
  9c:   2d ec           ldi     r18, 0xCD       ; 205
  9e:   30 e0           ldi     r19, 0x00       ; 0
  a0:   82 9f           mul     r24, r18
  a2:   a0 01           movw    r20, r0
  a4:   83 9f           mul     r24, r19
  a6:   50 0d           add     r21, r0
  a8:   92 9f           mul     r25, r18
  aa:   50 0d           add     r21, r0
  ac:   11 24           eor     r1, r1
  ae:   ca 01           movw    r24, r20
    // Drop mantissa to round down.
    return result >> 11;
  b0:   89 2f           mov     r24, r25
  b2:   99 27           eor     r25, r25
  b4:   86 95           lsr     r24
  b6:   86 95           lsr     r24
  b8:   86 95           lsr     r24
}
  ba:   99 27           eor     r25, r25
  bc:   08 95           ret

You can do the same thing for dividing by any constant, and I'm actually kind of surprised gcc doesn't just do this for me.

Tim

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

That is a great idea froody!

I had a look at your code, and optimized it a bit.

; the input number must be in register "input"
; value/10 is in register "result"
; register temp1 and R1:R0 are clobbered
div10:
	ldi temp1, 205
	mul temp1, input
	lsr R1
	lsr R1
	lsr R1
	mov result, R1
	ret

edit...

i have written a little test program in ti basic.

:ClrHome 
:0->G 
:0->F 
:For(N,0,1200) 
:If (iPart(N/10)=iPart((N*205)/2048)) 
:Then 
:G+1->G 
:Else 
:F+1->F 
:If F<5 
:Output(2+F,1,N 
:End 
:Output(1,1,"G="
:Output(1,3,G
:Output(2,1,"F="
:Output(2,3,F
:End 

It shows that it is correct up to 1028 and it fails at 1029, more than enough for 8 bit

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

brum, wow. gcc really sucks, huh? Nice job on optimizing the code. I'm going to install gcc 4.1 soon. Hopefully it won't suck as much.

Meanwhile, I clarified my original post a bit, and wrote some more, and now it's turned into a little article:

How to divide an unsigned 8-bit integer by a constant.
(c) 2006 by Tim Newsome
Thanks to RetroDan and several other on avrfreaks.net for the inspiration.

Dividing is slow, so let's avoid that. Instead, multiply by the inverse. You can write the inverse as a fixed-point binary number. Just like each digit in a normal binary number represents a power of 2, each digit after the decimal point in a binary number represents a power of -2. So:
10.11 = 1 * 2^1 + 0 * 2^0 + 1 * 2^-1 + 1 * 2^-2 = 2 + 1/2 + 1/4 = 2.75
Example: 0.5 = 1/2 = 0.1 (binary, 1-bit mantissa)
Example: 0.25 = 0/2 + 1/4 = 0.01 (binary, 2-bit mantissa)

Unfortunately there isn't a precise binary representation of most fractions, so you have to use an approximation. The same problem happens in decimal numbers, just not as often. For instance, 1/3 can be approximated as 0.3, which has an error of 0.3-1/3=-1/30. A better approximation is 0.33, which has an error of -1/300. The more digits you add, the smaller the error becomes. The same thing is true in binary:
1/3 =~ 1/2 = 0.1, error = 0.167
1/3 =~ 0/2 + 1/4 = 0.01, error = -0.0833
1/3 =~ 0/2 + 1/4 + 1/8 = 0.011, error = 0.0417
1/3 =~ 0/2 + 1/4 + 0/8 + 1/16 = 0.0101, error = -0.0208

Because we can't always find the precise inverse, we need to find one that's close enough, and gives us the right answer by rounding down always. (Rounding down just means ignoring the mantissa bits, which is easy to do.) That means we want to find an approximation that is at least as big as the real inverse. In other words, the error (computed as approximation minus real value) has to be positive. Furthermore, the error multiplied by 255 (our largest input value) must be smaller (less negative) than 1/divisor. For dividing by 3, the error must lie between 0 and 1/(3*255). Let's rework 1/3 keeping the error positive, and seeing how many bits we need before the error is small enough that we can use the approximation:
1/3 =~ 1/2 = 0.1, error = .167
1/3 =~ 1/2 + 0/4 = 0.10, error = .167
1/3 =~ 0/2 + 1/4 + 1/8 = 0.011, error = .0417
1/3 =~ 0/2 + 1/4 + 1/8 + 0/16 = 0.0110, error = .0417
1/3 =~ 0/2 + 1/4 + 0/8 + 1/16 + 1/32 = 0.01011, error = .0104
1/3 =~ 0/2 + 1/4 + 0/8 + 1/16 + 1/32 + 0/64 = 0.010110, error = .0104
1/3 =~ 0/2 + 1/4 + 0/8 + 1/16 + 0/32 + 1/64 + 1/128 = 0.0101011, error = .00260
1/3 =~ 0/2 + 1/4 + 0/8 + 1/16 + 0/32 + 1/64 + 1/128 + 0/256 = 0.01010110, error = .00260
1/3 =~ 0/2 + 1/4 + 0/8 + 1/16 + 0/32 + 1/64 + 0/128 + 1/256 + 1/512 = 0.010101011, error = .000651

So 9 bits of mantissa is enough, and our inverse is 0.010101011. Here is some C code that takes advantage of this:

uint8_t div3(uint8_t number)
{
    // Compute the intermediate value, with 9 mantissa bits.
    uint16_t intermediate = (uint16_t) number * 0xab;
    // Return the result, chopping off the mantissa bits.
    return intermediate >> 9;
}

A good compiler should compile that down to a single multiply instruction, a shift, and maybe a few extra instructions. This is drastically faster than using the generic divide code.

Let's do one more example, for division by 10. First figure out the inverse using the fewest number of mantissa bits:
1/10 =~ 1/2, error = 0.400
1/10 =~ 0/2 + 1/4, error = 0.150
1/10 =~ 0/2 + 0/4 + 1/8, error = 0.025
1/10 =~ 0/2 + 0/4 + 1/8 + 0/16, error = 0.025
1/10 =~ 0/2 + 0/4 + 1/8 + 0/16 + 0/32, error = 0.025
1/10 =~ 0/2 + 0/4 + 0/8 + 1/16 + 1/32 + 1/64, error = .00938
1/10 =~ 0/2 + 0/4 + 0/8 + 1/16 + 1/32 + 0/64 + 1/128, error = .00156
1/10 =~ 0/2 + 0/4 + 0/8 + 1/16 + 1/32 + 0/64 + 1/128 + 0/256, error = .00156
1/10 =~ 0/2 + 0/4 + 0/8 + 1/16 + 1/32 + 0/64 + 1/128 + 0/256 + 0/512, error = .00156
1/10 =~ 0/2 + 0/4 + 0/8 + 1/16 + 1/32 + 0/64 + 0/128 + 1/256 + 1/512 + 1/1024, error = .000586
1/10 =~ 0/2 + 0/4 + 0/8 + 1/16 + 1/32 + 0/64 + 0/128 + 1/256 + 1/512 + 0/1024 + 1/2048, error = .0000977
Finally we've reached the point where 255 * error is less than 1/10. So we need 11 mantissa bits, and our inverse is 0.00011001101. Here's the C code:

uint8_t div10(uint8_t number)
{
    // Compute the intermediate value, with 11 mantissa bits.
    uint16_t intermediate = (uint16_t) number * 0xcd;
    // Return the result, chopping off the mantissa bits.
    return intermediate >> 11;
}

So there you have it. Fast division by a constant of an unsigned 8-bit integer. If you followed what I've done here, you should be able to extend this to 16-bit or any other sized number.

Last Edited: Sun. Oct 22, 2006 - 10:50 PM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Froody,

Good article. However, in your code examples don't you need an explicit cast to the "number" parameter to force GCC to use 16-bit operations? I would expect GCC to do the multiplication as 8-bit first, then implicit cast to a 16-bit value when storing.

- Dean :twisted:

Make Atmel Studio better with my free extensions. Open source and feedback welcome!

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

abcminiuser wrote:
Good article. However, in your code examples don't you need an explicit cast to the "number" parameter to force GCC to use 16-bit operations? I would expect GCC to do the multiplication as 8-bit first, then implicit cast to a 16-bit value when storing.

Thanks. Good point about the typecast. I think it's OK, because the hex constant is implicitly of type int, which is 16 bits, on AVR (right?). So it all works out. But it certainly is safer to add the typecast, so I'll edit my post and stick it in there.

Tim

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Quote:
You can do the same thing for dividing by any constant, and I'm actually kind of surprised gcc doesn't just do this for me.
...
gcc really sucks, huh?

I don't know how you can come to this conclusion from this thread. Up until your post this entire thread has been about assembly language and therefore has absolutely nothing to do with gcc. Also, the C language has absolutely no knowledge about fixed point numbers, so how could you possibly expect it to optimize your code in this manner?

Regards,
Steve A.

The Board helps those that help themselves.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Koshchi wrote:
Quote:
gcc really sucks, huh?

I don't know how you can come to this conclusion from this thread. Up until your post this entire thread has been about assembly language and therefore has absolutely nothing to do with gcc. Also, the C language has absolutely no knowledge about fixed point numbers, so how could you possibly expect it to optimize your code in this manner?

In the first post you quoted, I showed some C code among the assembly code. You can see that gcc's generated code is really inefficient. For instance it does a mul by a register known to be 0. (Note that I've since installed gcc 4.1.1 and it does a lot better, although there are still some obvious optimizations possible.)

Look at this C code:

uint8_t function(void) {
    volatile uint8_t val = 123;
    return val / 10;
}

When I compile it, I get a call to a general divide routine. But the compiler knows I'm dividing by a constant, so it could do the exact optimization I outlined. Effectively it would rewrite the code to look like:

uint8_t function(void) {
    volatile uint8_t val = 123;
    return ((uint16_t) val * 0xcd) >> 11;
}

That is actually slightly larger than the original code, but a lot faster. If you really care about code size you could just use the div10() function. Then function() would be 1 instruction shorter (because you don't have to pass 10 to the divide function) but you do take the size penalty of div10() (10 instructions in my gcc 4.1.1 build).

Tim

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Try removing the "voltatile" keyword. You'll get a far better output.

There are pointy haired bald people.
Time flies when you have a bad prescaler selected.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

daqq it's an old thread, nearly 3/4 of an year old
:)

/Safari

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Still, I'm glad he brought it back since this will help me.

Edward

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Nothing like a nice jolt when I saw who the OP was! :shock: Then I realized it's an old thread. :P

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Wouldn't it be nice to not deal with the rounding error? If we were to add a compensation amount before multiplying by 26, we then have no error:

.def temp1 = r16
.def temp2 = r17
.def A = r18

// Input: A
// Output: r1
DivBy10:

	// calculate the amount needed to add to A to compensate for the rounding error to come 
	// compensation = (A + 3) / 64
	mov temp2, A
	ldi temp1, 3
	add temp2,temp1
	// "divide by 64" shortcut
	swap temp2
	andi temp2, 15
	lsr temp2
	lsr temp2
	
	// now use 'A - compensation'
	mov temp1, A
	sub temp1, temp2

	// Multiply by magic number
	ldi temp2, 26
	mul temp1, temp2

	ret

I've checked it out on my computer and seems like this works good, and doesn't eat many more cycles than the original.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Quote:

; the input number must be in register "input"
; value/10 is in register "result"
; register temp1 and R1:R0 are clobbered
div10:
   ldi temp1, 205
   mul temp1, input
   lsr R1
   lsr R1
   lsr R1
   mov result, R1
   ret 

This one rocks! Saved my day, today :)

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

i suppose one could allocate a huge look up table ln FLASH on the larger AVR range ... practically instant result which i have done many many years ago... (code lost to time)

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Thanks to all for alively discussion. I need help on a 16 bit divide:
I have been using the DIV16u straight from apnote 1011 (AVR200) since 1998, and found a possible error.

When you have $7080 (28,800) divided by $3841 (14,401), the correct answer is: $0001 and remainder $FFF6 (1.99986).
However, the Atmel DIV16u returns $0001 remainder $383F, quite an error.

When you have $7080 (28,800) divided by $3840 (14,400), the correct answer is: $0002 and remainder $0000 (2.00000), and the Atmel code returns the correct same answer.

Please help, I need a fix fast, or am I wrong?

Vern Bunch, AVR consultant and AVR designer

<p>Vern</p>

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

vbunch wrote:
When you have $7080 (28,800) divided by $3841 (14,401), the correct answer is: $0001 and remainder $FFF6 (1.99986).
However, the Atmel DIV16u returns $0001 remainder $383F, quite an error.

Dude, WTF?

Let's do this manually, by long division.

$3841 goes into $7080 once.

(hex assumed)
      ____1___
3841 ) 7080
       3841 -
       ----
       383F

$3841 won't go into $383F, so the quotient is 1 and the remainder is $383F. The Atmel routine returns the correct answer.

The number $FFF6 might possibly occur in some floating point representation, but can never happen in unsigned integer math, since it is larger than the dividend.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Oh, I see where you got it from.

If you want to generate the fractional bits, take the remainder, multiply by 16 (shift left 4) and divide this by the divisor. Repeat until you have enough "(hexa)decimal" places.

      ____1FF...
3841 ) 7080
       3841 -
       ----
       383F <<4
       3841 * F -
       ----
        3821 <<4
        3841 * F -
         ----
         3641
          ... etc

Again, this is just confirming that the original division routine got it right. You didn't carry out enough steps to format the answer the way you expected.

Note that since the first remainder is guaranteed to be less than the divisor, each successive step is guaranteed to produce a quotient less than 16, ie a single hex digit. You have to manually concatenate these single digits to the number of places you want.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Thank you my friends for your correct responses. It is my fault that I mentioned "remainder" when I expected a fully usable result, without having to divide again, and one that had 16 bit accuracy. So the Atmel division routine is "correct" but unusable without more computation. Since there was no mention that the result is unusable without further division, I niavely skipped the true math definition of "remainder" and went right to "answer", I should not have done that.

I have since written an assembly language routine that gives me a correct (and directly usable) 16 bit division in one run, and it is less time (156 cycles) then the published 16 bit divide routine. It took 3 weeks at 104 hours a week, and I'm way beyond tired. :D

<p>Vern</p>

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Quote:

I have since written an assembly language routine that gives me a correct (and directly usable) 16 bit division in one run, and it is less time (156 cycles) then the published 16 bit divide routine. It took 3 weeks at 104 hours a week, and I'm way beyond tired

I guess it's too late to say that you could have "borrowed" the source for a / routine from an open source C compiler?

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Quote:
I guess it's too late to say that you could have "borrowed" the source for a / routine from an open source C compiler?

Actually I suspect in this case he could not, because the answer he wanted was in a fixed point format. He got simple integer division from the appnote. The compiler would have given him full float, which would challenge me to adapt to a fixed format in assembler (and I consider myself rather good at it).

As a matter of interest, if we did look at the source for a / routine in an open source C compiler - let's say, and why not, the GCC compiler - would we be looking at assembler, or C?

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Quote:

would we be looking at assembler, or C?

Assembler. Here's a snippet from gcc/config/avr/libgcc.s:

/*******************************************************
       Division 16 / 16 => (result + remainder)
*******************************************************/
#define	r_remL	r26	/* remainder Low */
#define	r_remH	r27	/* remainder High */

/* return: remainder */
#define	r_arg1L	r24	/* dividend Low */
#define	r_arg1H	r25	/* dividend High */

/* return: quotient */
#define	r_arg2L	r22	/* divisor Low */
#define	r_arg2H	r23	/* divisor High */
	
#define	r_cnt	r21	/* loop count */

#if defined (L_udivmodhi4)
	.global	__udivmodhi4
	.func	__udivmodhi4
__udivmodhi4:
	sub	r_remL,r_remL
	sub	r_remH,r_remH	; clear remainder and carry
	ldi	r_cnt,17	; init loop counter
	rjmp	__udivmodhi4_ep	; jump to entry point
__udivmodhi4_loop:
        rol	r_remL		; shift dividend into remainder
	rol	r_remH
        cp	r_remL,r_arg2L	; compare remainder & divisor
	cpc	r_remH,r_arg2H
        brcs	__udivmodhi4_ep	; remainder < divisor
        sub	r_remL,r_arg2L	; restore remainder
        sbc	r_remH,r_arg2H
__udivmodhi4_ep:
        rol	r_arg1L		; shift dividend (with CARRY)
        rol	r_arg1H
        dec	r_cnt		; decrement loop counter
        brne	__udivmodhi4_loop
	com	r_arg1L
	com	r_arg1H
; div/mod results to return registers, as for the div() function
	mov_l	r_arg2L, r_arg1L	; quotient
	mov_h	r_arg2H, r_arg1H
	mov_l	r_arg1L, r_remL		; remainder
	mov_h	r_arg1H, r_remH
	ret
	.endfunc
#endif /* defined (L_udivmodhi4) */

#if defined (L_divmodhi4)
	.global	__divmodhi4
	.func	__divmodhi4
__divmodhi4:
	.global	_div
_div:
        bst     r_arg1H,7	; store sign of dividend
        mov     __tmp_reg__,r_arg1H
        eor     __tmp_reg__,r_arg2H   ; r0.7 is sign of result
	rcall	__divmodhi4_neg1 ; dividend negative : negate
	sbrc	r_arg2H,7
	rcall	__divmodhi4_neg2 ; divisor negative : negate
	rcall	__udivmodhi4	; do the unsigned div/mod
	rcall	__divmodhi4_neg1 ; correct remainder sign
	tst	__tmp_reg__
	brpl	__divmodhi4_exit
__divmodhi4_neg2:
	com	r_arg2H
	neg	r_arg2L		; correct divisor/result sign
	sbci	r_arg2H,0xff
__divmodhi4_exit:
	ret
__divmodhi4_neg1:
	brtc	__divmodhi4_exit
	com	r_arg1H
	neg	r_arg1L		; correct dividend/remainder sign
	sbci	r_arg1H,0xff
	ret
	.endfunc
#endif /* defined (L_divmodhi4) */

Pages