## [DIS] [ASM] Dirty Math Tricks: Adventures in Division by Ten

62 posts / 0 new
Author
Message

TITLE: DIRTY DAN's DIRTY MATH TRICKS

PURPOSE: On Going Tutorial on Math Tricks in AVR Native Machine Language for Beginners.

PREAMBLE: While trying to help someone in another FORUM with a Division-by-60 routine to converting seconds to minutes, someone made the remark:

Quote:

What RetroDan has is a quick and dirty approximation!

I think it was meant as an insult, but it gave me a great name for the type of binary routines I love: Dirty Math!

Quick & Dirty Routines that are small and super fast.

CAVEAT LECTOR: If you don't like taking chances and experimenting and don't really care about how big your programs are, or how fast they excecute then this is not a place for you. May I recommend the Atmel Appnotes #200, 201 and 204 that are chock full great standard math routines for the AVR microcontrollers.

Last Edited: Fri. Mar 31, 2006 - 09:40 AM

TITLE: DIRTY DAN's MATH TIPS

INTRODUCTION:

Here are a few quick tips before we start:

`DIRTY DAN'S TIP #1: ALWAYS USE A POWER OF TWO `

If you are writing any program or routine, even if the savings are not obvious at first, pick numbers that are powers-of-two. Somewhere down the line you'll probably be glad you did. Make it a habit and your microcontroller will thank you too.

For example: if you need to take an average, don't take 10 readings and divide by 10, take 8 or 16. If you're setting up a table, don't make entries 10 bytes long, make them 8 or 16. It will greatly simplify things later. There's nothing magical about powers-of-ten to a microprocessor. They're set up for powers-of-two and the number of fingers we have means absolutely nothing to them.

`DIRTY DAN'S TIP #2: AVOID DIVISION AT ALL COSTS! `

While Addition, Subtraction and Multiplication can be done in just a few clock cycles, most division involves looping and uses a lot of processor time. Division is normally done by repeated subtractions, the "standard" 16x16 division routine from the Atmel Appnotes averages over 250 clocks cycles compare that to just 9 for a 16x16 multiply and just 2 for a 16x16 Addition.

One problem with high-level languages is that this dirty little fact is kept hidden from the programer. So while you're happily writing code that does Bilinear Interpolation or Cubic Splines, your processor is swearing at you under his breath everytime you ask for a division. You're wondering why things are grinding to a halt with smoke drifting up from the PCB.

`DIRTY DAN'S TIP #3: IF YOU MUST DIVIDE, DO IT BY POWERS OF TWO! `

Division by a power of two can be easily and quick accoplished with just a few right-shifts. Each time you shift your bits to the right you are essentially dividing your number in half. So to divide by four you just shift-right twice.

`DIRTY DAN'S TIP #4: IF YOU MUST DIVIDE, DO IT ONLY ONCE! `

If your equation has multiple divisions involved, try and re-arrange your equation so you only divide once.

For example: if you need to divide by 10 and later divide by 12, perhaps you can re-arrange things and just divide once by 120 and be done with it.

Last Edited: Thu. Apr 13, 2006 - 08:11 AM

TITLE: DIRTY DAN"s MATH TRICKS: DIVISION-BY-TEN

PREAMBLE:

I've been working on my own driver routines for the LCD Display on the Butterfly (same as STK502) that are super small and fast. When it came time to write a little routine that displays a number in decimal, rather than Hex, I ran into my old nemesis DIVISION!

To be more precises, Division-by-Ten. Since we humans are so fond of Division-by-Ten, I thought it would make an excellent starting point for our examination of my Dirty Math Tricks.

For this routine, all we want is to take a single byte value and display it on the LCD Display in Decimal Number Form. So we need to take a number with a maximum value of \$FF and break it into three: 2 / 5 / 5.

The way most of us humans would do this is to first divide by it by 100 and then by 10 and then keep the remainder;

```255 / 100 = 2     (remainder of 55)
55  / 10  = 5
remainder = 5
```

Oh no, that means we need to have two division routines! One for division by 100 and another by ten. Mybe we should just use a general division routine and change the divisor from 100 to 10 rather than have two routines. So I go and check the Atmel Appnote 200 looking for a general 8x8 unsigned division routine and here is what I found:

```;***************************************************************************
;* "div8u" - 8/8 Bit Unsigned Division
;*
;* This subroutine divides the two register variables "dd8u" (dividend) and
;* "dv8u" (divisor). The result is placed in "dres8u" and the remainder in
;* "drem8u".
;*
;* Number of words	:14
;* Number of cycles	:97
;* Low registers used	:1 (drem8u)
;* High registers used  :3 (dres8u/dd8u,dv8u,dcnt8u)
;***************************************************************************

.def	drem8u	=r15		;remainder
.def	dres8u	=r16		;result
.def	dd8u	=r16		;dividend
.def	dv8u	=r17		;divisor
.def	dcnt8u	=r18		;loop counter

div8u:	sub	drem8u,drem8u	;clear remainder and carry
ldi	dcnt8u,9	;init loop counter
d8u_1:	rol	dd8u		;shift left dividend
dec	dcnt8u		;decrement counter
brne	d8u_2		;if done
ret			;    return
d8u_2:	rol	drem8u		;shift dividend into remainder
sub	drem8u,dv8u	;remainder = remainder - divisor
brcc	d8u_3		;if result negative
clc			;    clear carry to be shifted into result
rjmp	d8u_1		;else
d8u_3:	sec			;    set carry to be shifted into result
rjmp	d8u_1
```

Only 14 program steps is not bad. Almost 100 clock cycles per call and we need to call it twice fort a single byte conversion. That's about 200 clock cycles each. I'm sure we can do better than that...

Last Edited: Thu. Apr 13, 2006 - 08:13 AM

TITLE: DIRTY DAN"s MATH TRICKS: DIVISION-BY-TEN (Continued)

Previously we divided our 1 byte number by 100 then by 10 and when it comes time to expand our routine to handle 16 bit numbers we'll need to divide by 10,000, 1,000, 100 and finally by 10. At approx. 100 cycles per call that's 400 clock cycles per number.

Maybe we should look at the problem again from the "other side" the right-hand-side instead of left.

```255 / 10 = 25   Remainder = 5
25  / 10 = 2    Remainder = 5
2   / 10 = 0    Remainder = 2```

Well we've managed to extract the 255 again, but now we need Three calls to the division routine and for a 16 byte number we'll need Five. The only thing we have going for us using this method is that if we can find a "special" division routine for Division-by-Ten that is smaller and/or faster than the Atmel routine, we might have something we can work with...

Last Edited: Fri. Mar 31, 2006 - 05:50 AM

TITLE: DIRTY DAN"s MATH TRICKS: DIVISION-BY-TEN (Continued)

I remember someone posting a file of Division-by-a-Constant Routines, so why don't we see if we can get luck there.

Here is what I found, exactly what we're looking for, a specialized routine for Division-by-Ten:

```;***************************************************************************
;* Function "Div16_10"
;* Divides an unsigned 16 bit word (XH:XL) by 10
;* Returns quotient in YH:YL and remainder in XL
;*
;* Author: Andreas Lenze (andreas.lenze@t-online.de)
;* Equations by D: W. Jones:
;*
;*	Reciprocal mul w. extra precision:
;*	unsigned int A;
;*	unsigned int scaled_reciprocal = 0xCCCD;
;*	unsigned int Q; /* the quotient */
;*
;*	Q = ((A * 0xCCCD) >> 19)
;*
;* Uses: high regs: 7 (r17, r18, r19, X, Y)
;*	 low regs:  3 (r0, r1, r2)
;*
;*	 words:     36 (w. push/pop = 8 words)
;*	 cycles:    46 (w. push/pop = 16 cycles)
;*
;* Note: Hardware multiplier required ("mul" instruction)
;*
;***************************************************************************

Div16_10:
push	r2
push	r19
push	r18
push	r17

ldi	YH,0xCC		; scaled reciprocal for /10
ldi	YL,0xCD

; Q = A * 0xCCCD
; (r19:r18:r17[:rXX] = XH:XL * YH:YL)
clr     r2
mul     XH, YH		; ah * bh
movw    r19:r18, r1:r0
mul     XL, YL		; al * bl
mov	r17,r1		; r0 to [rXX] is superfluous
mul     XH, YL		; ah * bl
mul     YH, XL		; bh * al

; Q = Q >> 16: use r19:r18 as word
; Q = Q >> 3
lsr	r19		; do the last 3 shifts
ror	r18
lsr	r19
ror	r18
lsr	r19
ror	r18

; r19:r18 now "Q" (= result >> 19)
; R = A - 10*Q;
ldi	r17,10		; multiply r19:r18 by 10
mul     r18, r17	; al * bl
sub	XL,r0
clr	XH
movw	YL,r18		; make copy of Q

; XL holds "R"
; YH:YL holds "Q"
pop	r17
pop	r18
pop	r19
pop	r2

ret
```

Wow, I looks huge! Let's see 46 program steps, that's a little on the heavy side, and 10 registers used ouch!. However, it is a 16 bit Divide-by-Ten Routine and it excecutes in about 1/2 the time that the Atmel 8 bit routine did, so we're moving in the right direction. Two or Three calls to this would be 100 to 150 cycles versus the 200 to 300 using the Atmel Routine.

If we can't come up with something on our own, perhaps we could strip this down to an 8 bit routine, or perhaps just bite the bullet because sooner-or-later we're gonna' need to convert 16 bit values anyway...

Last Edited: Thu. Apr 13, 2006 - 08:18 AM

TITLE: DIRTY DAN"s MATH TRICKS: DIVISION-BY-TEN (Continued)

A while back, while literally sitting-on-the-throne, I happened to realized that 255 is evenly divisable by 5 and that it is only one away from 256 which is a Power-of-Two. After a while, it becomes a habit to always be on-the-lookout for Powers-of-Two. Either that or I'm slowly drifting into the world of padded-rooms and nice doctors in white smocks. Of all the things I've lost, I miss my mind the most.

To divide by Ten we might be able to first Divide-by-Five then Right-Shift to Divide-by-Two and the result would be a Division-by-Ten. I stored that fact away and wondered if the 255-256 link could be put to use sometime.

Pehaps today is a good day to drag it out of the bottom drawer and dust it off...

Let's see: 255 divided by 5 is 51 and 255 is awlful close to 256 and furthermore division by 256 is super-super easy and requires ZERO program steps. The "Dirty Trick" to that is to simply ignore the least significant Byte and we've essentially Divided our number by 256.

Since 255 is almost equal to 256, then 255/256 is almost equal to 1

Since multiplication by one does not change a value
Then 1/5 times 255/256 shoud be pretty close to 1/5

Since division by 256 can be accomplished with the Dirty Math Trick of ignoring the lowest byte, we'll leave that to last and our equation becomes:

1/5 time 255 and later we'll truncate the lower byte.

```1/5 x 255 = 255 / 5 = 51
```

So mathematically speaking if we multiply our number by 51 and later
ignore the lower byte, we should get a pretty good approximation of
Division-by-Five. That's where our first routine will start...

TITLE: DIRTY DAN"s MATH TRICKS: DIVISION-BY-TEN (Continued)

Let's start by writing a routine that muliplies by 51 and we'll look at the high byte and see if our mathematical assumptions were correct.

Here is the code:

```DIV5: LDI B,51
MUL A,B  ;ANSWER LEFT IN R1
RET
```

Great it's only 3 lines long!!! Let's test it's accuracy.
We're interested in a single byte so values can go from zero to 255
so lets check a few numbers in that range and see how it performs:

``` 10 =>  1 (off by -1)
50 =>  9 (off by -1)
100 => 19 (off by -1)
150 => 29 (off by -1)
200 => 39 (off by -1)
250 => 49 (off by -1)
```

Hmm, we're alway off by minus one, so if we add one, we should be right on the money!

So the new Division-by-Five Routine becomes:

```DIV5: LDI B,51
MUL A,B
INC R1 ; ANSWER IN R1
RET
```

So we're half-way-home, all we need to do now is divide our answer by two and we should have a "Quick and Dirty" Divide-by-Ten Routine...

TITLE: DIRTY DAN"s MATH TRICKS: DIVISION-BY-TEN (Continued)

There's two ways we can use our Division-by-Five Routine for a Division-by-Ten Routine.

We can modify the DIV5 routine. Earlier I stated that a "Quick and Dirty" way to Divide-by-Two is to just Shift your bits to the right. This is how we modify our DIV5 Routine to get a DIV10:

```DIV10: LDI B,51
MUL A,B
INC R1   ;R1=A/5
LSR A    ;R1=(A/5)/2 = A/10
RET ```

The other way we could have accomplished the same thing would be to have the DIV10 routine call the DIV5 routine and that way we "kill-two-birds-with-one-stone" because we get two usable routines from it:

```DIV10: RCALL DIV5  ;R1=A / 5
LSR R1      ;R1=(A/5) / 2 = A/10
RET

DIV5:  LDI B,51
MUL A,B
RET```

Just to be on-the-safe-side let's double check the accuracy of these routines:

``` 10 =>  1 Correct!
50 =>  5 Correct!
100 => 10 Correct!
150 => 15 Correct!
200 => 20 Correct!
250 => 25 Correct! ```

So we did it. We have our first "Dirty Math" Routine that is 100% accurate over the range we need.
Stay tuned however, there's still more to this story...

Last Edited: Fri. Mar 31, 2006 - 07:26 AM

TITLE: DIRTY DAN"s MATH TRICKS: DIVISION-BY-TEN (Continued)

PRE-SUMMARY:

Not bad for a days work eh? We took a rather large routine that we needed and reduced it to a "Quick and Dirty" little routine using a few "Dirty Math Tricks".

Let's compare and see how well we did.
First let's compare to the Atmel 8x8 Unsigned Division Routine:

```                 DIV10a   DIV10b   ATMEL
REGISTERS:         2        2        4
DIFFERENCE:      -2       -2        -
PERCENT:       -50%     -50%       -

PROGRAM WORDS:    5         7       15(incl RET)
DIFFERENCE:    -10        -8        -
PERCENT:      -66%      -53%       -

SPEED/CYCLES:     5         8       97
DIFFERENCE:    -92       -89
PERCENT:      -95%      -92%
```

So we've used 1/2 the register space, about 60% less program space and we're running at about 20 times the speed that the Atmel Routine does.

Let compare to the Lenze/Jones Routine:

```                 DIV10a   DIV10b   LENZE
REGISTERS:         2        2       10
DIFFERENCE:      -8       -8        -
PERCENT:       -80%     -80%       -

PROGRAM WORDS:    5         7       36
DIFFERENCE:    -31        -29       -
PERCENT:      -86%       -80%      -

SPEED/CYCLES:     5         8       46
DIFFERENCE:    -41       -38        -
PERCENT:      -89%      -83%       -
```

Against the Lenze/Jones routine we're using 80% less Register Space, about 85% less Program Space and w'e're running at between 6 to 10 times the speed.

Many might think that this is a great accomplishment and that I should stop at this point. Especially those that probably would have just used the larger routines and not had a second thought about it.

Well, obviously some of you don't know me very well.
I think I can pull a few more "Dirty Tricks" from out of my hat and turn this Dirty Littlle Routine into something totally scandalous. I'm not happy until the routine is so small that it pulls numbers from thin air, like magick...

Last Edited: Thu. Apr 13, 2006 - 08:24 AM

TITLE: DIRTY DAN"s MATH TRICKS: DIVISION-BY-TEN (Continued)

Let's re-examine what it is we're doing mathematically with our Dirty Little DIV10 Routine:

We're multiplying by 51, adding one, then dividing by two, and then truncating to get a final division of 256:

```A / 5 ~= [ [ ( A x 51 ) + 1 ] / 2 ] / 256
```

Let's ignore that final division by 256 because it's accomplished with ZERO program steps by ignoring the lower byte and there's no way we can improve on that unless there are opcodes that can move us backwards in time.

So we want to look closely at the right side of this equation:

```256 x [A / 5] ~= ( 51A + 1) / 2
256 x [A / 5] ~= 51A/2 + 1/2
```

Well 1/2 doesn't really mean much to a microprocessor, you either have a bit that is on or its off. We'll ignore it for now and if need be, we'll increment something in the end like we did with our DIV5 routine.

So now we have:

```256 x [A / 5] ~= 51A/2
256 x [A / 5] ~= 25.5 x A
256 x [A / 5] ~= 25A
```

Since we dropped a 1/2 earlier and now we're stuck with another 0.5 perhaps we should round the 25.5 upto 26 rather than truncate it down to 25.

```256 x [A / 5] ~= 25.5 x A
256 x [A / 5] ~= 26A
```

So to divide by 10 all we need to do is multiply by 26 and ignore lower byte:

```MUL10: LDI B,26
```

It sounds incredable, but let's check the results and see what our error ratio is:

``` 10 =>  1  Correct!
50 =>  5  Correct!
100 => 10  Correct!
150 => 15  Correct!
200 => 20  Correct!
250 => 25  Correct!
```

So it's 100% accurate over the range we need and it's only two program lines.

We started with rather large routines requiring 50 to 100 clock cycles per call and have something that runs in just three.

Can we reduce it even more?

Not really, not until Atmel comes out with a MULI command that allows us to multiply a register with an immedaite value rather than loading it into another register first.

Maybe next year...

It would probably look like this:

```DIV10: MULI A,26  ;ANSWER IN R1
```
Last Edited: Fri. Mar 31, 2006 - 09:14 AM

TITLE: DIRTY DAN"s MATH TRICKS: DIVISION-BY-TEN (Conclusion)

IN CONCLUSION:

By systematically applying a few Dirty Tricks and following our instincts combined with a little high-school math, we took some rather large routines and reduced them down into just a couple program steps. Something that many would say is impossible., but never say never my friend. If I can do it, so can you!

REQUEST FOR FEED BACK:

I hope you found this posting entertaining as well as educational.

If you would like to see more like it, please be sure to let the moderators know.

Thank you for your time and consideration.

Last Edited: Thu. Apr 13, 2006 - 08:27 AM
```MUL10: LDI B,26
```

Quote:

So it's 100% accurate over the range we need and it's only two program lines.

This is not true. For a 'div by 10' this is an approximation which will not yield correct results for all integers between 1 and 255 decimal (try for example 169, 159 ... for 'A' to see it fail). The 'scaled reciprocal' for a correct 'division by 10' of an 8 bit integer would have to be at least 9 bits to reach the required precision.
Please check out the mathematical basics for this technique at http://www.cs.uiowa.edu/~jones/bcd/divide.html

```                 DIV10a   DIV10b   LENZE
REGISTERS:         2        2       10
DIFFERENCE:      -8       -8        -
PERCENT:       -80%     -80%       -

PROGRAM WORDS:    5         7       36
DIFFERENCE:    -31        -29       -
PERCENT:      -86%       -80%      -

SPEED/CYCLES:     5         8       46
DIFFERENCE:    -41       -38        -
PERCENT:      -89%      -83%       -
```

are also incorrect: the '46 cycles' for 'Lenze' is including register saves (push/pop), the naked division routine w/o reg saves and getting the remainder is:

```	ldi	YH,0xCC		; scaled reciprocal for /10
ldi	YL,0xCD

; Q = A * 0xCCCD
; (r19:r18:r17[:rXX] = XH:XL * YH:YL)
clr     r2
mul     XH, YH		; ah * bh
movw    r19:r18, r1:r0
mul     XL, YL		; al * bl
mov	r17,r1		; r0 to [rXX] is superfluous
mul     XH, YL		; ah * bl
mul     YH, XL		; bh * al

; Q = Q >> 16: use r19:r18 as word
; Q = Q >> 3
lsr	r19		; do the last 3 shifts
ror	r18
lsr	r19
ror	r18
lsr	r19
ror	r18

; r19:r18 now "Q" (= result >> 19)
```

That's 25 cycles for a 'divide 16 bit number by 10' (result BTW is always correct over the full range 1 - 65535).

Andreas

Well now, Andreas... The argument could be made that RetroDan's solution is *more correct* than the traditional integer approach.

Of course, the "true" result is 16.9. But we cannot express 16.9 in an integer. So we round. The traditional approach for computer integer arithmetic is to always round down.

Mathemeticians would tell us that we should round anything from 0.0 up to 0.49999999... as 0, and anything from 0.500000000...01 up to 0.9999999999999999... as 1. There is a gray area for 0.5 exactly. I was taught a convention that in that case, you round to the even-numbered integer.

From that prespective, RetroDan's solution comes much closer to the "mathematically" correct solution. 169/10 = 17.

As for the amount of pushing and popping required... well, that's relative to the register usage conventions at hand. In the worst case (with all destoryed registers saved and restored), RetroDan's solution is still smaller and faster than the best-case (with no destroyed registers saved and restored) "Lenze" solution.

Last Edited: Fri. Mar 31, 2006 - 03:18 PM

Yup, actually I thought the *26 then ditch 8 bit thing was really clever. Obviously it works because 26/256 is 0.1015625 which for small numbers is OK as a way of multiplying by roughly 0.1. If you tried a larger value like 800 then I guess it'd be 800 * 26 = 20,800 and divided by 256 this is then 81 - so the extra 0.0015625 has come into play to create the error. But on small numbers it is a clever technique for an "integer divide"

In fact just checked and it's good for all values up to 630 in fact, the first one where it goes wrong is 631

Cliff

Quote:

From that prespective, RetroDan's solution comes much closer to the "mathematically" correct solution. 169/10 = 17.

Ah well, Luke - for '169', yes. But even if we assume an 'implicit rounding' scheme, this is not handled correctly, for ex. in the range 0-9 we get '0' as result up to dividend 9, ditto e.g. in the 50-59 range ('5' up to 59).

Mathematically correct - as to the mathematicans - would be the 'increment result' from '5' or '55' upwards. OP claims '100% correct' for the algorithm which it isn't.
It would be correct if it either were returning the remainder together with the result or providing the mathematically correct rounding you described.

Quote:
In the worst case (with all destroyed registers saved and restored), RetroDan's solution is still smaller and faster than the best-case (with no destroyed registers saved and restored) "Lenze" solution.

True, except for the fact that the OP is manipulatively comparing a 'sort of 8 bit integers division' (nice as long as you don't need correct results) with a 16 bit integers division (add another 6 cycles to get the remainder - blows it up to 31 ... geeeze ...) which, thanks to Mr Jones, does return correct results.

BTW: As this is the 'tutorial' forum (contents aimed at beginners) I honestly think that it would be more helpful if an approximation technique were explecitely labeled as such instead of trying to sell it off as real binary mathematics.

Andreas

alenze wrote:
BTW: As this is the 'tutorial' forum (contents aimed at beginners) I honestly think that it would be more helpful if an approximation technique were explecitely labeled as such instead of trying to sell it off as real binary mathematics.

I apologoize Alenze, as the comparison was not meant to in any way diminsh the excellent routines you have in your Div_xx Package. It is full of "tried and true" routine, and are backed by the work of Jones. If anyone is looking for "correct" routine I recommend them highly

However, I don't think you read the CAVEAT LECTOR warning in my first post. These postings are to discuss "Dirty Math Tricks" and hopefully have some fun at the same time. If anyone is uneasy about what I'm doing, I suggest they stick to the proven routines like the ones you and Atmel provide and forget my threads even exist.

"Closer to mathematically correct". I wasn't claiming that it was a miracle cure-all solution. In fact, it conveniently works out that way by pure coincidence of the range of numbers involved..

Anyway... The "traditional" approach to integer division (ignoring remainders for the moment), and Retro's "radical approach" to integer division are both wrong when compared to result using real numbers.

When compared to the "real number" result, the traditional results will always be less than or equal to the true result, with a maximum absolute error error of -1 LSB, and this maximum error is repeated consistently with a periodicity of 10 for all possible inputs.

RetroDan's results may differ from the "real number" result as well. The result may be greater than or less than the "real number" result. But it, too, will have a maximum absolute error of +/-1 LSB for the range of inputs we're interested in.

Note that the original intent of this thread was only to do 8-bit division by 10. And we actually only approach an absolute error of -1 LSB at the lower extreme with inputs of around 0, and +1 LSB at the upper extreme with inputs of around 630. In the middle of the range, the aboslute error will be limited to a band of +/- 0.5 LSB

In light of these results, I don't think it's fair to label his as "more wrong" than the traditional results.

Of course, if you want to have access to the remainders so that your application can fine-tune its interpretation of the result, then that's a strong argument in favour of some other approach.

Finding remainders:
- Multiply the result by 10. Label as 'x'.
- Subtract 'x' from the original input.
- This result is equal the the "Remainder". Using RetroDan's code, the remainder can either "positive" or "negative", but it will always have a magnitude of 10 or less. If negative, then feel free to subtract 1 from the quotient and add 10 to the remainder. And presto! You have a result that is totally identical to the "Lenze" result, for all possible 8-bit inputs.

```; RetroDan's original code
DIV10: LDI B,51
MUL A,B
INC R1   ;R1=A/5
LSR R1   ;R1=(A/5)/2 = A/10
; Luke's insertion
MOV C, R1
LDI B, 10
MUL C, B
SUB A, R0
; 'A' contains the (signed) remainder.
; sign of the remainder indicates whether it is an:
; - under-estimate (positive remainder), or
; - over-esitmate (negative remainder)
; 'C' contains the quotient
; Now a 100% _mathematically_ correct result for all 8-bit inputs.
; (Destroys the original input operand 'A'.
;  one could return the remainder in register
;  'B' at the cost of one extra MOV instruction.)
RET```

Regarding the division by 5 routine you said:

Quote:
Hmm, we're alway off by minus one, so if we add one, we should be right on the money!

But the only numbers that you checked were ones that were actually divisible by five. In fact, these are the only numbers where this routine returns the correct answer. If someone wants to use this, then some correction would definitely be necessary. What I would suggest is to not add the one, then find the remainder as lfmorrison points out. In this case the remainder will always be in the range of [0, 5]. If the remainder is 5, add one to the quotient and make the remainder 0. Otherwise, the quotient and the remainder are correct.

Applying this to the divide by 10 (the '* 51' version, not the '* 26' version), again if you drop the 'add one', the remainder is always in the range of [0, 10]. So again the only adjustment is for a remainder of 10.

Keep in mind that these remainder ranges are only valid if the original number is 8 bit. I quickly goes awry for larger numbers. Even 261 produces an out of range remainder. You could subtract 10 (or 5 for the div by 5) from the remainder instead of just making it 0 to extend the range. To be more specific, it will be correct as long as the answer fit in 8 bits.

Regards,
Steve A.

The Board helps those that help themselves.

One way to do this (sorry, lfmorrison's signed remainder isn't always mathematically correct - "we should round anything from 0.0 up to 0.49999999... as 0, and anything from 0.500000000...01 up to 0.9999999999999999... as 1":

```; Division by 10, useing reciprocal multiplication
;
; Call with:
;	8 bit dividend in r16
;
; Returns:
;	Result in r16
;	Remainder in r18

mov	r18, r16	; save original dividend
ldi	r17,26		; reciprocal, scaled *256, off a bit
mul	r16,r17

; result of multiplication now in r1:r0
; use only r1, thereby effectively divide by 256
dec	r1		; if imprecise scaling value influences result,
; result will be '+1'. Decrement to avoid neg.
; value in later subtraction
ldi	r17,10		; re-use r17 for divisor
mov	r16,r1		; save result/256
mul	r16,r17		; find remainder by multiplication of result by 10d

; result again in r1:r0
; get remainder and correct result of 'by 10'
sub	r18,r0
cpi	r18,10
brlo	done
subi	r18,10
inc	r16

done:
; result in r16, remainder in r18

;optional rounding:
cpi	r18,5
brlo	done_r
inc	r16
done_r:

```

(Quick hack, up to two or three cycles could be shaved off)

Andreas

I'm going to start with an apology here, because I might be thinking differently, and appear to be stiffling a very simulating and valuable discussion. However as a tutorial, I feel this thread is misplaced. It should be in the main forum.

My thoughts of a tutorial are as an introduction, a quickstart guide, a pointer to get going and excited. It is a place for Newbies to look and gain confidence, and make things. Get involved and learn. Good examples to simply use (and with luck considered later), just standard routines.

If my manager came up and said, "Eric, we stock a 4k device, the next part up is \$5 and that will take us under", then I would be fantastically interested in this thread. But If I had just found this site, looking for information, I might be scared away by the academic discourse of the moment. I'm looking for gentle introductions to do this, do that. If the device cannot divide by 10, and I cannot comprehend the details that experienced people are debating, then I'm off. Another Newbie scared away.

Engineering is about compromises."Gerald!, if you can do it in a day just get on with it.". Versus, "A new part in our inventory, can't you make it smaller". Such pragmatic stuff is key to success, but we have to be careful about introducing it here (apart from sowing thoughts).

I hope my opinion is understood in a positive sense. It is only one view among so many good people.
Cheers All.

Frankly, I don't buy that "negative remainder isn't correct" argument.

It's a non-standard representation. So are improper fractions. But just like improper fractions, the result is numerically dead-on.

AFAIC, 25 divided by 4 can be perfectly legally expressed as 6 remainder 1. And that is exactly identical to 5 remainder 5, or 7 remainder -3.

The only possible result from RetroDan's expression would still be guaranteed to be within 1 LSB of the true result. And the "traditional approach" cannot offer any better than getting within 1 LSB either.

But if it really bothers you, then you could always test the sign bit of the remainder... if it is set, then subtract 1 from the quotient and add 10 to the remainder.

```DIV10: LDI B,51
MUL A,B
INC R1   ;R1=A/5
LSR R1   ;R1=(A/5)/2 = A/10
; Luke's insertion
MOV C, R1
LDI B, 10
MUL C, B
SUB A, R0
; 'A' contains the (signed) remainder.
; sign of the remainder indicates whether it is an:
; - under-estimate (positive remainder), or
; - over-esitmate (negative remainder)
; 'C' contains the quotient
; We can return here with a totally valid and numerically correct result.
; But for the sticklers among us, let's convert it
; to a "cannonical form".
SBRS A, 7 ; test the sign of result.
RJMP PositiveRemainder
; We have a negative remainder; subtract 1 from
; the quotient and add 10 to the remainder.
DEC C
PositiveRemainder
; The result is now in perfect "cannonical" form.
RET```

And given the limited range of values we're concerned with here, that result (after a single check of the sign on the remainder, and possibly a single subtraction of 1 / addition of 10) will be in the "cannonical" form. (Ie. a guaranteed underesitmate in the result, and a remainder that is guaranteed to be positive, and bound between 0 and 9.)

Broxbourne: I agree 100%. The give and take on this subject should be hashed out in another forum. Then and ONLY then should it become a tutorial. After all, what is a beginner to think when seeing the differing opinions. Most beginners are in no position to judge who is right or wrong on the details. I would like to see it here once that is done.

Rick

Broxbourne wrote:
I'm going to start with an apology here, because I might be thinking differently, and appear to be stiffling a very simulating and valuable discussion. However as a tutorial, I feel this thread is misplaced. It should be in the main forum.

I hope my opinion is understood in a positive sense. It is only one view among so many good people.
Cheers All.

Thanks for the feed-back.

You may not know this, but actual tutorials are labeled [TUT] by moderator to distiguish them from other threads. The moderator has labeled this thread [DIS] for discussion. Another label is [CODE] for code segments that are not tutorials. I think this information can be found in the "Sticky-the-Bear" threads at the top of the list.

Dan, Thank you for the information. I did not realize the moderator was adjusting the titles of the thread according to the intended audience. I just thought this was the place to find tutorials and good examples of code and AVR applications.
Thanks for clarifying that. I'll make sure I only read [TUT] postings from now on to avoid getting confused by the more heady academic stuff, although I am still surprised to see such specialised discussion in a beginner forum.
Practically Yours.

Hi Luke,

Quote:
sorry, lfmorrison's signed remainder isn't always mathematically correct - "we should round anything from 0.0 up to 0.49999999... as 0, and anything from 0.500000000...01 up to 0.9999999999999999... as 1"

definitely didn't come through quite as intended (sorry about that):

Quote:
Frankly, I don't buy that "negative remainder isn't correct" argument

You are right and I phrased it badly: of course the signed remainder is correctly calculated (cleaner than in my example, too - 3 or four cycles saved).

It's the 'inbuild rounding' whose threshold is off by one, mathematically speaking ( :) ):

The quotient 'n.5' (6.5 for example if dviding 65/10) should already be rounded up to n+1 ('7'), here it is still left at '6'.
66/10 (6.6) is correctly rounded to '7' as 64/10 (6.4) becomes a rounded '6'. It's just the '.5' rounding which doesn't quite work (aren't those mathematicians a fascinating breed?).

Sorry again - apologies to Broxbourne, too, guess we got carried away a little here ...

Andreas

Thank you all for the wonderful feed-back. The information was invaluable and probably saved me a lot of head-scratching as you'll soon find out.

I was pleased-as-punch that my "quirky" little division by 10 routine worked. I was even more delighted to find out that it did a pretty good job of rounding to boot. However, when I got around to using it for my original purpose, I realized that rounding was the last thing I wanted.

The original use for the routine was to break-down an unsigned single byte for display on the Butterfly LCD. Well I was getting results like: 74, 75, 76, 77, 80, 80, 80, 81, 82, 83. Obviously the little DIV10 routine was working over-time and rounding, so a few fine adjustments were needed to "UN-ROUND" the results.

I know that you guys have already worked out your own versions, here is mine:

```;-----------------------------------;
; RETRO (SYNTHETIC) DIVISION BY 10  ;
; ANSWER IN R1, R0=REM, A:PRESERVED ;
;-----------------------------------;
DIV10:  PUSH B
LDI  B,26   ;MUL BY 26
MUL  A,B    ;R1=A/10
PUSH R1     ;BRUTE-FORCE CALC OF REMAINDER
LDI  B,10   ;CALC REM
MUL  R1,B   ;R0=10xR1(QUOT)
POP  R1     ;RESTORE QUOT
SUB  R0,A   ;SUBTRACT REMx10
NODJST: NEG  R0     ;MAKE POSITIVE
BRPL NONEG ;STILL NEG?
NONEG:   RET
```

If you're wondering why I calculate the remainder "backwards" then negate it, it's to preserve the value of A. If there's no need to preserve A then the remainder could be calculated the other-way-around and eliminate the need for the NEG R0 line. However, you should be forewarned that "letting go" a NEG R0 without very good cause could lead to a charge of racism! Other than removing that one statement I don't know if it can be optimized any further, but it sure would be nice.

this topic got me thinking a while ago.

I really liked the super simple idea of a divide by ten routine, but didn't like the error it has. So i was thinking a little further about the routine with the multiplier.

I was thinking the error came because of the difference between 26 and 25.6, and started thinking about a way to make this error smaller, so i came up with the idea of 65536/10 which would make 6553.6 or 6554 which has less error.

I wrote a test program on the ti83 to verify my idea, and found that it is correct up to 14 bits.
listing:

```:ClrHome
:0->G
:0->F
:6554->B
:For(A,1,20000)
:If (iPart(A/10)=iPart((A*B)/65536))
:Then
:G+1->G
:Else
:F+1->F
:If F<4
:Output(F+2,1,A
:End
:Output(1,1,G
:Output(2,1,F
:End
```

This told me the first error would occur at 16389.

So up to 14 bit accurate!
Enough for a 8 bit number

I wrote me some asm and came up with:

```div10:
; put your value in temp
;	ldi temp, 11					 ; example value is in temp
ldi mul_constL, low(6554)
ldi mul_constH, high(6554)			; load multiply value
mul temp, mul_constL				; mul lower
movw result_M2:result_L2, R1:R0		; store
mul temp, mul_constH				; mul higher
clr	result_H2
ret
; result is now in result_H2
```

The code above is hogging up to much vars. So i optimized it a little and came up with:

```div10:
; input value in temp
ldi tempL, low(6554)
ldi tempH, high(6554)				; load multiply value
mul temp, tempL						; mul lower
mov tempL, R1						; store
mul temp, tempH						; mul higher
clr tempH
ret
; result is now in tempH
```

this is better. :D

I tried running your 8 bit division code:

Quote:
;***************************************************************************
;* "div8u" - 8/8 Bit Unsigned Division
;*
;* This subroutine divides the two register variables "dd8u" (dividend) and
;* "dv8u" (divisor). The result is placed in "dres8u" and the remainder in
;* "drem8u".
;*
;* Number of words :14
;* Number of cycles :97
;* Low registers used :1 (drem8u)
;* High registers used :3 (dres8u/dd8u,dv8u,dcnt8u)
;***************************************************************************

.def drem8u =r15 ;remainder
.def dres8u =r16 ;result
.def dd8u =r16 ;dividend
.def dv8u =r17 ;divisor
.def dcnt8u =r18 ;loop counter

div8u: sub drem8u,drem8u ;clear remainder and carry
ldi dcnt8u,9 ;init loop counter
d8u_1: rol dd8u ;shift left dividend
dec dcnt8u ;decrement counter
brne d8u_2 ;if done
ret ; return
d8u_2: rol drem8u ;shift dividend into remainder
sub drem8u,dv8u ;remainder = remainder - divisor
brcc d8u_3 ;if result negative
clc ; clear carry to be shifted into result
rjmp d8u_1 ;else
d8u_3: sec ; set carry to be shifted into result
rjmp d8u_1

When I build this code, it gives me an error: "Register already defined by the .def directive". The above code refers to R16 with two different labels which should be okay. Is there a reason why it gives me this error?

Thanks,
John

I've gotten this error recently as well. I think that it used to work, but it has changed in recent versions of AVR Studio. Not sure if this is a bug or a deliberate change.

Regards,
Steve A.

The Board helps those that help themselves.

First, let me say that I have no clue about AVR coding. An Arduino board is in the mail, but I haven't gotten it yet. I love digital math, though. So here's a little different take on the OP's trick.

Dividing is the same thing as multiplying by the inverse, so dividing by 10 is the same as multiplying by 0.1. The problem is that there isn't a neat binary representation of 0.1. But it turns out that you can use a good enough 16-bit representation of 0.1 to get accurate results (meaning the same number as division by 10, rounded down as expected). With some simple fixed-point math you can then do your multiply, discard the mantissa (the bit after the . (or , for Europeans)) and there you go.

What fixed-point format to use? It'll be obvious in a minute. Let's just assume 8 bits of mantissa for now and start finding the binary representation of 0.1d (decimal). Really what you're looking for is which powers of -2 you need to add up to get as close to 0.1 as possible. If you do that, you find:
1/10 =~ 0/2 + 0/4 + 0/8 + 1/16 + 1/32 + 0/64 + 0/128 + 1/256
the binary representation for this is:
00000000.00011001

So that's the 16-bit number you should multiply the original number by to divide by 10. But you'll notice we can use extra mantissa bits, as long as the top byte is 0. (If the top byte isn't 0 we might get overflow when we multiply.) So let's use 11 mantissa bits:
1/10 =~ 0/2 + 0/4 + 0/8 + 1/16 + 1/32 + 0/64 + 0/128 + 1/256 + 1/512 + 0/1024 + 1/2048
or 00000.00011001101

Here it is all nicely coded up in C. I've tested it for 0-255 and it gives the same result as C's /10 operation for every one. gcc did some awful things to the assembly, but I'll leave that part up to you guys because you're much better at it than I am.

```uint8_t div10_fp(uint8_t number)
{
// 1/10th with 11 mantissa bits
// 1/10th = 0/2 + 0/4 + 0/8 + 1/16 + 1/32 + 0/64 + 0/128 + 1/256 + 1/512 +
// 0/1024 + 1/2048
// 0.00011001101
uint16_t tenth = 0xcd;
// result with 11 mantissa bits
uint16_t result = number * tenth;
9a:   99 27           eor     r25, r25
9c:   2d ec           ldi     r18, 0xCD       ; 205
9e:   30 e0           ldi     r19, 0x00       ; 0
a0:   82 9f           mul     r24, r18
a2:   a0 01           movw    r20, r0
a4:   83 9f           mul     r24, r19
a6:   50 0d           add     r21, r0
a8:   92 9f           mul     r25, r18
aa:   50 0d           add     r21, r0
ac:   11 24           eor     r1, r1
ae:   ca 01           movw    r24, r20
// Drop mantissa to round down.
return result >> 11;
b0:   89 2f           mov     r24, r25
b2:   99 27           eor     r25, r25
b4:   86 95           lsr     r24
b6:   86 95           lsr     r24
b8:   86 95           lsr     r24
}
ba:   99 27           eor     r25, r25
bc:   08 95           ret
```

You can do the same thing for dividing by any constant, and I'm actually kind of surprised gcc doesn't just do this for me.

Tim

That is a great idea froody!

I had a look at your code, and optimized it a bit.

```; the input number must be in register "input"
; value/10 is in register "result"
; register temp1 and R1:R0 are clobbered
div10:
ldi temp1, 205
mul temp1, input
lsr R1
lsr R1
lsr R1
mov result, R1
ret
```

edit...

i have written a little test program in ti basic.

```:ClrHome
:0->G
:0->F
:For(N,0,1200)
:If (iPart(N/10)=iPart((N*205)/2048))
:Then
:G+1->G
:Else
:F+1->F
:If F<5
:Output(2+F,1,N
:End
:Output(1,1,"G="
:Output(1,3,G
:Output(2,1,"F="
:Output(2,3,F
:End
```

It shows that it is correct up to 1028 and it fails at 1029, more than enough for 8 bit

brum, wow. gcc really sucks, huh? Nice job on optimizing the code. I'm going to install gcc 4.1 soon. Hopefully it won't suck as much.

Meanwhile, I clarified my original post a bit, and wrote some more, and now it's turned into a little article:

How to divide an unsigned 8-bit integer by a constant.
(c) 2006 by Tim Newsome
Thanks to RetroDan and several other on avrfreaks.net for the inspiration.

Dividing is slow, so let's avoid that. Instead, multiply by the inverse. You can write the inverse as a fixed-point binary number. Just like each digit in a normal binary number represents a power of 2, each digit after the decimal point in a binary number represents a power of -2. So:
10.11 = 1 * 2^1 + 0 * 2^0 + 1 * 2^-1 + 1 * 2^-2 = 2 + 1/2 + 1/4 = 2.75
Example: 0.5 = 1/2 = 0.1 (binary, 1-bit mantissa)
Example: 0.25 = 0/2 + 1/4 = 0.01 (binary, 2-bit mantissa)

Unfortunately there isn't a precise binary representation of most fractions, so you have to use an approximation. The same problem happens in decimal numbers, just not as often. For instance, 1/3 can be approximated as 0.3, which has an error of 0.3-1/3=-1/30. A better approximation is 0.33, which has an error of -1/300. The more digits you add, the smaller the error becomes. The same thing is true in binary:
1/3 =~ 1/2 = 0.1, error = 0.167
1/3 =~ 0/2 + 1/4 = 0.01, error = -0.0833
1/3 =~ 0/2 + 1/4 + 1/8 = 0.011, error = 0.0417
1/3 =~ 0/2 + 1/4 + 0/8 + 1/16 = 0.0101, error = -0.0208

Because we can't always find the precise inverse, we need to find one that's close enough, and gives us the right answer by rounding down always. (Rounding down just means ignoring the mantissa bits, which is easy to do.) That means we want to find an approximation that is at least as big as the real inverse. In other words, the error (computed as approximation minus real value) has to be positive. Furthermore, the error multiplied by 255 (our largest input value) must be smaller (less negative) than 1/divisor. For dividing by 3, the error must lie between 0 and 1/(3*255). Let's rework 1/3 keeping the error positive, and seeing how many bits we need before the error is small enough that we can use the approximation:
1/3 =~ 1/2 = 0.1, error = .167
1/3 =~ 1/2 + 0/4 = 0.10, error = .167
1/3 =~ 0/2 + 1/4 + 1/8 = 0.011, error = .0417
1/3 =~ 0/2 + 1/4 + 1/8 + 0/16 = 0.0110, error = .0417
1/3 =~ 0/2 + 1/4 + 0/8 + 1/16 + 1/32 = 0.01011, error = .0104
1/3 =~ 0/2 + 1/4 + 0/8 + 1/16 + 1/32 + 0/64 = 0.010110, error = .0104
1/3 =~ 0/2 + 1/4 + 0/8 + 1/16 + 0/32 + 1/64 + 1/128 = 0.0101011, error = .00260
1/3 =~ 0/2 + 1/4 + 0/8 + 1/16 + 0/32 + 1/64 + 1/128 + 0/256 = 0.01010110, error = .00260
1/3 =~ 0/2 + 1/4 + 0/8 + 1/16 + 0/32 + 1/64 + 0/128 + 1/256 + 1/512 = 0.010101011, error = .000651

So 9 bits of mantissa is enough, and our inverse is 0.010101011. Here is some C code that takes advantage of this:

```uint8_t div3(uint8_t number)
{
// Compute the intermediate value, with 9 mantissa bits.
uint16_t intermediate = (uint16_t) number * 0xab;
// Return the result, chopping off the mantissa bits.
return intermediate >> 9;
}```

A good compiler should compile that down to a single multiply instruction, a shift, and maybe a few extra instructions. This is drastically faster than using the generic divide code.

Let's do one more example, for division by 10. First figure out the inverse using the fewest number of mantissa bits:
1/10 =~ 1/2, error = 0.400
1/10 =~ 0/2 + 1/4, error = 0.150
1/10 =~ 0/2 + 0/4 + 1/8, error = 0.025
1/10 =~ 0/2 + 0/4 + 1/8 + 0/16, error = 0.025
1/10 =~ 0/2 + 0/4 + 1/8 + 0/16 + 0/32, error = 0.025
1/10 =~ 0/2 + 0/4 + 0/8 + 1/16 + 1/32 + 1/64, error = .00938
1/10 =~ 0/2 + 0/4 + 0/8 + 1/16 + 1/32 + 0/64 + 1/128, error = .00156
1/10 =~ 0/2 + 0/4 + 0/8 + 1/16 + 1/32 + 0/64 + 1/128 + 0/256, error = .00156
1/10 =~ 0/2 + 0/4 + 0/8 + 1/16 + 1/32 + 0/64 + 1/128 + 0/256 + 0/512, error = .00156
1/10 =~ 0/2 + 0/4 + 0/8 + 1/16 + 1/32 + 0/64 + 0/128 + 1/256 + 1/512 + 1/1024, error = .000586
1/10 =~ 0/2 + 0/4 + 0/8 + 1/16 + 1/32 + 0/64 + 0/128 + 1/256 + 1/512 + 0/1024 + 1/2048, error = .0000977
Finally we've reached the point where 255 * error is less than 1/10. So we need 11 mantissa bits, and our inverse is 0.00011001101. Here's the C code:

```uint8_t div10(uint8_t number)
{
// Compute the intermediate value, with 11 mantissa bits.
uint16_t intermediate = (uint16_t) number * 0xcd;
// Return the result, chopping off the mantissa bits.
return intermediate >> 11;
}```

So there you have it. Fast division by a constant of an unsigned 8-bit integer. If you followed what I've done here, you should be able to extend this to 16-bit or any other sized number.

Last Edited: Sun. Oct 22, 2006 - 10:50 PM

Froody,

Good article. However, in your code examples don't you need an explicit cast to the "number" parameter to force GCC to use 16-bit operations? I would expect GCC to do the multiplication as 8-bit first, then implicit cast to a 16-bit value when storing.

- Dean :twisted:

Make Atmel Studio better with my free extensions. Open source and feedback welcome!

abcminiuser wrote:
Good article. However, in your code examples don't you need an explicit cast to the "number" parameter to force GCC to use 16-bit operations? I would expect GCC to do the multiplication as 8-bit first, then implicit cast to a 16-bit value when storing.

Thanks. Good point about the typecast. I think it's OK, because the hex constant is implicitly of type int, which is 16 bits, on AVR (right?). So it all works out. But it certainly is safer to add the typecast, so I'll edit my post and stick it in there.

Tim

Quote:
You can do the same thing for dividing by any constant, and I'm actually kind of surprised gcc doesn't just do this for me.
...
gcc really sucks, huh?

I don't know how you can come to this conclusion from this thread. Up until your post this entire thread has been about assembly language and therefore has absolutely nothing to do with gcc. Also, the C language has absolutely no knowledge about fixed point numbers, so how could you possibly expect it to optimize your code in this manner?

Regards,
Steve A.

The Board helps those that help themselves.

Koshchi wrote:
Quote:
gcc really sucks, huh?

I don't know how you can come to this conclusion from this thread. Up until your post this entire thread has been about assembly language and therefore has absolutely nothing to do with gcc. Also, the C language has absolutely no knowledge about fixed point numbers, so how could you possibly expect it to optimize your code in this manner?

In the first post you quoted, I showed some C code among the assembly code. You can see that gcc's generated code is really inefficient. For instance it does a mul by a register known to be 0. (Note that I've since installed gcc 4.1.1 and it does a lot better, although there are still some obvious optimizations possible.)

Look at this C code:

```uint8_t function(void) {
volatile uint8_t val = 123;
return val / 10;
}```

When I compile it, I get a call to a general divide routine. But the compiler knows I'm dividing by a constant, so it could do the exact optimization I outlined. Effectively it would rewrite the code to look like:

```uint8_t function(void) {
volatile uint8_t val = 123;
return ((uint16_t) val * 0xcd) >> 11;
}```

That is actually slightly larger than the original code, but a lot faster. If you really care about code size you could just use the div10() function. Then function() would be 1 instruction shorter (because you don't have to pass 10 to the divide function) but you do take the size penalty of div10() (10 instructions in my gcc 4.1.1 build).

Tim

Try removing the "voltatile" keyword. You'll get a far better output.

There are pointy haired bald people.
Time flies when you have a bad prescaler selected.

daqq it's an old thread, nearly 3/4 of an year old
:)

/Safari

Still, I'm glad he brought it back since this will help me.

Edward

Nothing like a nice jolt when I saw who the OP was! :shock: Then I realized it's an old thread. :P

Wouldn't it be nice to not deal with the rounding error? If we were to add a compensation amount before multiplying by 26, we then have no error:

```.def temp1 = r16
.def temp2 = r17
.def A = r18

// Input: A
// Output: r1
DivBy10:

// calculate the amount needed to add to A to compensate for the rounding error to come
// compensation = (A + 3) / 64
mov temp2, A
ldi temp1, 3
// "divide by 64" shortcut
swap temp2
andi temp2, 15
lsr temp2
lsr temp2

// now use 'A - compensation'
mov temp1, A
sub temp1, temp2

// Multiply by magic number
ldi temp2, 26
mul temp1, temp2

ret```

I've checked it out on my computer and seems like this works good, and doesn't eat many more cycles than the original.

Quote:

```; the input number must be in register "input"
; value/10 is in register "result"
; register temp1 and R1:R0 are clobbered
div10:
ldi temp1, 205
mul temp1, input
lsr R1
lsr R1
lsr R1
mov result, R1
ret ```

This one rocks! Saved my day, today :)

i suppose one could allocate a huge look up table ln FLASH on the larger AVR range ... practically instant result which i have done many many years ago... (code lost to time)

Thanks to all for alively discussion. I need help on a 16 bit divide:
I have been using the DIV16u straight from apnote 1011 (AVR200) since 1998, and found a possible error.

When you have \$7080 (28,800) divided by \$3841 (14,401), the correct answer is: \$0001 and remainder \$FFF6 (1.99986).
However, the Atmel DIV16u returns \$0001 remainder \$383F, quite an error.

When you have \$7080 (28,800) divided by \$3840 (14,400), the correct answer is: \$0002 and remainder \$0000 (2.00000), and the Atmel code returns the correct same answer.

Vern Bunch, AVR consultant and AVR designer

<p>Vern</p>

vbunch wrote:
When you have \$7080 (28,800) divided by \$3841 (14,401), the correct answer is: \$0001 and remainder \$FFF6 (1.99986).
However, the Atmel DIV16u returns \$0001 remainder \$383F, quite an error.

Dude, WTF?

Let's do this manually, by long division.

\$3841 goes into \$7080 once.

```(hex assumed)
____1___
3841 ) 7080
3841 -
----
383F
```

\$3841 won't go into \$383F, so the quotient is 1 and the remainder is \$383F. The Atmel routine returns the correct answer.

The number \$FFF6 might possibly occur in some floating point representation, but can never happen in unsigned integer math, since it is larger than the dividend.

Oh, I see where you got it from.

If you want to generate the fractional bits, take the remainder, multiply by 16 (shift left 4) and divide this by the divisor. Repeat until you have enough "(hexa)decimal" places.

```      ____1FF...
3841 ) 7080
3841 -
----
383F <<4
3841 * F -
----
3821 <<4
3841 * F -
----
3641
... etc
```

Again, this is just confirming that the original division routine got it right. You didn't carry out enough steps to format the answer the way you expected.

Note that since the first remainder is guaranteed to be less than the divisor, each successive step is guaranteed to produce a quotient less than 16, ie a single hex digit. You have to manually concatenate these single digits to the number of places you want.

Thank you my friends for your correct responses. It is my fault that I mentioned "remainder" when I expected a fully usable result, without having to divide again, and one that had 16 bit accuracy. So the Atmel division routine is "correct" but unusable without more computation. Since there was no mention that the result is unusable without further division, I niavely skipped the true math definition of "remainder" and went right to "answer", I should not have done that.

I have since written an assembly language routine that gives me a correct (and directly usable) 16 bit division in one run, and it is less time (156 cycles) then the published 16 bit divide routine. It took 3 weeks at 104 hours a week, and I'm way beyond tired. :D

<p>Vern</p>

Quote:

I have since written an assembly language routine that gives me a correct (and directly usable) 16 bit division in one run, and it is less time (156 cycles) then the published 16 bit divide routine. It took 3 weeks at 104 hours a week, and I'm way beyond tired

I guess it's too late to say that you could have "borrowed" the source for a / routine from an open source C compiler?

Quote:
I guess it's too late to say that you could have "borrowed" the source for a / routine from an open source C compiler?

Actually I suspect in this case he could not, because the answer he wanted was in a fixed point format. He got simple integer division from the appnote. The compiler would have given him full float, which would challenge me to adapt to a fixed format in assembler (and I consider myself rather good at it).

As a matter of interest, if we did look at the source for a / routine in an open source C compiler - let's say, and why not, the GCC compiler - would we be looking at assembler, or C?

Quote:

would we be looking at assembler, or C?

Assembler. Here's a snippet from gcc/config/avr/libgcc.s:

```/*******************************************************
Division 16 / 16 => (result + remainder)
*******************************************************/
#define	r_remL	r26	/* remainder Low */
#define	r_remH	r27	/* remainder High */

/* return: remainder */
#define	r_arg1L	r24	/* dividend Low */
#define	r_arg1H	r25	/* dividend High */

/* return: quotient */
#define	r_arg2L	r22	/* divisor Low */
#define	r_arg2H	r23	/* divisor High */

#define	r_cnt	r21	/* loop count */

#if defined (L_udivmodhi4)
.global	__udivmodhi4
.func	__udivmodhi4
__udivmodhi4:
sub	r_remL,r_remL
sub	r_remH,r_remH	; clear remainder and carry
ldi	r_cnt,17	; init loop counter
__udivmodhi4_loop:
rol	r_remL		; shift dividend into remainder
rol	r_remH
cp	r_remL,r_arg2L	; compare remainder & divisor
cpc	r_remH,r_arg2H
brcs	__udivmodhi4_ep	; remainder < divisor
sub	r_remL,r_arg2L	; restore remainder
sbc	r_remH,r_arg2H
__udivmodhi4_ep:
rol	r_arg1L		; shift dividend (with CARRY)
rol	r_arg1H
dec	r_cnt		; decrement loop counter
brne	__udivmodhi4_loop
com	r_arg1L
com	r_arg1H
; div/mod results to return registers, as for the div() function
mov_l	r_arg2L, r_arg1L	; quotient
mov_h	r_arg2H, r_arg1H
mov_l	r_arg1L, r_remL		; remainder
mov_h	r_arg1H, r_remH
ret
.endfunc
#endif /* defined (L_udivmodhi4) */

#if defined (L_divmodhi4)
.global	__divmodhi4
.func	__divmodhi4
__divmodhi4:
.global	_div
_div:
bst     r_arg1H,7	; store sign of dividend
mov     __tmp_reg__,r_arg1H
eor     __tmp_reg__,r_arg2H   ; r0.7 is sign of result
rcall	__divmodhi4_neg1 ; dividend negative : negate
sbrc	r_arg2H,7
rcall	__divmodhi4_neg2 ; divisor negative : negate
rcall	__udivmodhi4	; do the unsigned div/mod
rcall	__divmodhi4_neg1 ; correct remainder sign
tst	__tmp_reg__
brpl	__divmodhi4_exit
__divmodhi4_neg2:
com	r_arg2H
neg	r_arg2L		; correct divisor/result sign
sbci	r_arg2H,0xff
__divmodhi4_exit:
ret
__divmodhi4_neg1:
brtc	__divmodhi4_exit
com	r_arg1H
neg	r_arg1L		; correct dividend/remainder sign
sbci	r_arg1H,0xff
ret
.endfunc
#endif /* defined (L_divmodhi4) */```

Just for fun, I have castrated :D the OP's div_10 code even further:

```;----------------------------------------
; 8-bit division by 10 routine.
;
; Input: R0 = dividend.
; Output: R1 = result = dividend/10.
;
; Registers R0 and R16 are clobbered.
; 5 words/10 clocks including RET.
;
div10:
ldi	r16,205
mul	r0,r16
ldi	r16,32
mul	r1,r16
ret
;----------------------------------------
```

Warning: Grumpy Old Chuff. Reading this post may severely damage your mental health.

The OP's div_10 code was buggy at values like 19, 29, 39, ... or at 21, 31, 41, ... depending on whether the correction (INC) was done or not. Here's a fixed version, with remainder calculation (remainder lines marked, strip them if not needed, saves 5 cycles):

```// input: A, output/value: R1, output/remainder: A
// total cycles: 13
PUSH A		//2 Me (for remainder)
INC A		//1 Me
CPI A, 0	//1 Me
BRNE x		//1,2 Me
DEC A		//1 Me
x:			//0 Me
LDI B, 51	//1 RetroDan
MUL A, B	//2 RetroDan
LSR R1		//1 RetroDan
POP A		//2 Me (for remainder)
SUB A, R1	//1 Me (for remainder)
```

Hope this is useful for someone out there.

My approach:

I made a list with values, RetroDan and RetroDan with INC. It looked somehow like this:
8 - 0 0
9 - 0 1
10 - 0 1
11 - 1 1

So one was too low, the other too high around exact matches (i.e. dividable by 10 with no remainder). So I took the one without INC (saves one cycle), INCreased the value before giving it to the RetroDan stuff, so all results were accurate - but "255" which becomes 0.

A quick check for the overflow after increasing (i.e. value=0), decrease it again (make it 255 again, which gives the correct result) and done.

P.S: I know it's an old thread, but I think it's an important one that's read by a lot of people for finding a good divide-by-10 routine.

Markus

But I would ask at first, whats the purpose to do a division by 10?

Mostly it was the decimal output conversion.
The fastest way to do so, was the subtraction method, since no divison nor multiplication was needed.
A small example for 16 bit ASCII output can be seen on:

https://www.avrfreaks.net/index.p...

Peter

Quote:
The fastest way to do so, was the subtraction method, since no divison nor multiplication was needed.
But repeated subtraction that that routine uses is division. And the multiplies when done with the hardware multiply are very fast.

Regards,
Steve A.

The Board helps those that help themselves.

Sorry for resurrecting such and old thread again but I just found it and feel I have a slightly different approach that may work better in the general case.

I have just finished (almost, still compile time to go!) doing a AC waveform generator. I went to great pains to avoid a single floating point number, and then bent over backwards to avoid any divides so that the needed library functions wouldn't be added in. Probably I will have acres of flash left that the math libraries could easily fit into, but it still feels like doing a better or at least 'artsier' design to use all fixed point arithmetic to do the variable frequency work. I went all out and 32bit index to step through a 512 entry array by right shifting by 23 first, and also did ramp function that further changed the step size by (0-127) then >>7 on the full number to keep maximum fractional accuracy. With 23 bits below the radix point the sine wave will be as step free as the PWM can make it.

But then it came time to display the current, actual frequency based on the ramped step size and I had to divide 32 fixed point integer step size by some insane number like 549756. And I didn't want to use any divides, and I was concerned that doing a multiply by 1/549756 would either overflow or lose too much accuracy. So I calculated the whole thing using only right shifts of the original number. Since this thread was supposed to at least partially tutorial, here is the general, iterative approach I used.

Take the bizarre number you want to divide by and get the 1/X. For the above example that is 0.100000000, for mine it is 0.00000181898878775.  Start off with an approximation, you can even start with zero. The subtract the approximation from the desired floating point value, y = 0.1000000 - 0, then take 1/y = 10 and pick the nearest value of 2^n, in this case 8=2^3 makes a good start. So now the approximation is 0.0 + 1/2^3 = .125000000. Now we repeat: y = 0.10000000 - 0.1250000 = 0.0250000; 1/y = -40 because our first approximation was too high. For the next level subtracting 2^5 = 32 looks good. So now we have our approximation 0.0 + .1250000 - 0.0312500 = 0.09375. Already we are within 10%. Now repeat again: y = 0.1000000 - 0.0937500 = 0.00625; 1/y = 160, so 2^7=128 is a close value. Now the approximation is 0 + 0.125 - 0.03125 + 0.0078125 = 0.1015625 and now we are about 1% off. Again: 0.1000 - 0.1015625 = -0.0015625=y; 1/y = -640; try -2^9 gives the approximation 0.099609375. For this particular example it turns out that the series 1/2^3 - 1/2^5 + 1/2^7 - 1/2^9 . . . with alternating signs every 2 powers of two. You can pick the precision you want by how many shifts. Depending on how fast the multiply operation is the trade off between the shift and add/subtract may eventually take longer, but on pretty much any CPU without floating point, a single right shift is going to be faster than multiply and often several right shifts will still be faster.

By the way, for the beginners, the AVR code for the above divide by ten would be

```uint8_t  a,b;
uint16_t c,d;
uint32_t e,f;

b = (a>>3) - (a>>5) + (a>>7);  // b = a/10
d = (c>>3) - (c>>5) + (c>>7) - (c>>9) + (c>>11) - (c>>13) + (c>>15);   // d = c/10
f =  (e>>3) - (e>>5) + (e>>7) - (e>>9) + (e>>11) - (e>>13) + (e>>15)  - (e>>17) + (e>>19) - (e>>21) + (e>>23) - (e>>25) + (e>>27)  - (e>>29) + (e>>31);   // f = e/10
```

Mike

mbushroe wrote:
By the way, for the beginners, the AVR code for the above divide by ten would be

Silly question but have you actually counted the cycles for that "sea of shifts" compared to just doing the div ? Certainly for the last one (f=) I bet a single call to the library divide code will cost less than 21/23/25/27.. shifts!

Have you done any tests?

100/10 =10

you get

12-3+0=9

```uint8_t a;
uint16_t b;
uint32_t c;

uint16_t temp1;
temp1 = a * 26;
a = high_byte(temp1)    // a = (a * 26)/256

uint32_t temp2
temp2 = b * 6554;
b = high_word(temp2)    // b = (b * 6554)/65536

uint64_t temp3
temp3 = c * 429496730;
a = high_dword(temp3)    // c = (c * 429496730)/4294967296```

When dividing by a constant, there is often a way to multiply by a value and then use upper bits for result...

(The above should be done in assembly for optimum efficiency)

David

Last Edited: Tue. May 24, 2016 - 03:15 PM

clawson,

You have a very valid point. I went back and looked at the App Note 200 which gave some information about multiply and divide subroutines and the code space and execution speed needed for each. I assume that these are code wise a single function that will handle all multiplies or divides of the correct sizes. I did not, however, see a single 32bit by 32 bit divide, so none of the offered routines comes close to reaching the large numbers and/or fine resolution fixed point arithmetic that I was using in my code. And the code as presented above might not be the best way to write it so that it compiles well. And I must also admit that although I have assembly coded 6502s, 68000s, 286s, and machine coded an earlier 16 bit machine I forget the name of, I have not assembly coded the ATMega or other embedded, RISC, memory mapped register CPUs before, so my register use may be a bit off.

If we scale back to the most difficult case handled in AppNote 200, the 16/16 with 16 result then we are looking at the middle case with _only_ 7 shifts right. Rewriting to a more confusing but closer to assembly code version we used nested short shift rights and add/subtract the next

```d = (c>>3) - (c>>5) + (c>>7) - (c>>9) + (c>>11) - (c>>13) + (c>>15);   // d = c/10

d = (c- (c- (c- (c- (c- (c- (c>>2) )>>2  )>>2  )>>2  )>>2  )>>2  )>>3; ```

using 2 16 bit combined registers, and assuming that the instruction times listed apply for pair registers as well as single

```                 clocks
xor cd,cd   1   // clear results rgister pair
ld  ab,_C   2   // load 16 number to be divided by 10
lsr ab,2    1   // do innermost shift, with result in >>15
movw  cd,ab 1   // since result is empty can straight load for first step
lsr ab,2    1   //  do next outer, result >>13
sub cd,ab   1   //  alternating signs in original expression mean always sub here
lsr ab,2    1   //  >>11
sub cd,ab   1
lsr ab,2    1   // >>9
sub cd,ab   1
lsr ab,2    1   // >>7
sub cd,ab   1
lsr ab,2    1   // >>5
sub cd,ab   1
lsr ab,3    1   // first one was >>3
sub cd,ab   1   // _C/10 now in cd
ld  _d,cd   2   // save result in memory
clocks  19```

This means that a specialized, one divisor only 16bit by 16bit divide would complete in 19 clock cycles, which is much faster then the execute optimized divide subroutine of 117 clock cycles. The advantage of using the routine is that it works for any divisor. But if your program needs only one dedicated division by a fixed number, you can essentially do the long division first by hand, and only load in the actual math steps that get used, with a nearly factor of 10 improvement in speed. Other divisors may not need as many shift terms to get adequate accuracy. And I don't know how much slower a 32bit by 32bit general purpose integer divide would be, accept that I would expect a minimum of 4 times the 16x16.

Mike

OOPS!

yes, you have me there. This method is still an approximation. The problem is that dividing by the smaller terms loses the fractional portion. An alternative would be to recast the 8 bit value 100 into a 16bit value where it is in the high byte, do exactly the same steps, then take only the high byte of the result. Then the lost fractional parts are nolonger lost and the answer is more correct (probably never perfect across all numbers). But having to use a 16x8 divide operation on an originally 8 bit value does seem a wast.

I guess that in order to get the exact answer you have to do the long division.

EDIT: Further testing show that small number sometimes read 1 high, large numbers sometimes read 1 low. This is due to having numbers small enough to only get non-zero responses from 1 or 2 terms, which are not as accurate as all 3 for the byte operation. You almost have to run the routine to approximate 1/10th, then multiply the result by ten and check for being over or under and then correcting by one. No longer sounds quiet so efficient.

Mike

Last Edited: Tue. May 24, 2016 - 07:56 PM