Efficient 32-bit divide by 10 with remainder

12 posts / 0 new
Author
Message

Does anyone have a fast AVR assembler algorithm to divide a 32-bit unsigned number by 10 with remainder, i.e. as required for simple decimal output conversion? The traditional shift and subtract method takes about 400 clock cycles, so anything significantly faster than this would be useful.

Well divide by 10 is the same as multiply by 0.1 I guess and the binary representation of 0.1 is 0.0001100110011001100110011001100110011001100110011..(recurring) which is basically saying that 1/10th of a number is 1/16th + 1/32th + 1/256th + 1/512th + 1/4096th + 1/8192th + etc.

So if you take your number and shift it 4 places (1/16th) and put that into an "accumulator". Then shift it once more (1/32th) and add that to accumulator then shift 3 times, add that, once more add that, 3 times add that, once more, add that, 3 times, add that, once more add that and so on - stopping once the original value is 0 then you should have 1/10th (ish!) in the accumulator.

Having said all that I only just thought it up and cannot guarantee that it WILL work but I don't see why not. If it does work then the chance are I just re-invented the wheel!

Cliff

PS Just noticed you need a remainder too - it's not immediately obvious to me how that one "falls out" I'm afraid (well it is Monday morning!)

PPS Except to notice that x10d is x1010b so it's shift 2 + shift 4. Do that on result then subtract from original to get remainder.

clawson wrote:
Well divide by 10 is the same as multiply by 0.1 I guess and the binary representation of 0.1 is 0.0001100110011001100110011001100110011001100110011..(recurring) which is basically saying that 1/10th of a number is 1/16th + 1/32th + 1/256th + 1/512th + 1/4096th + 1/8192th + etc.

So if you take your number and shift it 4 places (1/16th) and put that into an "accumulator". Then shift it once more (1/32th) and add that to accumulator then shift 3 times, add that, once more add that, 3 times add that, once more, add that, 3 times, add that, once more add that and so on - stopping once the original value is 0 then you should have 1/10th (ish!) in the accumulator.

Having said all that I only just thought it up and cannot guarantee that it WILL work but I don't see why not. If it does work then the chance are I just re-invented the wheel!

Cliff

PS Just noticed you need a remainder too - it's not immediately obvious to me how that one "falls out" I'm afraid (well it is Monday morning!)

PPS Except to notice that x10d is x1010b so it's shift 2 + shift 4. Do that on result then subtract from original to get remainder.

Trouble is, 1/10 is a recurring binary fraction, so how far down do I have to carry out this process? Down to 32 bits, I suppose, which means at least 16 32-bit additions, not to mention lots of 32-bit shifts. And I would need a lot of extra registers to store partial results. And then of course I have to multiply back and subtract to get the remainder - not too bad using the MUL instruction, but I still think the whole process would take longer (and use a lot more code) than the 400-odd cycles for the conventional method.

I thought I had a solution by using the MUL instruction to multiply each byte by 25 or 26, take the MS byte of the result, multiply back by 10 and correct the result by +/-1 - messy but possibly faster than bit-by-bit - then I realised that after the first byte I would have a carry and so would have to divide a 2-byte number in the range 0-2559 by 10, and my method didn't seem such a good idea any more.

I just wondered whether, back in the old days of very restricted instructuin sets (and no hardware multiply or divide), the pioneers of programming had devised a cunning method of decimal conversion that I could apply to one of the restricted instruction sets of today. It's not looking very promising - even Knuth seems to bypass the problem by imagining a computer that works in decimal.

It's not too much of a big deal. A Mega168 running at 20MHz can convert five numbers to decimal in less than a millisecond - it would just be neat to get it down a bit faster.

PS. Just thought of a possible optimisation of your idea: Start by multiplying the original number by 3 (i.e. shift left 1 and add the original number - though of course the number might now be 34 bits!), then you need to multiply by .000010001 etc. This mostly involves only shifting by 4 (after the first shift), which can be done quickly using the SWAP instruction. I still think it would be slower to do the whole 32 bits at once like this, but maybe to divide 0-2559 by 10 and process a byte at a time (applicable to numbers of indefinite length) ... hmmm, needs more thought.

Take a look at this project (16-bit divide) and see f You can expand it to 32 bits?

As of January 15, 2018, Site fix-up work has begun! Now do your part and report any bugs or deficiencies here

No guarantees, but if we don't report problems they won't get much of  a chance to be fixed! Details/discussions at link given just above.

"Some questions have no answers."[C Baird] "There comes a point where the spoon-feeding has to stop and the independent thinking has to start." [C Lawson] "There are always ways to disagree, without being disagreeable."[E Weddington] "Words represent concepts. Use the wrong words, communicate the wrong concept." [J Morin] "Persistence only goes so far if you set yourself up for failure." [Kartman]

timbierman,

Yup for a full 32 bit number you would need quite a few shift/adds (15 worst case I think) but on smaller numbers, because you break out as soon as the original goes to 0 it could break much earlier. So the length of time to execute will be dependent on how big the original number is.

Cliff

JohanEkdahl wrote:
Take a look at this project (16-bit divide) and see f You can expand it to 32 bits?

Do you mean the "16/16 bit division" thread? I've looked there and it just seems to point to app note 200, which describes the traditional shift and subtract methods from antiquity. What I am looking for is a short cut, particularly in the case of dividing by ten, if it exists.

Or am I missing an attachment to your post?

You may want to have a look at
( Div16_xx Math Lib)
It's a performance optimized implemantation of 'division by reciprocal multiplication' based on Mr. Jones excellent primer for this stuff http://www.cs.uiowa.edu/~jones/bcd/divide.html
:)

Andreas

clawson wrote:
Yup for a full 32 bit number you would need quite a few shift/adds (15 worst case I think) but on smaller numbers, because you break out as soon as the original goes to 0 it could break much earlier. So the length of time to execute will be dependent on how big the original number is.

Cliff

I'm not sure how you'd decide that your result is going to be zero when you are shifting and adding - the next add might well make it non-zero. However, your general point is good: most numbers will be no more than 24, 16 or even 8 bits to begin with, and during decimal output the number will always shrink as conversion proceeds. I can skip 8, 16 or 24 times around the shift-subtract loop by shifting leading zeros off a byte at a time at the beginning.

Not quite the magic bullet I was looking for, but it could easily cut execution time by about half.

Always useful to talk these things through with someone else :)

@Tim:

```;***************************************************************************
;*
;* Function "Div16_10"
;* Divides an unsigned 16 bit word (XH:XL) by 10
;* Returns quotient in YH:YL and remainder in XL
;*
;* Author: Andreas Lenze (andreas.lenze@t-online.de)
;* Equations by D. W. Jones:
;*
;*	Reciprocal mul w. extra precision:
;*	unsigned int A;
;*	unsigned int scaled_reciprocal = 0xCCCD;
;*	unsigned int Q; /* the quotient */
;*
;*	Q = ((A * 0xCCCD) >> 19)
;*
;* Uses: high regs: 7 (r17, r18, r19, X, Y)
;*	 low regs:  3 (r0, r1, r2)
;*
;*	 words:     36 (w. push/pop = 8 words)
;*	 cycles:    46 (w. push/pop = 16 cycles)
;*
;* Note: Hardware multiplier required ("mul" instruction)
;*
;***************************************************************************

Div16_10:
push	r2
push	r19
push	r18
push	r17

ldi	YH,0xCC		; scaled reciprocal for /10
ldi	YL,0xCD

; Q = A * 0xCCCD
; (r19:r18:r17[:rXX] = XH:XL * YH:YL)
clr     r2
mul     XH, YH		; ah * bh
movw    r19:r18, r1:r0
mul     XL, YL		; al * bl
mov	r17,r1		; r0 to [rXX] is superfluous
mul     XH, YL		; ah * bl
mul     YH, XL		; bh * al

; Q = Q >> 16: use r19:r18 as word
; Q = Q >> 3
lsr	r19		; do the last 3 shifts
ror	r18
lsr	r19
ror	r18
lsr	r19
ror	r18

; r19:r18 now "Q" (= result >> 19)
; R = A - 10*Q;
ldi	r17,10		; multiply r19:r18 by 10
mul     r18, r17	; al * bl
sub	XL,r0
clr	XH
movw	YL,r18		; make copy of Q

; XL holds "R"
; YH:YL holds "Q"
pop	r17
pop	r18
pop	r19
pop	r2

ret

;**** End of function Div16_10 -----------------------------------------****
```

This is from above mentioned 'academy project' - uses 30 cycles (w/o push/pop) for a 16 bit divide. If you're looking for 'magic bullets', this *could* be a starting point for further research.

Andreas

Last Edited: Mon. Nov 14, 2005 - 01:08 PM

alenze wrote:
You may want to have a look at
( Div16_xx Math Lib)
It's a performance optimized implemantation of 'division by reciprocal multiplication' based on Mr. Jones excellent primer for this stuff http://www.cs.uiowa.edu/~jones/bcd/divide.html
:)

Thanks - that's the sort of thing I was looking for. I'll have to see how easily it generalises to 32 bits. My first concern is that I might need a second bank of 4 registers to accumulate what is basically a 64-bit result. The program in which I need to do this has most of the register bank permanently allocated to a time-critical interrupt routine, so I can't borrow registers temporarily for foreground tasks. Still, it may be possible.

Sorry, timbierman, I totally failed to paste the link to the project I wanted to tip You about. All is well that ends well though as alenze posted a link to the very same project (the reciprocal division stuff). Hope I didn't cause too much confusion!

As of January 15, 2018, Site fix-up work has begun! Now do your part and report any bugs or deficiencies here

No guarantees, but if we don't report problems they won't get much of  a chance to be fixed! Details/discussions at link given just above.

"Some questions have no answers."[C Baird] "There comes a point where the spoon-feeding has to stop and the independent thinking has to start." [C Lawson] "There are always ways to disagree, without being disagreeable."[E Weddington] "Words represent concepts. Use the wrong words, communicate the wrong concept." [J Morin] "Persistence only goes so far if you set yourself up for failure." [Kartman]

JohanEkdahl wrote:
Sorry, timbierman, I totally failed to paste the link to the project I wanted to tip You about. All is well that ends well though as alenze posted a link to the very same project (the reciprocal division stuff). Hope I didn't cause too much confusion!

No problem. I don't have time to check out a 32-bit version of the routine in the project right now, but it looks promising. So far I've managed to knock 50% off total message formatting time by using the multiply-by-reciprocal method for 8-bit divisions, and optimising 32-bit divisions where the top 1, 2 or 3 bytes are zero. Not a bad day's work so far.

(This may all seem like cheese-paring, but the messages I need to convert come in at anything up to one every 2 milliseconds, so shaving half a millisecond off processing is a big improvement).