## Fast conversion of Integer to BCD; assembly atmega328p.

141 posts / 0 new

## Pages

Total votes: 0

Can the double dabble method be implemented faster?   How many cycles for 16bits-->5 digits?

https://en.wikipedia.org/wiki/Do...

When in the dark remember-the future looks brighter than ever.   I look forward to being able to predict the future!

Total votes: 0

Can the double dabble method be implemented faster?   How many cycles for 16bits-->5 digits?

https://en.wikipedia.org/wiki/Do...

When in the dark remember-the future looks brighter than ever.   I look forward to being able to predict the future!

Total votes: 0

Recursivity. Indeed, requires coffee.

Total votes: 0

read #18

Total votes: 0

When you add 6 to     9 you get   15.

When you add 6 to 249 you get 255.

249 is "F" for base 250', and 255 is "F" for 256, 9 is "F" for 10.

If there is a "Double dabble" between 16 and 10,

is there one between 16 and 250?

You can shift 8 bits very easy, and from 250 to 1000,

that's my point.

10^0-1 =   9
16^0-1 =  15 (F)

250^0-1 = 249
256^0-1 = 255

15-6 =     9
255-6 = 249

Last Edited: Wed. Feb 3, 2021 - 02:18 PM
Total votes: 0

Just info, I have never found a good way to use it but:

since 10 is even, you can shift once down and just remember if the number is odd or even.

Then do everything with 5 instead of 10.

if number was odd add one to 1's (will never overflow).

Total votes: 0

I'm already dividing first by 4, for these 4 reasons:

- 4*250=1000

- you can get the hundreeds digit from one byte:

2^10=1024 so, you need 10 bits for 999,

but the last two bits have no influence on the hundreeds

Shifting back 2 bits you get tens and units

- the number to convert is 2bits shorter so,

you don't have to worry about carry

- the number to convert is lower

As you said 10 divides by 2; 1000 divides by 8,

but 8 doesn't help 'couse of the 8 bit register.

Last Edited: Mon. Feb 1, 2021 - 04:18 PM
Total votes: 0

sparrow2 wrote:

The way OP do it he will never get under 50 clk. He will get close to my code around 70clk.

(I can never find the code that do it in less than 50 clk someone help! ).

And I think to get under 40 clk you will need one or more LUTs .

Possibly:

1. Divide by 100, and use the above code to convert the 2-digit remainder to BCD.

2. Repeat step 1.

3. convert 0-16 to BCD in 3 cycles

I have no special talents.  I am only passionately curious. - Albert Einstein

Total votes: 0

how do you do 3. in 3 clk? (easy to do in 5 clk)

add:

And why [0..16] and not [0..6]

Last Edited: Tue. Feb 2, 2021 - 08:32 AM
Total votes: 0

If I take your code, use it(as it is and as model of thinking), and combined with mine(4x250),

can be arround 60 clk, 16 bits, result 0 to 06:55:35 BCD packed.

I take this as a first target.

Total votes: 0

It's easier, for me, to divide by 1000, than by 100;

and I have purified my code.

Already have a first "draft", and seems to work;

still have to do the testing, and the comments, but I feel I have to sleep. Now.

Total votes: 0

Hi Catalin,

I am impressed with your somewhat unusual approach.

On an ATmega328P I had all 2^16 numbers converted ​​(without errors ). The timing envelope per scope is as follows:

At 16 MHz, the average conversion time is 5 µsec, which means that 80 CPU cycles average per conversion can be concluded, including RCALL, RET and I/O instructions.

While all numbers ​​are being run through, a strong jitter between approx. 4.5 µsec and 5.7 µsec can be observed, which is caused by the many queries.

I recently implemented the well-known Douglas W. Jones method using my structured AVR assembler s’AVR for calculating large factorials, see http://www.led-treiber.de/html/f...

Converting all 2^16 numbers ​​using this Bin2Dec algorithm results in the 2nd timing envelope (also @ 16MHz):

The conversion time for almost all numbers is constantly 6 µsec, which means 96 CPU cycles (including RCALL, RET and I/O instructions).

There are only very few numbers (namely 185, all >= 49146) that require up to 6.8 µsec pure conversion time.

In case you don't mind: I've rewritten your assembler program for my s’AVR precompiler (and kept the replaced instructions as comments):

```;
; Two_Bytes_To_BCD.asm
;
; Created: 1/30/2021 6:43:29 PM
; Author : cata
;
; structured AVR assembler by s'AVR

.cseg
.org 0x0000
jmp reset

.dseg
.org SRAM_START		; 0x0100  (m328pdef.inc)

.cseg
.org INT_VECTORS_SIZE ; 0x34 (m328pdef.inc)

reset:
.DEF SIX=R2

ldi	r16,6
mov	SIX,r16		; SIX=R9=6

ldi r16,low(12735)
ldi r17,high(12735)

; General Ideea: divide R17:R16 by 1000 (rest 999)
; I. Divide by 4
lsr r17
ror r16
ror r3
lsr r17
ror r16
ror r3	; div. by 4 (remainder stored to r6)

; II. Divide by 250 (max. \$3fff)

; R17 is Quotient and R16 is Remainder
;	initially for the 256 division
;	and finally for the 250 division

; How many times we had to ADD SIX,
; based on the 256 Quotient? That is Quotient*SIX
mul r17,SIX  ; R1=Quotient, R0=Remainder (new ones)

; We have new Q and R but, we have to do this here.
; the 4 instr. below do to things
; first  - if Remainder>249, INC Quotient and
;							 add SIX on Remainder
; second - assures No Carry on next "add SIX" on Remainder
; cpi r16,250
; brlo PC+3
IF r16 == #250
add r16,SIX
inc R17
ENDI
; we can't do it before the multiply; 'cause we multiply R17

; the new Quotient can be only 0 or 1; max. 3f*6= 017A
; or r1,r1		; set ZERO flag if r1=0; r1 is new Q
; breq PC+3		; branch if zero
IF r1 == #0
inc R17		; increment Quotient(R1 was one)
add r16,SIX   ; add SIX; because we INC Quotient
ENDI
add r16,r0	  ; we add Remainder on the old one
; brcc PC+7		; but if we have CARRY?
IF C
inc r17		; increment Quotient
; cpi r16,250	; can't simply add SIX    (I)
; brlo PC+3		; and that is because of the overflow
IF r16 >= #250
add r16,SIX
inc r17
ENDI
add r16,SIX	; now we can add SIX; see (I)
ENDI
; and, again Remainder should be lower than 250
; cpi r16,250
; brlo PC+3
IF r16 >= #250
add r16,SIX
inc r17
ENDI
mov r4,r17		; but we need R17(on SUBI) !
; R4 will be the Remainder
; now  R4 = Quotient   on 250 division
; and R16 = Remainder  on 250 division

; we have a 10 bit Remainder (see I. Divide by 4)
; never mind we can manage; see below
clr r17	; hundreds digit for the Remainder
; cpi r16,150
; brlo PC+3
IF r16 >= #150
subi r16,150
mov r17,SIX
ENDI
; cpi r16,75
; brlo PC+3
IF r16 >= #75
subi r16,75
subi r17,-3
ENDI
; cpi r16,25
; brlo PC+7
IF r16 >= #25
subi r16,25
inc r17
; cpi r16,25
; brlo PC+3
EXITIF r16 < #25
subi r16,25
inc r17	; hundreds digit for the Remainder
ENDI
lsl r3		; and ReStore "the lost two bits"
rol r16
lsl r3
rol r16

clr r18		; max. 99 byte conversion to BCD
; cpi r16,60		; if greater than sixty
; brlo PC+3
IF r16 >= #60
mov r18,SIX	; the tens would be at least SIX
subi r16,60   ; and substract 60
ENDI
; cpi r16,30		; if initial number or previous SUB
; brlo PC+3
IF r16 >= #30
subi r16,30
subi r18,-3
ENDI
; cpi r16,10		;  ... and so on
; brlo PC+7
IF r16 >= #10
subi r16,10
inc r18
; cpi r16,10
; brlo PC+3
EXITIF r16 < #10
subi r16,10
inc r18
ENDI
; now you can review the previous
; "hundreds digit for the Remainder"
; because it is the same thing

swap r18		; these are for BCD packing
or r16,r18		; and freeing R18
mov r0,r16		;    and R16

mov r16,r4		; remember we left the Quotient on R4
clr r18		; max. 69, byte conversion to BCD
; cpi r16,30		; this is smaller, because 69<99
; brlo PC+3
IF r16 >= #30
subi r16,30
ldi r18,3
ENDI
; cpi r16,20
; brlo PC+3
IF r16 >= #20
subi r16,20
subi r18,-2
ENDI
; cpi r16,10
; brlo PC+3
IF r16 >= #10
subi r16,10
inc r18
ENDI
swap r16		; again BCD packing
or R17,r16		; Result in R18:R17:R16
mov r16,r0		; used registers: R0,R1,R2,R4,R16,R17,R18
; s'AVR: R3 as well

.UNDEF SIX
; less than 80 cycles on my estimation
LOOP
nop
; rjmp PC-1
ENDL
```

After compiling, of course, the result is the same assembler code as the original one (if I haven't made any mistakes), see attached.

I'm curious what your next version will look like.

Edit: Due to the test loop, the timings include RCALL, RET and I/O (1x each), which is 9 CPU cycles.

## Attachment(s):

Last Edited: Tue. Feb 2, 2021 - 06:14 PM
Total votes: 0

That's very good feedback; I saw that is =80*62.5nS=5uS,

but this ideea about measuring with the Oscilloscope helps me a lot.

Thank You.

On two bytes, as I said, seems to be ~60*62.5=3.75uS at 16MHz, this time.

Feel that will be better at 3 bytes, too.

Now,  think I should stop my mind,

and get to the  A. Studio.

Total votes: 0

Ok I found the fast code (less than 50 clk for 16bit):

https://www.avrfreaks.net/commen...

And since that is faster than my code (less than 70 clk for 16bit), I haven't look at it since, I guess that it can be optimized to around 60 clk, but not lower.

Add:

So perhaps all in a good combo of the different things make a better solution, but if it take more than 50 clk, it needs to be smarter in some other way, (small use less registers).

And in some situations it matter which digit can come out first (start sending first digit before the hole number is found)

Last Edited: Tue. Feb 2, 2021 - 12:45 PM
Total votes: 0

I'll try my best after test and post the self imposed "mixed version".

That is for promoting the idea of "divide by 1000" on a larger than 2 bytes conversion method.

For two bytes "divide by 1000" it will not be otherwise my first aproach. Your method (Excel) is better.

Thank You.

Total votes: 0

Att. tested 0-65535.

; if I counted well it is between 55 to 69 clk with INPUT
;  or 53 to 67 if we do not count INPUT on R17:R16 (AVG 60clk)
; if you don't use MACRO and make a CALL it is 63 to MAX 77 clk
;  I will say it is:    63 clk as MACRO       70 clk as CALL
;
; Average on not counted input, from 0 to 65535    = 53.53 clk
;        ( measured&calculated based on AtmelStudio Counter )
; ( and this supports my "hand counting" -min.53 without INPUT)

Thank You.

PS.

Saw one mistake in comments: it is previous 5 instr.; that is without MUL; counting still good

" mul r17,SIX    ; that is 6 times QB256 + possible RB250 in RB256
; R1 can be only 0 or 1; max. 3f*6= 017A; \$ff shifted right 2 times
lsr r1            ; r1=0 and possible 1 shifted to C
brcc PC+4        ; branch if zero (evident)
inc R17        ; increment Quotient(R1 was one)
add r16,SIX   ; add SIX; because we INC Quotient
brcs PC-2    ; possible one do-it-again on Carry resulted from last instruction
; On previous 6 instructions"

PS2. last version(minor modif. on comments)  only in Att. at first Post ( #1 )

## Attachment(s):

Last Edited: Tue. Feb 2, 2021 - 04:58 PM
Total votes: 0

Hope Wikipedia will mention "divide by 1000" someday, as well.

Total votes: 0

OK, here's the envelope for your # 2:

The fastest conversion was 3.88 µsec @16 MHz or 62 cycles in total, equal to 53 cycles for the conversion, since 9 cycles can be subtracted for RCALL, RET and 1x I/O due to the test loop.

I also ran El Tangas' algorithm for all 2^16 numbers:

Constant 3.63 µsec @16 MHz result in 58 cycles in total or 49 cycles for the conversion, as indicated in the source code.

Edit: Updated correct screenshot.

Last Edited: Tue. Feb 2, 2021 - 05:56 PM
Total votes: 0

Can you give me a more precise direction. Think I'm lost. I'm looking for:

Search found 1289 items

• El Tangas
Total votes: 0

see link given at #65

Total votes: 0

Thanks a lot.

I'll look. Just I can't  pass the fact that this is unpacked, and I Packed the BCD. This is "first look" only and I might be Really wrong.

`unpacked BCD in R24:R23:R22:R21:R20`
Total votes: 0

Yep, it is 5 digits unpacked, like the quite old algorithm by Douglas W. Jones, which I have used for AVR (see above).

Total votes: 0

Is it the rjmp from the start of the Code Segment? The code itself itself is 49 clk; counted by "my hand" and atmel Studio.

.cseg
.org 0x0000
r jmp reset

Also seems like a very good aproach, as in "Merge Sort".

It is shorter and much more "elegant" than mine.

But, I didn't sign to enter a competition.

Simply, I wanted just to share an Idea.

Last Edited: Tue. Feb 2, 2021 - 06:19 PM
Total votes: 0

Packing cost few +steps/cycles; and we seem to be "On the Edge".

Nothing more than the evidence.

Total votes: 0

Yes, it is on the ideea "Your method (Excel) is better. " formulated in #66 .

Never knew who's the author, and may be not originated from by sparrow2. Please excuse.

Excel can't be (or it is not imaginable by me, at this point) used on realy large numbers; three bytes up.

And when I say Excel, I mean VISICALC. This is not I'm rude, it is just a way of speaking.

I realy do not want to ofend anybody, it might be I'm simply to tired.

Below it's how I found the "magic":

ldi r20,164    ; found "164" with Excel, as suggested by:

mul r20,r16    ; https://www.avrfreaks.net/users/...

Last Edited: Tue. Feb 2, 2021 - 06:59 PM
Total votes: 0

sparrow2 wrote:

how do you do 3. in 3 clk? (easy to do in 5 clk)

add:

And why [0..16] and not [0..6]

I got mixed up with the 3-instruction packed BCD 0-16 code I posted earlier in this thread.

So 0-6 is what it should be after dividing by 100 2x.

I have no special talents.  I am only passionately curious. - Albert Einstein

Total votes: 0

sparrow2 wrote:

Ok I found the fast code (less than 50 clk for 16bit):

https://www.avrfreaks.net/commen...

I count exactly 40 instructions and 50 cycles, excluding the instruction.

I have no special talents.  I am only passionately curious. - Albert Einstein

Total votes: 0

It was on the three bytes conversion.

Your code is good:

cpi r18, 10

brlo 1f

subi r18, lo8(-6)

1:

max 256^3-1=16 777 215 ; here we talk about 16

I'm know at 256^2 (only two bytes) so max 65535/1000. gives us max 65.

I'm dividing first by four and next by 250. So I have 65 (think I got it wrong somewere speaking about QB250 on two bytes); it is only one.

This added later: found the "wrong" posted on #1, and corrected (fyi)

The remainder is 535; 999<1024 and fits in 10 bits. Q in base 250, is max. 65 and fits in one byte.

Yes, BCD unpacked needs two. This is not the case on my final result 06:55:35; this is only not to be confused with 655360.

Did I understood the problem?

Like this:

see also #13  Posted : Sat. Jan 30, 2021 - 09:55 PM (Reply to #11)

Last Edited: Wed. Feb 3, 2021 - 11:31 AM
Total votes: 0

This is posted two times. I thoght I can just delete it.

I replied on to that code strating at #74.

I counted 49 Clk. So did Atmel Studio. May be it is a just a slithly diff. in code.

Code has no start rjmp, so it might be interpreted like 51 that if runing like s'AVR did on #63,

but it is not 51. It's plain and clear 49clk.

Last Edited: Tue. Feb 2, 2021 - 10:12 PM
Total votes: 0

I replied on to that code strating at #74.

I counted 49 Clk. So did Atmel Studio. May be it is a just a slithly diff. in code.

Code has no start rjmp, so it might be interpreted like 51, if runing like s'AVR did on #63,

but it is not 51. It's plain and clear 49clk.

Recovered from my email box, here's the mtext I've replied to:

"Constant 3.76 µsec @16 MHz result in 60 cycles in total or 51 cycles

for the conversion, although it should actually be 49 according to the source

code.

I wonder where those 2 cycles are hidden ... "

Last version, of my code,  only in #1.

Last Edited: Tue. Feb 2, 2021 - 10:20 PM
Total votes: 0

Of course, if you really wanted to convert hex to decimal fast, get external hardware.  Given that the point of converting to decimal is for display, and going straight through to seven segment displays, it would not be difficult to wire up two 22V10 GALs to turn 16 bits of hex into four digits decimal on a four-digit seven-segment; and even display  'Err' if it overflows 9999.

You might be able to do it with 16V8s - Think you'll run low on pins, though.  And you have to ping-pong digits - each chip will only display one digit at a time.  A CPLD like the now-gone (and much lamented!) Xilinx XC9572 wouldn't even have that problem.

So you need seventeen pins (two eight-bit ports and a latch signal (or ten - one port and two latches, but that's slower)) and you've got it out of the AVR in four(ish) cycles.  Digit ping-ponging is left as an exercise for your external circuitry.    S.

Total votes: 0

Given that the point of converting to decimal is for display, and going straight through to seven segment displays, it would not be difficult to wire up two 22V10 GALs

I was actually wondering (insanely) how it would look to use a logic simplification for each segment (maybe combing some common terms), rather than a "calculation"  approach

like: digit 2, seg E=  b15*b12*b3 + b7*b3 *(b2 + b5)  ...etc

...would be "interesting" to see the logic minimization.....get out your quine-mcluskey or other methods

When in the dark remember-the future looks brighter than ever.   I look forward to being able to predict the future!

Total votes: 0

In fact I was pretty good with Hexa printed on my Tera Term.

Simply my mother visited and she saw what I was looking at.

She never asked me to do something, after I said "it's Hexa".

She left and probably my mind got screwed somehow.

Total votes: 0

avrcandies wrote:

get out your quine-mcluskey or other methods

I have programs that do that for me.

But a preliminary think says it should be doable.  Have not yet built and tested a solution - if you don't like, your money back.  S.

Total votes: 0

I'll simply make a table. The real chalenge will be sorting the table;

a permutation of BRLO's, in these large GB ram we have today,

works very nicely with multicore technology. Don't you think?

Total votes: 0

Catalin Ioan Stanciu wrote:

"it's Hexa".

For some of my very early AVR designs, I didn't even bother to convert output ASCII into hex.  I just added 0x30 and learned to interpret the punctuation marks that came out when it overran the 0-9 range.   S.

? = F, among others.  S.

Total votes: 0

I've replied in romanian(to my mother), and translated only the HIGH part to you.

The LOW got out twice, unintentionally untranslated and undetected.

Last Edited: Tue. Feb 2, 2021 - 11:37 PM
Total votes: 0

Lost in Translation ...

I have a strange brain which works funny on me.

Now that hex is gone, I felt like I lost something and

I thought "it surely must be speed".

Total votes: 0

Just a month into the year and we are already discussing the classic binary to decimal conversion. This time I was busy and arrived late to the party

I guess I'll just post some links to our previous discussions, even some that have already been mentioned. This way, next time I can just refer to this thread

https://www.avrfreaks.net/forum/optimizing-libc-integer-conversion-routines

https://www.avrfreaks.net/forum/integer-string

https://www.avrfreaks.net/forum/fractional-division

https://www.avrfreaks.net/forum/division-only-bitwise-shifts

https://www.avrfreaks.net/forum/fast-dividing-5-and-10-uint32t-input-data-atmega16a

https://www.avrfreaks.net/forum/avr-assembler-extract-each-3-digit-number-each-register-r23-r24-r25

Total votes: 0

You're most wellcome!

It is very good to have both you and the links here.

And now to the point.

A computer should do what I want, and not me to do
the way someone else likes it.
Multiplication is the way we (humans) do division.
On large numbers we do division by 1000; thousands millions billions trillions etc.
This is one method to do /1000, applied by chance
on conversion from binary Int to BCD.

I hoped I can pass this to others.

PS1. Last version, of my codes,  only in #1.  Please excuse inherent and inadvertent errors.

Sources are tested, but not the comments. Suggested comments are welcome.

PS2. On my first post I intended to go four bytes and up, but I had to go down to two, for explanations.

Last Edited: Wed. Feb 3, 2021 - 11:46 AM
Total votes: 0

Catalin Ioan Stanciu wrote:
49
Catalin Ioan Stanciu wrote:

This is posted two times. I thoght I can just delete it.

I replied on to that code strating at #74.

I counted 49 Clk. So did Atmel Studio. May be it is a just a slithly diff. in code.

I think you are looking at the incomplete version that's missing the clr r1.  If you count the instructions excluding ret in the final version, it's 40 instructions, 10 of which are mul that take 2 cycles, so the total is 50 cycles.

I have no special talents.  I am only passionately curious. - Albert Einstein

Total votes: 0

it's 39+10, on the you reposted from sparrow link,  at. #78

"sparrow2 wrote:

Ok I found the fast code (less than 50 clk for 16bit):

https://www.avrfreaks.net/commen...

I count exactly 40 instructions and 50 cycles, excluding the instruction."

Last Edited: Wed. Feb 3, 2021 - 12:46 PM
Total votes: 0

Catalin Ioan Stanciu wrote:
I count exactly 40 instructions
Just an idea but isn't it a whole heap easier to get the AS7 simulator to count the cycles for you? (no chance of error that way).

Total votes: 0

Sorry, I counted 39+10, on a post(reposted) by ralphd , and he said it's 50.

added later: #78 Posted : Tue. Feb 2, 2021 - 11:19 PM, by ralphd

I already counted with AtmelStudio, and emphasize it, in previous posts.

BTW. Nice to have you here, Sir.

Last Edited: Wed. Feb 3, 2021 - 01:13 PM
Total votes: 0

the clr r1 is only to make the C compiler happy, nothing to do with the algorithm!

Total votes: 0

Thank You.

Total votes: 0

Now when talk about packed and unpacked (and sometimes ASCII), it depends of how the result is used!

It don't make sense to make a fast algorithm if the you need to do a mapping later.

In most cases you don't want packed BCD (yes I know it only take 3 clk to unpack a byte).

The most common way to place the result (16 bit int version) in 5 registers i order so you can read the result via a pointer.

Note Don't work on AVR0 AVR1 etc. sinse they are not compatible with "real" AVR's

Last Edited: Wed. Feb 3, 2021 - 02:29 PM
This reply has been marked as the solution.
Total votes: 0

Catalin Ioan Stanciu wrote:
A computer should do what I want, and not me to do
the way someone else likes it.

Indeed. The beauty of this classic problem is that it can be solved in many interesting ways, that's why I like it. You can see I've been interested in this for almost 20 years:  https://board.flatassembler.net/topic.php?t=3924

For example, your approach:

Catalin Ioan Stanciu wrote:

I'm working on a two byte version, hoping that it would be easier to understand the algorithm.

I do believe that is the algorithm that causes you difficulties.

It's similar to making the division by hand, with a pencil.

int(32735/256)=127    mod(32735/256)=223  ; you don't have to do this because it's just BYTE1 and BYTE2

on the other hand we are looking for:

int(32735/250)=130    mod(32735/250)=235

Based on 256=250+6: this means that 127*6+223=985

now int(985/256)=3  mod(985,256)=217   ; we are kiping score on the quotient!

and again 3*6+217=235,   BUT THIS TIME the rest is smaller than 250.

So the result will be 127+3=130 and the rest is 235

Does this make sense?

Took me a while to figure out, and that is a good thing

32735 = 127*256 + 223 = 127*(250 + 6) + 223 = 127*250 + 127*6 + 223 = 127*250 + 985

and then we divide 985 by 250.

I like it.

Last Edited: Wed. Feb 3, 2021 - 03:46 PM
Total votes: 0

You are the first; you noticed the algorithm. Great!

cited below from my Two Bytes:

";  depending on INPUT. And 19 to 33 clk from start."

that's after /1000

So, I made the Two Bytes /1000 in less than 33clk.

Of course it'll take longer on many bytes; but is a begining.

That's the Ideea.

Thank you.

Last Edited: Wed. Feb 3, 2021 - 04:12 PM
Total votes: 0

The title of this post is:

"Fast conversion of Integer to BCD; assembly atmega328p"

But,  it is just to make an example of a /1000 usage.

I dislike competition but, I joined it to make my point.

You are completely right about packed and unpacked,

and you helped me a lot.

Thank You.