## Simple formula for binary to BCD conversion


Inspired by the interesting discussions about fast binary to decimal/BCD conversions, I thought about how such a conversion could be expressed with a simple formula.

Here it is for packed BCD encoding, in the C language (I would assume this is state of the art):

```c
unsigned long bin2bcd (uint32_t n)     // convert 16 bit binary n to 5 digit BCD (packed)
{
    n = n + 6 * (n/10 + 16*(n/100) + 256*(n/1000) + 4096*(n/10000));
    return n;
}
```

Using GCC, this function runs on an ATmega328P in typically 158 µs at 16 MHz.

This is the timing envelope for converting all numbers from 0 to 2^16-1 (screenshot not reproduced here).

The screenshot was taken using this program, which also lists the decimal and BCD numbers on an attached terminal:

```c
    for (i = 0; i <= 65535; i++)
    {
        LED_ON;
        m = bin2bcd(i);
        LED_OFF;
        printf("%5lu : %05lX\n", i, m);
    }
```

As expected, the given function runs much slower with GCC than with the discussed routines written in AVR assembly language.

It would be interesting to know how fast this formula executes on an FPGA.

Of course you can check the formula with Excel as well.

For example if you got a decimal number in cell A2, the 5-digit BCD number can be represented in cell B2 as follows:

=DEC2HEX(A2+6*(INT(A2/10)+16*INT(A2/100)+256*INT(A2/1000)+4096*INT(A2/10000));5)

Enjoy!
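Since the packed result carries one decimal digit per nibble, getting it onto a display is just a matter of peeling nibbles. A small sketch (the helper name is mine, not from the thread):

```c
#include <stdint.h>

// Unpack a 5-digit packed-BCD value into an ASCII digit string.
// Each nibble of the BCD word is already a decimal digit 0-9.
void bcd_to_ascii(uint32_t bcd, char out[6])
{
    for (int i = 4; i >= 0; i--) {
        out[i] = (char)('0' + (bcd & 0xF));  // lowest nibble = ones digit
        bcd >>= 4;
    }
    out[5] = '\0';
}
```

For example, `bcd_to_ascii(0x65535, buf)` yields the string "65535".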

Well, 2528 cycles certainly seems "dirt slow" compared to the others... shows how a lot of cycles can get unnecessarily tossed into the trashcan.

At least it's a handy packed conversion, and wasting cycles happens anyhow when a system is "doing nothing".

When in the dark remember-the future looks brighter than ever.   I look forward to being able to predict the future!

s'AVR wrote:
the given function runs much slower with GCC
There's four divisions in there. Why are you surprised?

But isn't this comparing apples and oranges? Surely the C code needs to express the same algorithm as being done in the Asm?

EDIT: for example this thread has a number of C algorithms that claim to be "fast" and on the whole avoid divisions:

https://www.avrfreaks.net/forum/...

Last Edited: Fri. Feb 5, 2021 - 10:45 AM

s'AVR wrote:

It would be interesting to know how fast this formula executes on an FPGA.

Given a half-decent FPGA, it could process every bit in parallel, thus one cycle (with some daisy-chained LUTs, but still - a fistful of nanoseconds).  The output's only 20 bits.  At this point the speed limit is "How fast can you get the input data out of the AVR".  S.

clawson wrote:
There's four divisions in there. Why are you surprised?

But isn't this comparing apples and oranges? Surely the C code needs to express the same algorithm as being done in the Asm?

Sure, I’m fully aware of the math behind the given formula and not surprised about the slow speed.

However, I wonder whether the simple formula (which is linear in certain sections) cannot be implemented in a more clever - but still simple - way (whether it is C or assembly) to speed it up.

And don’t worry, I’ve got a collection of fast conversion routines written in assembly. It’s just a general question.

OK, FPGAs are getting bigger, but I don't think you can get one that can hold everything in 1 clk.

But 3 clk is possible.

If the speed is 1 GHz, that would be about 300 MHz output speed.

But 50-100 MHz output speed would be realistic to make.

The way I looked at it, sparrow2, was that you have 20 output bits, and each can have 65,000(ish) inputs, therefore you need (absolute worst case!) about 1.4 million logic gates.  A midrange Xilinx Virtex chip these days has 2+ million gates, and the top-end of the line is close to 9 million.

This, of course, is entirely without optimization:  e.g. the three LSBs of the output are exactly the same as the LSBs of the input - so I just optimized away about 130,000 gates.  I'm sure there's many many many more that'll be redundant.  S.

ETA:  Of course, far cheaper than a big FPGA (those things can be hundreds of USD each) would be three big parallel EEPROMs (assuming 2^16x8) as a great thumping lookup table, and that would be one cycle.  S.

Last Edited: Fri. Feb 5, 2021 - 12:12 PM

I was thinking of the way the OP solved it.

Then we need to define a gate (normally a 2-input NAND), and for speed one level deep (or one LUT deep), but with 16 inputs it's bigger than a gate.

But if it's just one big LUT, then it would be better with ROM (flash): 16-bit address and 20 bits out, i.e. 192K of flash, so even a big AVR could solve that. That would give about 2 MHz output speed (useless unless data in and/or out are moved via DMA).
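The 192K figure comes from 65536 entries times 3 bytes. As a host-side sketch (array name and byte layout are my assumptions), such a table could be generated once with the OP's formula and then placed in flash or burned into an external ROM:

```c
#include <stdint.h>

#define ENTRIES 65536u

// 192 KB lookup table: 3 bytes of packed BCD per 16-bit index,
// little-endian (65536 * 3 = 196608 bytes, as estimated above).
static uint8_t lut[ENTRIES * 3];

static void build_lut(void)
{
    for (uint32_t n = 0; n < ENTRIES; n++) {
        uint32_t bcd = n + 6 * (n/10 + 16*(n/100) + 256*(n/1000) + 4096*(n/10000));
        lut[3*n + 0] = bcd & 0xFF;
        lut[3*n + 1] = (bcd >> 8) & 0xFF;
        lut[3*n + 2] = (bcd >> 16) & 0xFF;
    }
}
```

On a large AVR the table would live in flash (PROGMEM) and be read back with three LPMs; on a host you could dump the array to a file for an EEPROM programmer.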

s'AVR wrote:
the simple (sic?)  formula

So how are you defining "simple" here?

Clearly, for a lowly 8-bit processor like the AVR, division is not "simple" ...

Top Tips:

1. How to properly post source code - see: https://www.avrfreaks.net/comment... - also how to properly include images/pictures
2. "Garbage" characters on a serial terminal are (almost?) invariably due to wrong baud rate - see: https://learn.sparkfun.com/tutorials/serial-communication
3. Wrong baud rate is usually due to not running at the speed you thought; check by blinking a LED to see if you get the speed you expected
4. Difference between a crystal and a crystal oscillator: https://www.avrfreaks.net/comment...
5. When your question is resolved, mark the solution: https://www.avrfreaks.net/comment...
6. Beginner's "Getting Started" tips: https://www.avrfreaks.net/comment...

One of my more fundamental assumptions here is that there is absolutely no reason to convert to BCD except for display, so there never is any reason to read the output of the LUT back into the AVR, and the resulting BCD must be sent out.  That given:

Writing out two bytes hex is by definition faster than writing out three bytes packed BCD (not even considering the AVR's internal gymnastics for pointing to and fetching from the LUT; three LPMs to get the result out of your flash lookup are nine cycles just by themselves).  Another advantage to external ROM is the outputs automatically show up parallel - 20 bits on 20 pins (24, actually, since nobody makes 20-bit output parallel EEPROMs).  You might want yet another 'latch' signal indicating that the AVR is done outputting bytes, but with a synchronized clock that could be overlooked.  That's three cycles, including 'latch'.

A different problem is that those big parallel EEPROM chips are slooo...  70ns or worse.  If you want faster, get parallel SRAM and load them up with the LUT from an equally large serial EE every time you power-up.  FPGAs (at least Xilinx's ones) do that anyhow.  S.

Scroungre wrote:

This, of course, is entirely without optimization:  e.g. the three LSBs of the output are exactly the same as the LSBs of the input - so I just optimized away about 130,000 gates.

Can you explain what you mean?  I don't see any 1:1 mapping between the input bits of a binary number and the packed BCD output.

I have no special talents.  I am only passionately curious. - Albert Einstein

For example, if the least significant bit of the input hex is set, then the number is odd, and therefore the least significant bit of the packed BCD is also set - because it's also an odd number.  Numbers larger than that do require some gating.

eta: And the hex input numbers 0x0000 through 0x0009 are bit-equal to the packed BCD output of 00000 through 00009.  Does that help any?

S.

Last Edited: Fri. Feb 5, 2021 - 01:21 PM

Scroungre wrote:

For example, if the least significant bit of the input hex is set, then the number is odd, and therefore the least significant bit of the packed BCD is also set - because it's also an odd number.  Numbers larger than that do require some gating.

eta: And the hex input numbers 0x0000 through 0x0009 are bit-equal to the packed BCD output of 00000 through 00009.  Does that help any?

S.

Yes, the last bit for even/odd is obvious, but I was looking for 3 bits that mapped 1:1 given your comment, "the three LSBs of the output are exactly the same as the LSBs of the input "

I'm guessing you didn't mean that they are always the same?


Scroungre wrote:
One of my more fundamental assumptions here is that there is absolutely no reason to convert to BCD except for display

Who do you expect to read the display at that speed?

ralphd wrote:

Yes, the last bit for even/odd is obvious, but I was looking for 3 bits that mapped 1:1 given your comment, "the three LSBs of the output are exactly the same as the LSBs of the input "

I'm guessing you didn't mean that they are always the same?

Well, clearly they are not.

0x10 = 00010000b

10    = 00001010b

edit: I think it's important to understand an algo before trying to optimize it. So I spent a while figuring out the OP algorithm, first by converting small numbers. For numbers up to 99, you need to add 6*int(number/10) to get the BCD representation. Then it fails for 100; you need to add an extra 6*(int(number/100) << 4) to generate the higher-order BCD digit, and so on for each higher-order nibble.

For example, for 523: 523 + 6*(52 + 16*5) = 523 + 792 = 1315 = 0x523.

Last Edited: Fri. Feb 5, 2021 - 03:26 PM

I was interested in what the logic minimization equations are for each segment.  Is the total 200 terms? 1000 terms? 130,000 terms?  Are there a lot of common terms that can be reused between segments to reduce the total number?

What I'm saying is: if I gave you a handful of 3-input NAND gates and some protoboard, what would it take to decode 16 input bits to around 35 output segment bits (5 digits; the most significant digit might not use all segments)?   Obviously there are old-school decoders for 4-bit to 7-seg.

I'm gonna warm up my soldering iron.
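For the old-school route, the per-digit decode is a fixed 10-entry table, and driving five digits from the packed result is just nibble-peeling. A sketch (the segment encoding, bit 0 = a through bit 6 = g, active high, is an assumption to match to the actual display):

```c
#include <stdint.h>

// Classic BCD-to-7-segment patterns, bit 0 = segment a ... bit 6 = g,
// active high (an assumed encoding; adjust for the real hardware).
static const uint8_t seg7[10] = {
    0x3F, 0x06, 0x5B, 0x4F, 0x66, 0x6D, 0x7D, 0x07, 0x7F, 0x6F
};

// Expand 5 packed-BCD digits into 5 segment patterns, ones digit first.
static void bcd_to_segments(uint32_t bcd, uint8_t out[5])
{
    for (int i = 0; i < 5; i++) {
        out[i] = seg7[bcd & 0xF];
        bcd >>= 4;
    }
}
```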


Last Edited: Fri. Feb 5, 2021 - 03:54 PM

So, to make the actual C versus Asm comparison that the OP is apparently trying to do in this thread: what does the recently developed algo (from that other thread) look like simply expressed in C?

Obviously a C build will be slower/bloatier (it's kind of bound to be when compared to hand-crafted Asm), but it would be interesting to see a true apples-for-apples comparison.

clawson wrote:

what does the recently developed algo (from that other thread) look like simply expressed in C?

From what I understood, the algo from the other thread was actually a fast division by 1000. Its use for BCD conversion was just an "excuse" to present the division method.

The algorithm from this thread is based on "there is some magic number we can add to a hex number to convert it to its BCD representation; here is a formula to calculate such a number."

So they are not even that related... this one has only one instance of division by 1000, and replacing that division with the fast algo wouldn't make a big difference.

El Tangas wrote:

ralphd wrote:

Yes, the last bit for even/odd is obvious, but I was looking for 3 bits that mapped 1:1 given your comment, "the three LSBs of the output are exactly the same as the LSBs of the input "

I'm guessing you didn't mean that they are always the same?

Well, clearly they are not.

0x10 = 00010000b

10    = 00001010b

edit: I think it's important to understand an algo before trying to optimize it. So I spent a while figuring out the OP algorithm, first by converting small numbers. For numbers up to 99, you need to add 6*int(number/10) to get the BCD representation. Then it fails for 100; you need to add an extra 6*(int(number/100) << 4) to generate the higher-order BCD digit, and so on for each higher-order nibble.

I did some experiments this morning, and came to the same conclusion.  Here's some prototype code from my notes:

```
    cpi  num, 10
    brlo done
    cpi  num, 26
    brlo done
    ...
```


I discovered that the OP's formula can be written in a different way. I would explain how to derive it, but unfortunately I was writing that up when I lost the whole post.

Meanwhile, I leave this demo https://godbolt.org/z/cTWva9

The drop-in replacement for the OP's function is:

```c
// magic numbers
// tens: 6
// hundreds: 6*10 + 6*16 = 156
// thousands: 6*10*10 + 6*10*16 + 6*16*16 = 3096
// 10000s: 6*10*10*10 + 6*10*10*16 + 6*10*16*16 + 6*16*16*16 = 55536

#define M10 6
#define M100 156
#define M1000 3096
#define M10000 55536

unsigned long bin2bcd (uint32_t n)     // convert 16 bit binary n to 5 digit BCD (packed)
{
    unsigned long result = n;
    while (n >= 10000) {
        n -= 10000;
        result += M10000;
    }
    while (n >= 1000) {
        n -= 1000;
        result += M1000;
    }
    while (n >= 100) {
        n -= 100;
        result += M100;
    }
    while (n >= 10) {
        n -= 10;
        result += M10;
    }
    return result;
}
```

I didn't put any effort into optimizing it, but I believe it should be somewhat faster, since there are no divisions; I used successive subtraction instead (I know, not the fastest method either).

s'AVR wrote:

```c
unsigned long bin2bcd (uint32_t n)     // convert 16 bit binary n to 5 digit BCD (packed)
{
    n = n + 6 * (n/10 + 16*(n/100) + 256*(n/1000) + 4096*(n/10000));
    return n;
}
```

If I worked it out right, everything inside the brackets fits in 16 bits; you only need to go up to 32 bits for the *6 and the final add, so why not

```c
static uint32_t bin2bcd(uint16_t n)
{
    return n + 6ul * (n/10 + 16*(n/100) + 256*(n/1000) + 4096*(n/10000));
}
```

I got about 115 µs on an ATmega328P running at 8 MHz.

You might want to double check it gives the right answer still.

Optimized repeated subtraction:

```
    ldi  d5, -1 + '0'                  ; ten-thousands
1:  inc  d5
    subi r20, lo8(10000)
    sbci r21, hi8(10000)
    brcc 1b

    ldi  d4, 10 + '0'                  ; thousands
2:  dec  d4
    subi r20, lo8(-1000)
    sbci r21, hi8(-1000)
    brcs 2b

    ...
```


The code you show is an ASCII version (but could easily be changed to packed).

The next step for a faster worst case is to deal with:

(the 1st digit is a special case)

deal with 4000 (and adjust the digit by 4),

then deal with -1000 on the same digit;

deal with 400 (and adjust by 4),

then deal with -100 on the same digit;

...

I think all this is in a thread somewhere, either for ASCII or unpacked BCD, but it could easily be changed to packed (saves an init instruction for every 2nd digit).
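One reading of that outline (my interpretation, not sparrow2's code): take the weight off in steps of 4x first, then 1x, so each digit costs at most 2 + 3 loop passes instead of up to 9. For the thousands digit, sketched in C:

```c
#include <stdint.h>

// Extract the thousands digit of *n (assumed < 10000) and leave the
// remainder in *n. Big steps of 4000 first, then steps of 1000.
static uint8_t digit_1000(uint16_t *n)
{
    uint8_t d = 0;
    while (*n >= 4000) { *n -= 4000; d += 4; }  // at most 2 passes
    while (*n >= 1000) { *n -= 1000; d += 1; }  // at most 3 passes
    return d;
}
```

For example, n = 9500: two passes of 4000 leave 1500 with d = 8, one pass of 1000 gives d = 9 and remainder 500.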

sparrow2 wrote:

The code you show is a ASCII version (but could easy be changed to packed).

Yeah, just start off with:

```
    ldi  d5, -1
```

... and instead of:

```
    ldi  d4, 10 + '0'                  ; thousands
2:  dec  d4
```

do:

```
    ldi  d43, 0xA0                     ; thousands and hundreds
2:  subi d43, 0x10
```

Then, as you pointed out, you save an init (ldi) on the hundreds.


Yup. I've made a mathematical discovery (new to me at least) and this is something I cherish more than "optimization contests" and the like, because it happens only rarely.

This algorithm in fact works for packed or unpacked BCD the same, and even for any "source base" to "target base" conversion.

I already mentioned the "magic numbers" in post #20:

El Tangas wrote:

```c
// magic numbers
// tens: 6
// hundreds: 6*10 + 6*16 = 156
// thousands: 6*10*10 + 6*10*16 + 6*16*16 = 3096
// 10000s: 6*10*10*10 + 6*10*10*16 + 6*10*16*16 + 6*16*16*16 = 55536
```

These are for packed BCD, in which the source base = 10 and the target base = 16. You can see the factors are all made up from these numbers; even the "6" is the difference between the bases, i.e. 16 - 10.

For unpacked BCD, source base = 10 and target base = 256, therefore the magic numbers are:

```c
// magic numbers
// tens: 246
// hundreds: 246*10 + 246*256 = 65436
// thousands: 246*10*10 + 246*10*256 + 246*256*256 = 16776216
// 10000s: 246*10*10*10 + 246*10*10*256 + 246*10*256*256 + 246*256*256*256 = 4294957296
```

example: https://godbolt.org/z/8Tn477

There are other interesting applications I'm exploring. I have to thank s'AVR.

edit: I will use the OP's notation for an Excel example of unpacked BCD (as I mentioned before, my notation and the OP's are equivalent):

=DEC2HEX(A2+246*(INT(A2/10)+256*INT(A2/100)+256*256*INT(A2/1000)+256*256*256*INT(A2/10000));10)
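The general rule behind both tables can be captured in a few lines. For target base T (16 for packed, 256 for unpacked), the magic number for the 10^k place is (T - 10) * (10^(k-1) + 10^(k-2)*T + ... + T^(k-1)). A sketch (the function name is mine):

```c
#include <stdint.h>

// Magic number for the 10^k place when re-encoding decimal digits into
// target base T: (T - 10) * sum over i of 10^(k-1-i) * T^i, i = 0..k-1.
static uint64_t magic(unsigned k, uint64_t T)
{
    uint64_t pow10 = 1;
    for (unsigned i = 1; i < k; i++)
        pow10 *= 10;                    // start at 10^(k-1)
    uint64_t powT = 1, sum = 0;
    for (unsigned i = 0; i < k; i++) {
        sum += pow10 * powT;            // 10^(k-1-i) * T^i
        pow10 /= 10;
        powT *= T;
    }
    return (T - 10) * sum;
}
```

For example, magic(2, 16) gives 156 and magic(4, 256) gives 4294957296, matching the tables above.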

Last Edited: Sat. Feb 6, 2021 - 01:53 PM

MrKendo wrote:

If I worked it out right, everything inside the brackets fits in 16 bits; you only need to go up to 32 bits for the *6 and the final add, so why not

```c
static uint32_t bin2bcd(uint16_t n)
{
    return n + 6ul * (n/10 + 16*(n/100) + 256*(n/1000) + 4096*(n/10000));
}
```

I got about 115us on atmega328p running at 8MHz.

You might want to double check it gives the right answer still.

Wow, that’s a big improvement.

Using this minor modification under GCC/AS7, the same "simple formula" now runs at typically 13.7 µs on an ATmega328P at 16 MHz!

I couldn’t believe it, but I checked several times that all 2^16 numbers are converted correctly in the given time (per number).

Thank you for this hint, MrKendo!

Based on the same idea I also checked El Tangas’ modified formula for unpacked BCD by using this code:

```c
static uint64_t bin2dec (uint16_t n)     // convert 16 bit binary n to 5 digit unpacked BCD
{
    return (n + 246ull * (n/10 + 256ul*(n/100) + 65536ul*(n/1000) + 16777216ul*(n/10000)));
}
```

It runs at typically 26.5 µs at 16 MHz, so the unpacked BCD formula takes 2x the time of the packed one.

Last Edited: Mon. Feb 8, 2021 - 06:07 PM

s'AVR wrote:
Using this minor modification under GCC/AS7, the same "simple formula" now runs at typically 13.7 µs on an ATmega328P at 16 MHz!

That seemed way faster than the 115 µs I was getting at 8 MHz.

However, I had compiled with -Os.

I just tried -O2; the code is a bit bigger, but lo and behold, I get about 26 µs at 8 MHz.

The generated code is doing something clever that i haven't figured out yet, but it has got rid of all the divides.

EDIT: -O3 is similar to -O2.

EDIT: it's effectively doing

/10000 as 839/128

/1000 as 8389/128

/100 as 5243/8

/10 as 52429/8

and this is all scaled by 2^16, so right-shift the above by 16:

839/128/65536 = 0.0001000165939

8389/128/65536 = 0.00100004673

5243/8/65536 = 0.01000022888

52429/8/65536 = 0.1000003815

Last Edited: Mon. Feb 8, 2021 - 10:25 PM

My guess is that it multiplies by 1/x, but not in a hacked way: if you scale the constant up to the largest numerator that fits, the result will be correct. For example, to divide a byte by 10 you could multiply by 26/256 = 0.1015625, but that won't be correct for every input; scale the 26 up to the highest numerator that fits, 205/2048 = 0.10009765625, and it is correct for all byte values.