Simple formula for binary to BCD conversion

Go To Last Post
28 posts / 0 new
Author
Message
#1
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Inspired by the interesting discussions about fast binary to decimal/BCD conversions, I thought about how such a conversion could be expressed with a simple formula.

Here it is for packed BCD encoding in C language (I would assume it is state of the art):

unsigned long bin2bcd (uint32_t n)     // convert 16 bit binary n to 5 digit BCD (packed)
{
    n = n + 6 * (n/10 + 16*(n/100) + 256*(n/1000) + 4096*(n/10000));
    return n;
}

Using GCC this function runs on an ATmega328P in typically 158 µsec  @16 MHz.

 

This is a timing envelope for converting all numbers 0 to 2^16-1:

 

 

The screenshot was taken using this program, which also lists the decimal and BCD numbers on an attached terminal:

    for (i = 0; i <= 65535; i++)
    {
        LED_ON;
        m = bin2bcd(i);
        LED_OFF;
        printf ("%5lu : %05lX\n", i, m);
    }  

 

As expected, the given function runs much slower with GCC than with the discussed routines written in AVR assembly language.

 

I would be interesting to know how fast this formula is executed when using FPGAs.

 

Of course you can check the formula with Excel as well.

For example if you got a decimal number in cell A2, the 5-digit BCD number can be represented in cell B2 as follows:

 

=DEC2HEX(A2+6*(INT(A2/10)+16*INT(A2/100)+256*INT(A2/1000)+4096*INT(A2/10000));5)

 

Enjoy!

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Well 2528 cycles certainly seems "dirt slow" compared to the others ...shows how a lot of cycles can get unnecessarily tossed into the trashcan.

At least that's a handy packed conversion & wasting cycles happens anyhow when a system is "doing nothing".

When in the dark remember-the future looks brighter than ever.   I look forward to being able to predict the future!

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

s'AVR wrote:
the given function runs much slower with GCC
There's four divisions in there. Why are you surprised?

 

But isn't this comparing apples and oranges? Surely the C code needs to express the same algorithm as being done in the Asm?

 

EDIT: for example this thread has a number of C algorithms that claim to be "fast" and on the whole avoid divisions:

 

https://www.avrfreaks.net/forum/...

Last Edited: Fri. Feb 5, 2021 - 10:45 AM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

s'AVR wrote:

I would be interesting to know how fast this formula is executed when using FPGAs.

 

Given a half-decent FPGA, it could process every bit in parallel, thus one cycle (with some daisy-chained LUTs, but still - a fistful of nanoseconds).  The output's only 20 bits.  At this point the speed limit is "How fast can you get the input data out of the AVR".  S.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

clawson wrote:
There's four divisions in there. Why are you surprised?

 

But isn't this comparing apples and oranges? Surely the C code needs to express the same algorithm as being done in the Asm?

 

Sure, I’m fully aware abouth the math behind the given formula and not surprised about the slow speed.

 

However, I wonder whether the simple formula (which is linear in certain sections) cannot be implemented in a more clever - but still simple - way (whether it is C or assembly) to speed it up.

 

And don’t worry, I got a collection of fast conversion routines written is assembly. It is just a general question.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Ok FPGA's are getting bigger  but I don't think you can get one that can hold everything in 1 clk.

 

but 3 clk is possible.

if the speed is 1GHz that would be about 300MHz output speed.

But 50-100MHz output speed would be realistic to make. 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

The way I looked at it, sparrow2, was that you have 20 output bits, and each can have 65,000(ish) inputs, therefore you need (absolute worst case!) about 1.4 million logic gates.  A midrange Xilinx Virtex chip these days has 2+ million gates, and the top-end of the line is close to 9 million.

 

This, of course, is entirely without optimization:  e.g. the three LSBs of the output are exactly the same as the LSBs of the input - so I just optimized away about 130,000 gates.  I'm sure there's many many many more that'll be redundant.  S.

 

ETA:  Of course, far cheaper than a big FPGA (those things can be hundreds of USD each) would be three big parallel EEPROMs (assuming 2^16x8) as a great thumping lookup table, and that would be one cycle.  S.

Last Edited: Fri. Feb 5, 2021 - 12:12 PM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I was thinking the way OP splved it

Then we need to define a gate, (normally 2 input NAND), and for speed one level deep, (or one LUT deep), but with 16 inputs it's a bigger than gate.

 

But if it's just one big LUT then it would be better with ROM(FLASH) 16 bit addr. and 20 bit out. or 192k flash so even a big AVR could solve that. So that would be about 2MHz output, (useless unless data in and/or out are moved via DMA).

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

s'AVR wrote:
the simple (sic?)  formula

So how are you defining "simple" here?

 

Clearly, for a lowly 8-bit processor like the AVR, division is not "simple" ...

Top Tips:

  1. How to properly post source code - see: https://www.avrfreaks.net/comment... - also how to properly include images/pictures
  2. "Garbage" characters on a serial terminal are (almost?) invariably due to wrong baud rate - see: https://learn.sparkfun.com/tutorials/serial-communication
  3. Wrong baud rate is usually due to not running at the speed you thought; check by blinking a LED to see if you get the speed you expected
  4. Difference between a crystal, and a crystal oscillatorhttps://www.avrfreaks.net/comment...
  5. When your question is resolved, mark the solution: https://www.avrfreaks.net/comment...
  6. Beginner's "Getting Started" tips: https://www.avrfreaks.net/comment...
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

One of my more fundamental assumptions here is that there is absolutely no reason to convert to BCD except for display, so there never is any reason to read the output of the LUT back into the AVR, and the resulting BCD must be sent out.  That given:

 

Writing out two bytes hex is by definition faster than writing out three bytes packed BCD (not even considering the AVR's internal gymnastics for pointing to and fetching from the LUT.  Three LPMs to get the result out of your flash lookup are nine cycles just by themselves).  Another advantage to external ROM is the outputs automatically show up parallel - 20 bits on 20 pins (24, actually, since nobody makes 20-bit output parallel EEPROMs).  You might want yet another 'latch' signal indicating that the AVR is done outputting bytes, but with a synchronized clock that could be overlooked.  That's three cycles, including 'latch'. 
 

A different problem is that those big parallel EEPROM chips are slooo...  70ns or worse.  If you want faster, get parallel SRAM and load them up with the LUT from an equally large serial EE every time you power-up.  FPGAs (at least Xilinx's ones) do that anyhow.  S.

 

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Scroungre wrote:

This, of course, is entirely without optimization:  e.g. the three LSBs of the output are exactly the same as the LSBs of the input - so I just optimized away about 130,000 gates.

 

Can you explain what you mean?  I don't see any 1:1 mapping of between the input bits of a binary number and the packed BCD output.

 

I have no special talents.  I am only passionately curious. - Albert Einstein

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

For example, if the least significant bit of the input hex is set, then the number is odd, and therefore the least significant bit of the packed BCD is therefore set - because it's also an odd number.  Numbers larger than that do require some gating.

 

eta: And the hex input numbers 0x0000 through 0x0009 are bit-equal to the packed BCD output of 00000 through 00009.  Does that help any?

  S.

Last Edited: Fri. Feb 5, 2021 - 01:21 PM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Scroungre wrote:

For example, if the least significant bit of the input hex is set, then the number is odd, and therefore the least significant bit of the packed BCD is therefore set - because it's also an odd number.  Numbers larger than that do require some gating.

 

eta: And the hex input numbers 0x0000 through 0x0009 are bit-equal to the packed BCD output of 00000 through 00009.  Does that help any?

  S.

 

Yes, the last bit for even/odd is obvious, but I was looking for 3 bits that mapped 1:1 given your comment, "the three LSBs of the output are exactly the same as the LSBs of the input "

I'm guessing you didn't mean that they are always the same?

 

I have no special talents.  I am only passionately curious. - Albert Einstein

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 1

One of my more fundamental assumptions here is that there is absolutely no reason to convert to BCD except for display

who do you expect to read the display at that speed?

 

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

ralphd wrote:

Yes, the last bit for even/odd is obvious, but I was looking for 3 bits that mapped 1:1 given your comment, "the three LSBs of the output are exactly the same as the LSBs of the input "

I'm guessing you didn't mean that they are always the same?

 

Well, clearly they are not.

0x10 = 00010000b

10    = 00001010b

 

edit: I think it's important to understand an algo before trying to optimize it. So I spent a while figuring out the OP algorithm, first by converting small numbers. For numbers up to 99, you need to add 6*(int(number/10) to get the BCD representation. Then it fails for 100, you need to add an extra 6*((int(number/100) << 4) to generate the high order BCD digit, and so on for each higher order nibble.

 

For example, for 523:

1) add 6*52

2) add (6*5) << 4

Last Edited: Fri. Feb 5, 2021 - 03:26 PM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I was interested in what the logic minimization equations are for each segment?  Is the total 200 terms? 1000 terms? 130000 terms?  Are there a lot of common terms that can be reused between segments to reduce the total number?   

What I'm saying, is if I gave you a handful of 3 input nand gates and some protoboard, what would it take to decode 16 input bits to around 35 output segment bits (5 digits, most significant digit, might not use all segments)?   Obviously there are old-school decoders for 4 bit to 7-seg.

I'm gonna warm up my soldering iron. 

When in the dark remember-the future looks brighter than ever.   I look forward to being able to predict the future!

Last Edited: Fri. Feb 5, 2021 - 03:54 PM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

So to make the actual C versus Asm comparison that OP is apparently trying to do in this thread what does the recently developed algo (from that other thread) look like simply expressed in C?

 

Obviously a C build will be slower/bloatier (it's kind of bound to be when compared to hand crafted C) but it would be interesting to see a true apples for apples comparison.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

clawson wrote:

 what does the recently developed algo (from that other thread) look like simply expressed in C?

 

From what I understood, the algo from the other thread was actually a fast division by 1000. It's use for BCD conversion was just an "excuse" to present the division method.

The algorithm from this thread is based on "there is some magic number we can add to an hex number to convert it to it's BCD representation. Here is a formula to calculate such number."

 

So they are not even that related... this one only has one instance of division by 1000, and replacing that division by the fast algo wouldn't make a big difference.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

El Tangas wrote:

ralphd wrote:

Yes, the last bit for even/odd is obvious, but I was looking for 3 bits that mapped 1:1 given your comment, "the three LSBs of the output are exactly the same as the LSBs of the input "

I'm guessing you didn't mean that they are always the same?

 

Well, clearly they are not.

0x10 = 00010000b

10    = 00001010b

 

edit: I think it's important to understand an algo before trying to optimize it. So I spent a while figuring out the OP algorithm, first by converting small numbers. For numbers up to 99, you need to add 6*(int(number/10) to get the BCD representation. Then it fails for 100, you need to add an extra 6*((int(number/100) << 4) to generate the high order BCD digit, and so on for each higher order nibble.

 

I did some experiments this morning, and came to the same conclusion.  Here's some prototype code from my notes:

 

cpi num, 10
brlo done
addi num, 6
cpi num, 26
brlo done
addi num, 6
...

 

I have no special talents.  I am only passionately curious. - Albert Einstein

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I discovered that the OP's formula can be written in a different way. I would explain how to derive it, but unfortunately I was writing that when I lost the whole post angry

 

Meanwhile, I leave this demo https://godbolt.org/z/cTWva9

 

The drop in replacement for the OP's function is:

 

// magic numbers
// tens: 6
// hundreds: 6*10 + 6*16 = 156
// thousands: 6*10*10 + 6*10*16 + 6*16*16 = 3096
// 10000s: 6*10*10*10 + 6*10*10*16 + 6*10*16*16 + 6*16*16*16 = 55536

#define M10 6
#define M100 156
#define M1000 3096
#define M10000 55536

unsigned long bin2bcd (uint32_t n)     // convert 16 bit binary n to 5 digit BCD (packed)
{
    unsigned long result = n;
    while (n >= 10000) {
        n -= 10000;
        result += M10000;
    }
    while (n >= 1000) {
        n -= 1000;
        result += M1000;
    }
    while (n >= 100) {
        n -= 100;
        result += M100;
    }
    while (n >= 10) {
        n -= 10;
        result += M10;
    }
    return result;
}

 

I didn't do any effort to optimize it, but I believe it should be somewhat faster, since there are no divisions, I used successive subtractions instead (I know, not the fastest method either).

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

s'AVR wrote:

unsigned long bin2bcd (uint32_t n)     // convert 16 bit binary n to 5 digit BCD (packed)
{
    n = n + 6 * (n/10 + 16*(n/100) + 256*(n/1000) + 4096*(n/10000));
    return n;
}

If i worked it out right, everything inside the brackets fits in 16 bits, you only need to go up to 32 bits for the * 6 and then the add, so why not

static uint32_t bin2bcd(uint16_t n)
{
    return n + 6ul * (n/10 + 16*(n/100) + 256*(n/1000) + 4096*(n/10000));
}

I got about 115us on atmega328p running at 8MHz.

You might want to double check it gives the right answer still.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Optimized repeated subtraction:

 

    ldi d5, -1 + '0'                   ; ten-thousands
1:  inc d5
    subi r20, lo8(10000)
    sbci r21, hi8(10000)
    brcc 1b

    ldi d4, 10 + '0'                   ; thousands
2:  dec d4
    subi r20, lo8(-1000)
    sbci r21, hi8(-1000)
    brcs 2b

...

 

I have no special talents.  I am only passionately curious. - Albert Einstein

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

This thread is only about packed BCD !

The code you show is a ASCII version (but could easy be changed to packed).

 

next step for a faster worst case is to deal with :

(1. digit special case)

 

deal with 4000 (and adjust with 4)

then deal with -1000 on same digit.

 

deal with 400 (and adjust with 4)

then deal with -100 on same digit.

 

...

...

I think all this is in a thread somewhere, but either for ASCII or unpacked BCD, but could easily be changed to packed (save a init instruction for every 2. digit).  

  

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

sparrow2 wrote:

This thread is only about packed BCD !

The code you show is a ASCII version (but could easy be changed to packed).

 

Yeah, just start off with:

ldi d5, -1

...

and instead of:

  ldi d4, 10 + '0'                   ; thousands

2:  dec d4

do:

 ldi d43, 0xA0                     ; thousands and hundreds

2: subi d43, 0x10

 

Then, as you pointed out, you save an init (ldi) on the hundreds

 

I have no special talents.  I am only passionately curious. - Albert Einstein

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 1

Yup. I've made a mathematical discovery (new to me at least) and this is something I cherish more than "optimization contests" and the like, because it happens only rarely.

 

This algorithm in fact works for packed or unpacked BCD the same, and even for any "source base" to "target base" conversion.

 

I already mentioned in post #20 the "magic numbers"

 

El Tangas wrote:

// magic numbers
// tens: 6
// hundreds: 6*10 + 6*16 = 156
// thousands: 6*10*10 + 6*10*16 + 6*16*16 = 3096
// 10000s: 6*10*10*10 + 6*10*10*16 + 6*10*16*16 + 6*16*16*16 = 55536

 

These are for packed BCD, in which source base = 10 and target base = 16. You can see the factors are all made up from these numbers; even the "6" is the difference between bases, i.e. 16-10.

 

For unpacked BCD, source base = 10 and target base = 256, therefore the magic numbers are:

 

// magic numbers
// tens: 246
// hundreds: 246*10 + 246*256 = 65436
// thousands: 246*10*10 + 246*10*256 + 246*256*256 = 16776216
// 10000s: 246*10*10*10 + 246*10*10*256 + 246*10*256*256 + 246*256*256*256 = 4294957296

example: https://godbolt.org/z/8Tn477

 

There are other interesting applications I'm exploring. I have to thank s'AVR yes

 

edit: I will use the OP's notation, as Excel example for unpacked BCD (as I mentioned before, my notation and the OP's are equivalent)

 

=DEC2HEX(A2+246*(INT(A2/10)+256*INT(A2/100)+256*256*INT(A2/1000)+256*256*256*INT(A2/10000));10)

Last Edited: Sat. Feb 6, 2021 - 01:53 PM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

MrKendo wrote:

If i worked it out right, everything inside the brackets fits in 16 bits, you only need to go up to 32 bits for the * 6 and then the add, so why not

static uint32_t bin2bcd(uint16_t n)
{
    return n + 6ul * (n/10 + 16*(n/100) + 256*(n/1000) + 4096*(n/10000));
}

I got about 115us on atmega328p running at 8MHz.

You might want to double check it gives the right answer still.

Wow, that’s a big improvement. cool

Using this minor modification under GCC/AS7, the same "simple formula" is now even running at typically 13,7 µsec on Atmega328P @16 MHz!

 

I couldn’t believe it, but checked several times that all 2^16 numbers are converting correctly in the given time (per number).

Thank you for this hint, MrKendo!

 

Based on the same idea I also checked El Tangas’ modified formula for unpacked BCD by using this code:

static uint64_t bin2dec (uint16_t n)     // convert 16 bit binary n to 5 digit unpacked BCD
{
    return (n + 246ull * (n/10 + 256ul*(n/100) + 65536ul*(n/1000) + 16777216ul*(n/10000)));
}

It’s running at typically 26,5 µsec @16 MHz. So the unpacked BCD formula takes 2x the time of the packed BCD formula.

Last Edited: Mon. Feb 8, 2021 - 06:07 PM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

s'AVR wrote:
Using this minor modification under GCC/AS7, the same "simple formula" is now even running at typically 13,7 µsec on Atmega328P @16 MHz!

 

That seemed way faster than the 115us i was getting at 8MHz.

However, i had compiled with -Os.

I just tried -O2, code is a bit bigger, but low and behold I get about 26us at 8MHz.

The generated code is doing something clever that i haven't figured out yet, but it has got rid of all the divides.

EDIT -O3 similar to -O2

 

EDIT

it's effectively doing

/10000 as 839/128

/1000 as 8389/128

/100 as 5243/8

/10 as 52429/8

 

and this is overall scaled by 2^^16

so right shift the above by 16

 

839/128/65536 = 0.0001000165939

8389/128/65536 = 0.00100004673

5243/8/65536 = 0.01000022888

52429/8/65536 = 0.1000003815

Last Edited: Mon. Feb 8, 2021 - 10:25 PM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

My guess is that it mul with 1/x , but not a hacked way, if you shift up to max number the result will be correct like:

 

div with 10 on a byte you mul with 26/256 = 0,1015625  that won't be correct, but if you shift the 26 up to the highest number that fit in a byte, 205/2048 = 0,10009765625 that will correct

for all combinations of byte size.   

 

add:

ok you just added it yourself ;)

Last Edited: Mon. Feb 8, 2021 - 10:30 PM