64-bit division, C/asm mix

Go To Last Post
9 posts / 0 new
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0


My project requires to program DDS FTW, which is a 28-bit word derived as:

FTW = Freq * 2**28 / Fmclk + 1

In reality, these numbers look like this (i'm using commas as thousand separator for readability):

FTW = 2,000,000 * 2**28 / 25,000,000 + 1

And those are fairly huge numbers. This calls for 64-bit integers. Now avr-gcc supports unsigned long long, aka uint64_t, but the code it produces seems not very nice to me. For example, unsigned division takes about 5kb of program space, which is unacceptable in my case. Maybe the code is optimized for speed, but somehow I have my doubts about that.

I tried several things, remembered division by corner, read some papers.. But eventually I've stumbled across this code, which is to divide 56 bit by 24 bit, almost suitable for me. The code has no author line, but it was referenced by someone named Peter Dannegger, somewhere in German-speaking forums. Anyway, the code is here:

Since my divider is 25,000,000 and this is 25 bit, I had to expand it for 32-bit divider. I adopted it to be usable within a C program too. I never did any C/assembly mixes before. I even did never do anything with AVR assembly before, so the implementation might seem very poor and this is the primary reason why I'm posting it here.

void div64(uint64_t* divident) {
	// r0:r3:	Remainder
	// r18:25: 	Dividend/Quotient
	// r8:11:	Divider (const)
	// r17:		Count

	asm volatile(
	"		ld r18, x+" 		"\n"
	"		ld r19, x+" 		"\n"
	"		ld r20, x+" 		"\n"
	"		ld r21, x+" 		"\n"
    "		ld r22, x+" 		"\n"
    "		ld r23, x+" 		"\n"
    "		ld r24, x+" 		"\n"
    "		ld r25, x+" 		"\n"

	"		ldi		r17, 0x40" 	"\n"
	"		mov		r8, r17" 	"\n"
	"		ldi		r17, 0x78" 	"\n"
	"		mov		r9, r17" 	"\n"
	"		ldi		r17, 0x7D" 	"\n"
	"		mov		r10, r17" 	"\n"
	"		ldi		r17, 0x01" 	"\n"
	"		mov		r11, r17" 	"\n"
	"div56:	clt" 		"\n"
	"		clr		r0" 		"\n"
	"		clr 	r1" 		"\n"
	"		clr		r2" 		"\n"
	"		clr		r3" 		"\n"
	"		ldi 	r17,56" 	"\n"
	"_div1:	lsl		r18" 		"\n"
	"		rol		r19" 		"\n"
	"		rol		r20" 		"\n"
	"		rol		r21" 		"\n"
	"		rol		r22" 		"\n"
	"		rol		r23" 		"\n"
	"		rol		r24" 		"\n"
	"		rol		r0" 		"\n"
	"		rol		r1" 		"\n"
	"		rol		r2" 		"\n"
	"		rol		r3" 		"\n"
	"		brcs	_div2" 		"\n"
	"		cp		r0, r8" 	"\n"
	"		cpc		r1, r9" 	"\n"
	"		cpc		r2, r10" 	"\n"
	"		cpc		r3, r11" 	"\n"
	"		brcs	_div3" 		"\n"
	"_div2:	sub		r0, r8" 	"\n"
	"		sbc		r1, r9" 	"\n"
	"		sbc		r2, r10" 	"\n"
	"		sbc		r3, r11" 	"\n"
	"		inc		r18" 		"\n"
	"		set" 		"\n"
	"_div3:	dec		r17" 		"\n"
	"		brne	_div1" 		"\n"

	"		clr 	__zero_reg__"	" ; temp++ \n"
	"		clc"				"\n"
	"		clr		__tmp_reg__"	"\n"
	"		inc		__tmp_reg__"	"\n"
	"		add		r18, __tmp_reg__"	"\n"
	"		adc		r19, __zero_reg__"	"\n"
	"		adc		r20, __zero_reg__"	"\n"
	"		adc		r21, __zero_reg__"	"\n"
	"		adc		r22, __zero_reg__"	"\n"
	"		adc		r23, __zero_reg__"	"\n"
	"		adc		r24, __zero_reg__"	"\n"
	"		adc		r25, __zero_reg__"	"\n"

	"		st		-x, r25" 	"\n"
	"		st		-x, r24" 	"\n"
	"		st		-x, r23" 	"\n"
	"		st		-x, r22" 	"\n"
	"		st		-x, r21" 	"\n"
	"		st		-x, r20" 	"\n"
	"		st		-x, r19" 	"\n"
	"		st		-x, r18" 	"\n"
    : :"x" (divident) : "memory", "r2", "r3", "r8", "r9", "r10", "r11", "r17", "r18", "r19", "r20", "r21", "r22", "r23", "r24", "r25");

So far I know that it was not a good idea to use r1 since it's a __zero_reg__ and some interrupt code might be counting on it. This code is not very universal because the divisor is always a constant and it's hard-coded in there. The "+1" code is there too - somehow avr-gcc's implementation of 64-bit "++" is not the smallest one around either, takes several pages of funny code.

Please share your criticism and suggestions/improvements! And if you know why avr-gcc 64-bit libs are so nasty, share that knowledge too.

P.S. I figure this better goes into the general AVR forum because it has to do with assembly, even though it's gcc-centric. But if you feel that it's rather for gcc forum, I don't mind it being moved.

The Dark Boxes are coming.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I addressed this exact issue recently. GCC's handling of 64 bit arithmetic is, to put it nicely, somewhat suboptimal. For a division , in addition to the 5k of code you mention, 256 bytes of ram are consumed! A quick look through the gcc and avrlibc code code revealed that generic routines were being used, which may well be OK on 32bit systems, are totally unusable on AVRs. For a DDS system I was working on I ended up extending a 24 div I found somewhere and extending it to 64 bits. I also wrote my own 64 bit add, subtract routines for similar reasons.

Using r1 (__zero_reg__) is not a problem, as gcc interrupt code will save and zero this register regardless of whether it is used in the interrupt.


Your message here - reasonable rates.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Thanks Jim! I had no idea about 256 bytes of ram, that's just awful! I just checked and turns out that simple ++ and << operations take awful lot of funny wrongdoing.

The Dark Boxes are coming.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Your code looks great for your initial foray into assembly.

Sizewise, it's roughly as small as it can get. There about 9-10 lines that can be removed.

With some algorithm alterations, it could execute much faster.

But first, some questions:

svofski wrote:
My project requires to program DDS FTW, which is a 28-bit word

Are you saying that FTW is 28 bits?
In the code posted above, you are calculating 56 bits.

From your description, I assume that Freq is a 28 bit number. Is that true?
What is it's minimum value?
Is the max value 0x0FFFFFFF?

[Edit] Thirdly:
I'm not sure of the calculation you require. Given the values from above ( Freq = 2,000,000 ; Fmclk = 25,000,000 ), what is the resulting value of FTW?


  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Thanks for the compliment. But the code, except for C-embedding and some initial values is not mine. The original is linked in my original posting. That explains its compactness :)

Yes, full 56 bits are calculated and probably the calculation can be cut down to lower 28 bits. 56 bits is there because the dividend is 50bits typical.

Some sample FTW values:
Freq = 2e6 Hz: FTW = 21474837 (0x147ae15)
Freq = 1e6 Hz: FTW = 10737419 (0xa3d70b)

The Dark Boxes are coming.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Ok, that makes sense.

Two points:
1) All you really need is the first 28 fractional bits of Freq/Fmclk.
2) Since you are throwing out the integer portion, I assume that it's always equal to zero and that Freq is always less than Fmclk.

Given those points, I suggest the following optimizations/alterations:
A) You can skip the integer calculation for the Quotient. That would be the first 28 loop-iterations! You only need 1 iteration for each fraction-bit, for a total of 28 iterations.
B) Since you only need the 28 fractional bits, the Quotient can be stored in a uint32_t.
C) Instead of passing the 56/64bit 'Divident' as a parameter, I would pass the 28bit 'Freq'. Then do the 'multiply by 2**28' within this function.
D) Declare your function as: uint32_t CalcFTW( uint32_t Freq ) I think that passing by value works out best in this case.
E) Implement the 'multiply by 2**28' as a no-op: Shift a zero into 'Divident' on each of the 28 iterations of the loop. Start the loop with: Divident = Freq.
F) No code anywhere in this function requires uint64_t. Use uint32_t or smaller.

And finally, I have a suggestion that's really just a personal preference. Since the entire function is written in assembly, why use inline-assembly? I'd suggest writing it in regular assembly, within a *.S source file. That would get rid of all of the "\n"s and other cruft. The only downside is that you need to be familiar with GCC's parameter passing and register use.

I've ready worked out the code. Should I post it? Or would you prefer the learning experience of working it out for yourself? :)


  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I thought it would be helpful to post my algorithm the form of C:

uint32_t CalcFTW( uint32_t Freq )
    uint32_t Dividend = Freq;
    uint32_t Quotient = 0;
    uint32_t Divisor = 25000000;

    uint8_t BitCount;

    for( BitCount=28; BitCount!=0; BitCount-- ) {
        Dividend <<= 1;
        Quotient <<= 1;

        if( Dividend >= Divisor ) {
            Dividend -= Divisor;
            Quotient |= 1;

    return Quotient + 1;

It turns out that GCC does an excellent job of compiling this! [at '-O2']

It executes in roughly 625 cycles.

Writing this directly in Asm would save ~30 cycles and a few instruction words. Deciding if it's worth the effort is up to you.


Edit: Upon closer inspection of the compiled code, I would say it's good, not excellent. Going to Asm would save over 100 cycles.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Brian, thanks a lot for this. Your algorithm looks excellent, and the ideas you use make me feel sorry that I haven't thought of that myself. So far I'm satisfied with what I have at the moment, but when time allows I will definitely take this exercize and try to implement this in assembly.

As for inline asm vs. standalone asm -- I agree, but I don't have any examples handy and I don't really know how to get function parameters and return value. That's why I used inline assembly.

The Dark Boxes are coming.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

You're welcome, svofski. It was fun problem that I couldn't resist.

Here's everything I know about passing and returning data between GCC and assembly functions.

The main gotcha is that 8bit parameters and return values are only placed in even registers. In addition, all 8bit return values must be sign or zero extended to 16bits.

Here's a quick example using the declaration for the CalcFTW function:

uint32_t CalcFTW( uint32_t Freq );

In this case, Freq would be passed in R22/23/24/25, and the return value would also be placed in R22/23/24/25.
There's no correlation between the the storage locations of the parameters and return values. They only happen to share the same registers because values are of the same size.