Huge code increase for 64-bit maths (long long)

Go To Last Post
13 posts / 0 new
Author
Message
#1
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Hi,
Had a quick look in the forums (but I got over 5000 hits for "long long", then figured that it wasn't quoting the search string properly) but could not see anything. :?:

The following converted from 32-bit to 64-bit, even using -Os, adds about 1K of code, which on an ATmega is significant.

Is there a way to get the compiler to use library routines for these, instead of in-lining or open-coding these, please?

Best regards,
Alf Lacis
Embedded Software Engineer
Rinstrum Pty Ltd
41 Success Street, Acacia Ridge QLD 4110 Australia
Ph: +61 7 3216 7166 Direct: 3710 8234
Fax: +61 7 3710 8232
Email: alf.lacis at rinstrum.com
Web: www.rinstrum.com

// Changing from 32-bit to 64-bit adds about 1K, even at -Os optimisation:
===========================================================================

    ullRecIDTail = ((int64u)pToggle->recIDTail) + (int64u)ULONG_MAX + 1ULL;
    ullXRecID    = (int64u)x.recID;
    ullVProbeID  = (int64u)pvProbe->recID;

    // adjust numbers if head wrapped around...
    if ( x.recID <= pToggle->recIDTail )
        ullXRecID += (int64u)ULONG_MAX + 1ULL;
    if ( pvProbe->recID <= pToggle->recIDTail )
        ullVProbeID += (int64u)ULONG_MAX + 1ULL;

    if ( ullVProbeID < ullXRecID )  
        ...
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I am assuming:
Your structure members are probably regular 16 or 32 bit quantities. You want to use 64 bit arithmetic for intermediate calculations.

You could declare your temporaries as 64bit and the assignments should cast automatically for you.

You appear to add 2^32. You could OR it if you wanted. Is it really necessary to have 64 bit precision ? Place this ADD operation into a function if you want to avoid the inline code.

HTH David.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Quote:
The following converted from 32-bit to 64-bit, even using -Os, adds about 1K of code, which on an ATmega is significant.

Significant, maybe. Huge, no, not for the functionality that you have invoked. A 64 bit math routine requires at least 12 registers, along with all the manipulation of all those registers. And you are appearing to want 64 bit math simply to add 1 bit of resolution. I would think that you could find a better way of doing this.

Regards,
Steve A.

The Board helps those that help themselves.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

What I really wanted to know was can I get the compiler/linker to librarify (yes, I know that's not a word... yet) the 64-bit functionality?

Because it appears to be inlined#, thus bulking everything out, especially since I need 64-bit addition, and 64-bit comparisons, etc, in a couple more places throughout the application.

Librarifying these would reduce the impact.

----------------
# I added the first 64-bit code similar to the above: code footprint grew by nearly a K.
I then added the code shown: code footprint grew by about another 900 bytes.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Do it yourself:

int64u Add64(int64u a, int64u b)
{
    return a + b;
}

char Compare64(int64u a, int64u b)
{
    if (a > b) return 1;
    else if (b > a) return -1;
    else return 0;
}

Regards,
Steve A.

The Board helps those that help themselves.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I assumed that there was a better way of doing it, by telling the Compiler & Linker:
"Please don't inline 64 bit stuff".

Anybody know how? Thanks in advance.

Also, thanks for the suggestion, I already thought of that: It makes the code look ugly and harder to debug (see below):

Ugly Source:

void ull_64_32(ULL *pto, ULONG from)
{
  *pto = (ULL)from;
}
void ull_64_plus_64(ULL *pto, ULL *pa, ULL *pb)
{
  *pto = *pa + *pb;
}
bool ull_64_le_64(ULL *pa, ULL *pb)
{
  return *pa <= *pb;
}
bool ull_64_lt_64(ULL *pa, ULL *pb)
{
  return *pa < *pb;
}
bool ull_64_in_between(ULL *pl_exclusive, ULL *pr_exclusive, ULL *px)
{
  return ull_64_lt_64(pl_exclusive, px) && ull_64_lt_64(px, pr_exclusive);
}

Ugly Header:

#ifndef ull_lib_dot_h
#define ull_lib_dot_h 1
#include "ezitypes.h"
#include "types.h"
void    ull_64_32        (ULL *pto, ULONG from);
void    ull_64_plus_64   (ULL *pto, ULL *pa, ULL *pb);
bool    ull_64_le_64     (ULL *pa, ULL *pb);
bool    ull_64_lt_64     (ULL *pa, ULL *pb);
bool    ull_64_in_between(ULL *pl_exclusive, ULL *pr_exclusive, ULL *px);
#define ull_64_ge_64(pa,pb)  ull_64_le_64(pb,pa)
#define ull_64_gt_64(pa,pb)  ull_64_lt_64(pb,pa)
#endif

[/u][/b]

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I quite see your point that your proposed code looks ugly.

Surely you can write your functions without the indirections. The compiler should pass the whole 64bits as a parameter. You may choose to pass the return value in a global variable.

I have not tested this. But it is worth trying. Anyway you only seem to be doing additions and comparisons, and one cast.

The cast should be handled by the compiler. The other operations are pretty cheap too. Can you not modularize your code so that you keep 32bits and use one carry variable.

David.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Hi, David,
Yes, I thought indirections, in calling a function, since passing an ATmega pointer is 2 bytes, whereas a passing a long long is 8 bytes?

I checked this with a test app, with optimisation = -Os. Application is:

#include "alf_defines.h"
#include "alflb.h"
#include "types.h"
#include "ull_lib_test.h"
unsigned long long a =  5ull;
unsigned long long b = 12ull;
unsigned long long c =  0ull;
int main(void (*ff)(void))
{
  ull_64_plus_64(&c, &a, &b);   // pass by pointer
  // pass result to another function to prevent removal by optimisation
  null(&c);
  c = ull_64_plus_64_value(a, b); // pass by value
  null(&c);
  while(1)
    ;
}

Addition by pointer, with void return results in 0x162 = 354 bytes of code:

.text          0x000002e4      0x162 ull_lib_ptr_add.o
                0x000002e4                ull_64_plus_64

Addtion by value, with value in the return results in 0x150 = 336 bytes of code (18 bytes smaller):

.text          0x00000194      0x150 ull_lib_value_add.o
                0x00000194                ull_64_plus_64_value

Where you save are the calls: pass by pointer, with void return is 0x12 = 18 bytes:

  ull_64_plus_64(&c, &a, &b);   // pass by pointer
  e4:	c2 e1       	ldi	r28, 0x12	; 18
  e6:	d1 e0       	ldi	r29, 0x01	; 1
  e8:	48 e0       	ldi	r20, 0x08	; 8
  ea:	51 e0       	ldi	r21, 0x01	; 1
  ec:	60 e0       	ldi	r22, 0x00	; 0
  ee:	71 e0       	ldi	r23, 0x01	; 1
  f0:	ce 01       	movw	r24, r28
  f2:	0e 94 72 01 	call	0x2e4	; 0x2e4  

Pass by value, with value in the return, 0x84 = 132 bytes:

  c = ull_64_plus_64_value(a, b); // pass by value
  fe:	20 91 08 01 	lds	r18, 0x0108
 102:	30 91 09 01 	lds	r19, 0x0109
 106:	40 91 0a 01 	lds	r20, 0x010A
 10a:	50 91 0b 01 	lds	r21, 0x010B
 10e:	60 91 0c 01 	lds	r22, 0x010C
 112:	70 91 0d 01 	lds	r23, 0x010D
 116:	80 91 0e 01 	lds	r24, 0x010E
 11a:	90 91 0f 01 	lds	r25, 0x010F
 11e:	20 90 00 01 	lds	r2, 0x0100
 122:	30 90 01 01 	lds	r3, 0x0101
 126:	40 90 02 01 	lds	r4, 0x0102
 12a:	50 90 03 01 	lds	r5, 0x0103
 12e:	60 90 04 01 	lds	r6, 0x0104
 132:	70 90 05 01 	lds	r7, 0x0105
 136:	80 90 06 01 	lds	r8, 0x0106
 13a:	90 90 07 01 	lds	r9, 0x0107
 13e:	a2 2e       	mov	r10, r18
 140:	b3 2e       	mov	r11, r19
 142:	c4 2e       	mov	r12, r20
 144:	d5 2e       	mov	r13, r21
 146:	e6 2e       	mov	r14, r22
 148:	f7 2e       	mov	r15, r23
 14a:	08 2f       	mov	r16, r24
 14c:	19 2f       	mov	r17, r25
 14e:	22 2d       	mov	r18, r2
 150:	33 2d       	mov	r19, r3
 152:	44 2d       	mov	r20, r4
 154:	55 2d       	mov	r21, r5
 156:	66 2d       	mov	r22, r6
 158:	77 2d       	mov	r23, r7
 15a:	88 2d       	mov	r24, r8
 15c:	99 2d       	mov	r25, r9
 15e:	0e 94 ca 00 	call	0x194	; 0x194 
 162:	20 93 12 01 	sts	0x0112, r18
 166:	30 93 13 01 	sts	0x0113, r19
 16a:	40 93 14 01 	sts	0x0114, r20
 16e:	50 93 15 01 	sts	0x0115, r21
 172:	60 93 16 01 	sts	0x0116, r22
 176:	70 93 17 01 	sts	0x0117, r23
 17a:	80 93 18 01 	sts	0x0118, r24
 17e:	90 93 19 01 	sts	0x0119, r25

Regs,
Alf

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

dear Alf,

Your results are fairly conclusive. Avoid long long.

Depending on your application (#calls, execution time, amount of flash space) you will choose indirection or pass by value.

From a readability point of view, I dislike the indirection. I would suggest that you hide this with macros. i.e. write your code as value, but the functions are implemented as macros. You can conditionally compile for either case.

From an application point of view, do you really need 64 bits ?

David.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

David,

I don't think "avoid long long": just balance the ease of development versus the size of the code footprint. It fits easily now, with room for debugging code, so it's all good.

I have a saying on my website: "High-level source code is meant to be created, read, understood and maintained by human engineers. If we were only interested in the way that computers did things, we'd still be programming in Assembler."

I had a requirement to do a binary search for Record Numbers on external Serial Flash. The Record Numbers, stored at the head of the Records, are unsigned longs, and after hitting ULONG_MAX, they wrap back to zero. The list is managed so that the newest(tail) entry doesn't overrun the oldest(head) entry. (The second layer which manages slot numbers I won't discuss here.)

So the Record Numbers could be in the range 1 billion to 3 billion -- easy, peasy as Jamie Oliver says.

When we have Record Numbers in the range ULONG_MAX minus some value to Zero plus some value, then things start getting interesting. e.g. 4 billion to 1 billion.

It was easiest to just say, "is this record number less than the oldest record? Then add 2^32.", which is ULONG_MAX+1.

Modified & paraphrased from Robert Sedgewick ("Algorithms in C") the pivotal test is:

    while ( right >= left ) ##
    {
        x = (left + right) / 2; ##
        if ( probeValue == x's key ) 
            return x's value;
        if ( probeValue < x's key ) ##
            right = x;  // Sedgewick uses x-1, but record x-1 has
        else            // not been read from Flash yet.
            left  = x;  // Sedgewick uses x+1: same comment.
    }

Each of the comparisons or arithmetic shown with ## are easier to code & debug if they have been 'normalised', from a wrap-around list, to an ordinal list of numbers: simplest solution is to use 64-bits. Now that I'm using the ULL library, I can fit the code in to the ATmega168. Also not shown is the insertion management code, which also does overrun and range checking, etc.

Also, not shown, are fetches from the Flash to get the addressed records, and to get left and right records, which is why I don't use x-1 & x+1, because that doubles the number of fetches, i.e., doubles the time to perform the binary search -- Sedgewick's algorithm assumed that fetches had zero time impact.

Any simpler, lighter, idea for normalising the Record Number range, for x, right & left?

Alf

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Quote:

...simplest solution is to use 64-bits. Now that I'm using the ULL library, I can fit the code in to the ATmega168. ...

These ARE MICROcontrollers after all. If you can be clever, then by all means squeeze it in. If you are that clever, then you should be able to manipulate the source (as you have posted) or roll-your-own if you think your compiler ain't good enough, or make your own long-long library and routines, and/or you own paradigm of how they are passed & processed and returned. Or help to improve the open-source compiler yourself.

Or re-think your app. "And if thine eye offend thee, pluck it out..." "Bank" your database or data set into 32-bit or even 16-bit pieces. Or other fresh-look approaches. Frankly, to me AVRs ain't even microPROCESSORs but rather microCONTOLLERs with a set of tasks that can conveniently be handled. My quick response is to "pluck it out" and use a superior tool for the task at hand.

Lee

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I understand that you have effectively in flash:
typedef struct {
ulong record_number, record_slot
} RECORD;

the beginning area of your flash is devoted to an array of these RECORDs sorted by record_number, with some housekeeping for the size of the array.

I presume that this array is sorted because otherwise you could not binary-search. Every insertion requires shuffling the RECORD array in the flash, but the slots are presumably just allocated in time order.

So you can identify page zero which holds record_numbers <= ULONG_MAX, and subsequent pages > ULONG_MAX. Just binary search page 0 and if not found binary search page 1 ...

Alternatively you can use a millenium strategy. records < 80 are obsolete and are re-used to refer to year 2000 .. 2079. You just have to have a special comparison routine to keep these records in order.

I know that this sounds like a kludge, but you are already differentiating between two types of ulong.

Anyway I am pleased that you have overcome your size problem.

David.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

David,
Yes, records arrive sorted, & are rejected if the Record ID is not +1 to +1000 from the previous last record.

The 'directory' in FRAM simply holds the head and tail slot numbers & head and tail Record ID Numbers.

The records themselves consist of a header (I called it a packet) which is:

typedef struct
{
  int16u  crcRecID;  // CRC of recID.
  RecID_t recID;     // Record ID number (not the location index).
  int16u  crcRecord; // CRC of the record buffer (payload). Does not include this packet header.
} DsdMemRecordPacket_t;

... and the payload, which is just an unsigned char pointer to a buffer.

When writing records to Flash, the next available slot number is calculated, and hence the Flash address. The DsdMemRecordPacket_t is written beginning at this Flash address, then the payload buffer.

Then the tail info (updated in the directory) is written to one of a pair of toggled FRAM locations. Then one byte consisting of the toggle (0 or 1) is written atomically to FRAM.

If there's an outage while writing a record, the next new record will find that the Flash footprint is not empty, erase it to 0x00's & skip it.

When doing a binary search, I look at the two 'well-known' locations from FRAM, & if it's not the head or the tail, can start doing a binary search, which is the time-expensive part. Since the device will only hold a few million records, the unsigned long recID will not wraparound more than once in storage.

When pulling records in from Flash, I only pull in crcRecID & recID, & look at the Record Number (recID).

Lee,
The project was started and stopped twice before I got to it. Hardware was set in place. Several prototypes were made up in a box. A requirement was "There will be no hardware changes." And "It will not lose records, except the last one during a powerfail". "It will not require reformatting once installed". "It is 'set & forget'." Etc, etc, etc. I did look at the ATmega to see what its instruction set supported by way of multi-register arithmetic, and the answer seems to be 'not much'. Hence the big code size. Which I now realise the reason for.

My question has morphed from "why is it so big" to "why not librarify the maths code", and then only cop one increase in code size, not 'N' copies of identical code, because the maths is done in op-codes, without subroutines. This same comment applies, by the way, to other maths, such as 'ordinary' 32 bit longs, but the impact is smaller due to the smaller number of registers.

Maybe it is an argument that GCC should spit out function calls, then do the math in a library.

Now... who do I see about changing that...?

Sorry for the long reply, but the project itself has a long history, and sometimes we just have to do like some proverb says: "Help me change the things I can change, and accept the things I cannot change, and give me the ability to tell the difference." (or something like that)

Anyway, I'm off to buy a lawnmower,
Ava good weekend,
Alf
http://alfredo4570.customer.netspace.net.au