How-To? Convert GCC InLine Arithmetic to a Function Library


Hi,

Does anyone have a how-to, or some web links, on making GCC call a function library for multi-register arithmetic instead of emitting the code in-line?

This was prompted by a project where I'm doing a lot of 32-bit arithmetic, along with some 64-bit ('long long'), and I've only got about 0.5 KB of flash left in the ATmega168 that we're using.

-And- the manager has added a couple of new software requirements!

And, according to the sales rep, the next ATmega device (the ATmega328P, which doubles the flash from 16K to 32K, etc.) won't be available until the end of the 1st quarter of 2008.

Best regards,
Alf Lacis
Embedded Software Engineer
Rinstrum Pty Ltd
41 Success Street, Acacia Ridge, QLD 4110, Australia
Ph: +61 7 3216 7166 Direct: 3710 8234
Fax: +61 7 3710 8232
Email: alf.lacis curly-at rinstrum.com
Web: www.rinstrum.com
____________________________________________________________


I've used:

-finline-limit=0

with GCC. Might help....

C: i = "told you so";


Hi, thanks, I saved 104 bytes, from 15624 down to 15520.

But that did not affect how GCC emits code for arithmetic.

I did learn, though, that the multiplication is already done in a library function. The catch is that the interface to the library is pass-by-value through registers, which results in a big overhead at every call site.

For example, a single 64-bit multiplication takes 96 bytes of loads and stores, not counting the 'call' itself:

  ullcee  = ullaye * ullbee;
    37e0:	a0 90 4a 01 	lds	r10, 0x014A
    37e4:	b0 90 4b 01 	lds	r11, 0x014B
    37e8:	c0 90 4c 01 	lds	r12, 0x014C
    37ec:	d0 90 4d 01 	lds	r13, 0x014D
    37f0:	e0 90 4e 01 	lds	r14, 0x014E
    37f4:	f0 90 4f 01 	lds	r15, 0x014F
    37f8:	00 91 50 01 	lds	r16, 0x0150
    37fc:	10 91 51 01 	lds	r17, 0x0151
    3800:	20 91 42 01 	lds	r18, 0x0142
    3804:	30 91 43 01 	lds	r19, 0x0143
    3808:	40 91 44 01 	lds	r20, 0x0144
    380c:	50 91 45 01 	lds	r21, 0x0145
    3810:	60 91 46 01 	lds	r22, 0x0146
    3814:	70 91 47 01 	lds	r23, 0x0147
    3818:	80 91 48 01 	lds	r24, 0x0148
    381c:	90 91 49 01 	lds	r25, 0x0149
    3820:	0e 94 ee 1e 	call	0x3ddc	; 0x3ddc <__muldi3>
    3824:	20 93 d4 03 	sts	0x03D4, r18
    3828:	30 93 d5 03 	sts	0x03D5, r19
    382c:	40 93 d6 03 	sts	0x03D6, r20
    3830:	50 93 d7 03 	sts	0x03D7, r21
    3834:	60 93 d8 03 	sts	0x03D8, r22
    3838:	70 93 d9 03 	sts	0x03D9, r23
    383c:	80 93 da 03 	sts	0x03DA, r24
    3840:	90 93 db 03 	sts	0x03DB, r25

Compare that with a 32-bit multiplication, where the code apart from the call is 48 bytes:

  ulcee   = ulaye  * ulbee;
    3700:	60 91 3a 01 	lds	r22, 0x013A
    3704:	70 91 3b 01 	lds	r23, 0x013B
    3708:	80 91 3c 01 	lds	r24, 0x013C
    370c:	90 91 3d 01 	lds	r25, 0x013D
    3710:	20 91 3e 01 	lds	r18, 0x013E
    3714:	30 91 3f 01 	lds	r19, 0x013F
    3718:	40 91 40 01 	lds	r20, 0x0140
    371c:	50 91 41 01 	lds	r21, 0x0141
    3720:	0e 94 f3 26 	call	0x4de6	; 0x4de6 <__mulsi3>
    3724:	60 93 d0 03 	sts	0x03D0, r22
    3728:	70 93 d1 03 	sts	0x03D1, r23
    372c:	80 93 d2 03 	sts	0x03D2, r24
    3730:	90 93 d3 03 	sts	0x03D3, r25

Compare this to two routines: ulmult() and ullmult()

void ulmult(unsigned long *cee, unsigned long *aye, unsigned long *bee)
{
  *cee = *aye * *bee;
}
void ullmult(unsigned long long *cee, unsigned long long *aye, unsigned long long *bee)
{
  *cee = *aye * *bee;
}

which pass values by address:

  ulmult(&ulcee, &ulaye, &ulbee);
    3806:	4e e3       	ldi	r20, 0x3E	; 62
    3808:	51 e0       	ldi	r21, 0x01	; 1
    380a:	6a e3       	ldi	r22, 0x3A	; 58
    380c:	71 e0       	ldi	r23, 0x01	; 1
    380e:	80 ed       	ldi	r24, 0xD0	; 208
    3810:	93 e0       	ldi	r25, 0x03	; 3
    3812:	0e 94 b8 1a 	call	0x3570	; 0x3570 

  ullmult(&ullcee, &ullaye, &ullbee);
    38e6:	4a e4       	ldi	r20, 0x4A	; 74
    38e8:	51 e0       	ldi	r21, 0x01	; 1
    38ea:	62 e4       	ldi	r22, 0x42	; 66
    38ec:	71 e0       	ldi	r23, 0x01	; 1
    38ee:	84 ed       	ldi	r24, 0xD4	; 212
    38f0:	93 e0       	ldi	r25, 0x03	; 3
    38f2:	0e 94 cf 1a 	call	0x359e	; 0x359e 

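The overhead at each call site drops from 96 or 48 bytes to a consistent 12 bytes, regardless of the size of the arguments, since we are only passing three 2-byte pointers rather than the whole values in a bunch of registers. The loads and stores still exist, but only once, inside each wrapper, so the saving only shows up when the same kind of operation is used in several places.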
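
As a rough sketch of how the wrappers above might be hardened, the version below marks them `noinline` so GCC cannot expand them back into each call site, and makes the inputs `const`. The fixed-width types from `<stdint.h>` and the `mult_wrappers_check()` helper are my own assumptions for illustration (on AVR, `unsigned long` is 32 bits and `unsigned long long` is 64 bits, so the behaviour should match the originals); this is not measured on the ATmega168.

```c
#include <assert.h>
#include <stdint.h>

/* noinline is a GCC function attribute: it stops the compiler from
   inlining the wrapper, which would bring the per-call register
   loads and stores back to every call site. */
__attribute__((noinline))
void ulmult(uint32_t *cee, const uint32_t *aye, const uint32_t *bee)
{
    *cee = *aye * *bee;   /* 32-bit product, wraps modulo 2^32 */
}

__attribute__((noinline))
void ullmult(uint64_t *cee, const uint64_t *aye, const uint64_t *bee)
{
    *cee = *aye * *bee;   /* 64-bit product, wraps modulo 2^64 */
}

/* Hypothetical self-check, not part of the original code. */
static void mult_wrappers_check(void)
{
    uint32_t a32 = 100000UL, b32 = 3UL, c32 = 0;
    uint64_t a64 = 0x100000000ULL, b64 = 5ULL, c64 = 0;

    ulmult(&c32, &a32, &b32);
    ullmult(&c64, &a64, &b64);

    assert(c32 == 300000UL);
    assert(c64 == 0x500000000ULL);
}
```

The same pattern would apply to division, addition and so on: one small non-inlined wrapper per operation, with every caller passing three pointers.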

Any ideas?

Alf


Alf,

I am tempted to say that you may have to change compilers. You are going to be forever squeezed by code size.

Since you only really want 33-bit arithmetic, why not use a structure: { ulong base, char bit32 }

Build your library with structure pointers rather than long long pointers. Nasty casts could be done with unions.

You are always going to have to watch type conversions. With a structure, the compiler should catch them for you. With long long, values are converted implicitly without any warning.
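
A minimal sketch of what such a structure-based library could look like, assuming a 33-bit unsigned value and showing addition only. The names `u33_t`, `u33_add` and `u33_add_check` are hypothetical, invented for this illustration, and the code size on AVR has not been measured.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 33-bit unsigned value: 32 low bits plus one top bit. */
typedef struct {
    uint32_t base;    /* bits 0..31 */
    uint8_t  bit32;   /* bit 32, always 0 or 1 */
} u33_t;

/* sum = a + b, modulo 2^33. Operands are passed by address,
   and sum is allowed to alias a or b. */
void u33_add(u33_t *sum, const u33_t *a, const u33_t *b)
{
    uint32_t lo    = a->base + b->base;           /* wraps modulo 2^32 */
    uint8_t  carry = (uint8_t)(lo < a->base);     /* 1 if the low part wrapped */
    uint8_t  hi    = (uint8_t)((a->bit32 + b->bit32 + carry) & 1u);

    sum->base  = lo;
    sum->bit32 = hi;
}

/* Hypothetical self-check, not part of any real library. */
static void u33_add_check(void)
{
    u33_t a = { 0xFFFFFFFFUL, 0 };
    u33_t b = { 1UL, 0 };
    u33_t c;

    u33_add(&c, &a, &b);          /* 2^32 - 1 + 1 = 2^32 */
    assert(c.base == 0UL);
    assert(c.bit32 == 1);
}
```

Because `u33_t` is a struct, something like `unsigned long x = c;` fails to compile, which is the type checking mentioned above; a plain long long would be narrowed silently.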

David.


Hi, David,
The company already has 20+ projects which use WinAVR, so they're not going to change the build environment for one project with some 64-bit difficulties. I've got the code in there now: there's just no room for expansion (using the ATmega168) unless we go up to the unreleased ATmega328 (1st quarter of 2008).
We're actually doing some integration testing now, but thanks for the 33-bit idea.
Thanks & Goodbye,
Alf Lacis
http://alfredo4570.customer.netspace.net.au