Poor code generation for Cortex-M0 ARM cores


Using GCC ARM Embedded from https://launchpad.net/gcc-arm-em... (which is what Arduino and everyone else uses) produces some very poor code density for Cortex-M0 and M0+ at the moment:

 

https://bugs.launchpad.net/gcc-a...

 

What it is doing is producing redundant entries in the literal tables (which consumes excessive code space) and then emitting otherwise unnecessary loads to read those values from the table, which reduces performance and increases register pressure.

 

This is an example of what it's doing:

#include <stdint.h>
#include <stdbool.h>

int main (void)
{
    const uint16_t p1 = 0x1234;
    const uint16_t p2 = 0x9876;
    const uint32_t p3 = 0x12349876;

    const uint32_t p4 = 0x21;
    const uint32_t p5 = 0x33;

    volatile uint16_t* const first16 = (uint16_t*)(0x40002800U);
    volatile uint32_t* const first   = (uint32_t*)(0x40002800U);
    volatile uint32_t* const second  = (uint32_t*)(0x40002804U);
    volatile uint32_t* const third   = (uint32_t*)(0x40002808U);
    volatile uint32_t* const fourth  = (uint32_t*)(0x4000280CU);

    *first16 = p1;
    *first   = p1;
    *first16 = p2;
    *second  = p3;
    *third   = p4;
    *fourth  = p5;

    while (true) {}
}

At -Os, for Cortex-M0+, this generates:

00000118 <main>:
    volatile uint32_t* const first   = (uint32_t*)(0x40002800U);
    volatile uint32_t* const second  = (uint32_t*)(0x40002804U);
    volatile uint32_t* const third   = (uint32_t*)(0x40002808U);
    volatile uint32_t* const fourth  = (uint32_t*)(0x4000280CU);

    *first16 = p1;
 118:	4b07      	ldr	r3, [pc, #28]	; (138 <main+0x20>)
 11a:	4a08      	ldr	r2, [pc, #32]	; (13c <main+0x24>)
 11c:	801a      	strh	r2, [r3, #0]
    *first   = p1;
 11e:	601a      	str	r2, [r3, #0]
    *first16 = p2;
 120:	4a07      	ldr	r2, [pc, #28]	; (140 <main+0x28>)
 122:	801a      	strh	r2, [r3, #0]
    *second  = p3;
 124:	4a07      	ldr	r2, [pc, #28]	; (144 <main+0x2c>)
 126:	4b08      	ldr	r3, [pc, #32]	; (148 <main+0x30>)
 128:	601a      	str	r2, [r3, #0]
    *third   = p4;
 12a:	2221      	movs	r2, #33	; 0x21
 12c:	4b07      	ldr	r3, [pc, #28]	; (14c <main+0x34>)
 12e:	601a      	str	r2, [r3, #0]
    *fourth  = p5;
 130:	4b07      	ldr	r3, [pc, #28]	; (150 <main+0x38>)
 132:	3212      	adds	r2, #18
 134:	601a      	str	r2, [r3, #0]
 136:	e7fe      	b.n	136 <main+0x1e>
 138:	40002800 	.word	0x40002800
 13c:	00001234 	.word	0x00001234
 140:	ffff9876 	.word	0xffff9876
 144:	12349876 	.word	0x12349876
 148:	40002804 	.word	0x40002804
 14c:	40002808 	.word	0x40002808
 150:	4000280c 	.word	0x4000280c

 

All of the addresses in the table at the end of the function, after the first, are not required.  They can all be generated using immediate offsets from the first, which would save 12 bytes from the table itself, plus at least the three instructions at 126, 12C and 130, for a further 6-byte saving.  The whole routine is only 60 bytes, so a saving of 18 bytes is pretty huge (a third).  This is a pretty trivial example, but it is FAR from optimal and will result in larger and slower programs for M0+ cores like the D21/L21, etc.  This problem does not occur with the latest GCC ARM Embedded for the M3 and M4 cores, only the M0, M0+ and M1.

 

If you are using M0-type cores, like the D21 on the Arduino Zero, and would like them to generate faster, denser code, you can do yourself a favour and up-vote the bug report so it gets more attention.

 

Last Edited: Thu. Oct 15, 2015 - 11:31 PM

I'm not sure that this is an optimization that I expect the compiler to make, although I'm not exactly sure why not (Hmm.  Perhaps because addresses are normally subject to resolution at link time?)

If you use structure-like definitions for the I/O registers, things will be optimized the way you want.  https://www.avrfreaks.net/comment...
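
Something along these lines, for example (just a sketch; the peripheral and register names here are made up, not from a real device header):

#include <stdint.h>

typedef struct {
    volatile uint32_t CTRLA;    /* offset 0x00 */
    volatile uint32_t CTRLB;    /* offset 0x04 */
    volatile uint32_t DATA;     /* offset 0x08 */
} FakePeriph;                   /* hypothetical peripheral layout */

#define FAKE_PERIPH ((FakePeriph *)0x40002800U)

void fake_init(void)
{
    /* One base address; each member access is base + constant offset. */
    FAKE_PERIPH->CTRLA = 0x80001001u;
    FAKE_PERIPH->CTRLB = 0x80001010u;
    FAKE_PERIPH->DATA  = 0x80001100u;
}

The idea is that the compiler only sees one base address plus small constant offsets, so it can keep the base in a register and use base-plus-offset stores.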

 


I made a similar complaint about the PIC32 compiler (also gcc), but it turns out that in its case the I/O register addresses WEREN'T defined until link time, so that was a lost cause :-(

 


If it generates particular optimisations for Cortex-M3, I would expect the same optimisations for M0, unless of course the instructions aren't available.

Bob.


If it generates particular optimisations for Cortex M3

Ah.  I missed  that part.  It does.

I see that M0 has a much more restricted set of values for the offset than M3...  Maybe that has something to do with it?

 


So I have reduced my test code to explore the issue further.  I think it's a different problem from the one in the bug report I added to, although it is similar.

 

It is, as far as I am concerned, a bug in the optimiser for M0/M0+/M1.

 

Take this as an example (all are built using -Os):

/* Write 32 bit values to known register locations - using an array */
void test2(void)
{
    const uint32_t v1 = 0x80001001; // First Value
    const uint32_t v2 = 0x80001010; // Second Value
    const uint32_t v3 = 0x80001100; // Third Value

    volatile uint32_t* const r = (uint32_t*)(0x40002800U); /* Register Array*/

    r[0] = v1;
    r[1] = v2;
    r[2] = v3;
}

For Cortex-M0+, GCC 4.9 2015q3 generates this code (44 bytes long):

 

0000016c <test2>:
    const uint32_t v2 = 0x80001010; // Second Value
    const uint32_t v3 = 0x80001100; // Third Value

    volatile uint32_t* const r = (uint32_t*)(0x40002800U); /* Register Array*/

    r[0] = v1;
 16c:	4a04      	ldr	r2, [pc, #16]	; (180 <test2+0x14>)
 16e:	4b05      	ldr	r3, [pc, #20]	; (184 <test2+0x18>)
 170:	601a      	str	r2, [r3, #0]
    r[1] = v2;
 172:	4a05      	ldr	r2, [pc, #20]	; (188 <test2+0x1c>)
 174:	4b05      	ldr	r3, [pc, #20]	; (18c <test2+0x20>)
 176:	601a      	str	r2, [r3, #0]
    r[2] = v3;
 178:	4a05      	ldr	r2, [pc, #20]	; (190 <test2+0x24>)
 17a:	4b06      	ldr	r3, [pc, #24]	; (194 <test2+0x28>)
 17c:	601a      	str	r2, [r3, #0]
}
 17e:	4770      	bx	lr
 180:	80001001 	.word	0x80001001
 184:	40002800 	.word	0x40002800
 188:	80001010 	.word	0x80001010
 18c:	40002804 	.word	0x40002804
 190:	80001100 	.word	0x80001100
 194:	40002808 	.word	0x40002808

If I change it to build for an M3, it generates this (24 bytes long):

 

00000160 <test2>:
    const uint32_t v2 = 0x80001010; // Second Value
    const uint32_t v3 = 0x80001100; // Third Value

    volatile uint32_t* const r = (uint32_t*)(0x40002800U); /* Register Array*/

    r[0] = v1;
 160:	4b03      	ldr	r3, [pc, #12]	; (170 <test2+0x10>)
 162:	4a04      	ldr	r2, [pc, #16]	; (174 <test2+0x14>)
 164:	601a      	str	r2, [r3, #0]
    r[1] = v2;
 166:	320f      	adds	r2, #15
 168:	605a      	str	r2, [r3, #4]
    r[2] = v3;
 16a:	32f0      	adds	r2, #240	; 0xf0
 16c:	609a      	str	r2, [r3, #8]
 16e:	4770      	bx	lr
 170:	40002800 	.word	0x40002800
 174:	80001001 	.word	0x80001001

 

ALL of those instructions are available on the Cortex-M0, so as far as I am concerned GCC is emitting bad code for the M0, because it CAN emit good code for a related processor without using any instructions that aren't available on the M0.  I accept that the M3 has a broader range of instructions, but this test case doesn't cause the compiler to emit them.  There is no VALID technical reason the compiler didn't generate this exact same code for M0, except for a bug.

 

The following case is particularly concerning for me, because the test code is just an array of bytes: if GCC can't optimise array accesses into base-plus-offset stores then things are pretty hopeless.  I tested this with an array of uint8_t like this:

 

/* Write 8 bit values to known register locations - using an array */
void test6(void)
{
    volatile uint8_t* const r = (uint8_t*)(0x40002800U); // Register Array

    r[0]   = 0xFF;
    r[1]   = 0xFE;
    r[2]   = 0xFD;
    r[3]   = 0xFC;
    r[4]   = 0xEE;
    r[8]   = 0xDD;
}

For M0 GCC produces this (62 bytes):

000001fc <test6>:
/* Write 8 bit values to known register locations - using an array */
void test6(void)
{
    volatile uint8_t* const r = (uint8_t*)(0x40002800U); // Register Array

    r[0]   = 0xFF;
 1fc:	22ff      	movs	r2, #255	; 0xff
 1fe:	4b09      	ldr	r3, [pc, #36]	; (224 <test6+0x28>)
 200:	701a      	strb	r2, [r3, #0]
    r[1]   = 0xFE;
 202:	4b09      	ldr	r3, [pc, #36]	; (228 <test6+0x2c>)
 204:	3a01      	subs	r2, #1
 206:	701a      	strb	r2, [r3, #0]
    r[2]   = 0xFD;
 208:	4b08      	ldr	r3, [pc, #32]	; (22c <test6+0x30>)
 20a:	3a01      	subs	r2, #1
 20c:	701a      	strb	r2, [r3, #0]
    r[3]   = 0xFC;
 20e:	4b08      	ldr	r3, [pc, #32]	; (230 <test6+0x34>)
 210:	3a01      	subs	r2, #1
 212:	701a      	strb	r2, [r3, #0]
    r[4]   = 0xEE;
 214:	4b07      	ldr	r3, [pc, #28]	; (234 <test6+0x38>)
 216:	3a0e      	subs	r2, #14
 218:	701a      	strb	r2, [r3, #0]
    r[8]   = 0xDD;
 21a:	4b07      	ldr	r3, [pc, #28]	; (238 <test6+0x3c>)
 21c:	3a11      	subs	r2, #17
 21e:	701a      	strb	r2, [r3, #0]
}
 220:	4770      	bx	lr
 222:	46c0      	nop			; (mov r8, r8)
 224:	40002800 	.word	0x40002800
 228:	40002801 	.word	0x40002801
 22c:	40002802 	.word	0x40002802
 230:	40002803 	.word	0x40002803
 234:	40002804 	.word	0x40002804
 238:	40002808 	.word	0x40002808

and for M3 we got this (32 bytes):

000001c8 <test6>:
/* Write 8 bit values to known register locations - using an array */
void test6(void)
{
    volatile uint8_t* const r = (uint8_t*)(0x40002800U); // Register Array

    r[0]   = 0xFF;
 1c8:	4b06      	ldr	r3, [pc, #24]	; (1e4 <test6+0x1c>)
 1ca:	22ff      	movs	r2, #255	; 0xff
 1cc:	701a      	strb	r2, [r3, #0]
    r[1]   = 0xFE;
 1ce:	22fe      	movs	r2, #254	; 0xfe
 1d0:	705a      	strb	r2, [r3, #1]
    r[2]   = 0xFD;
 1d2:	22fd      	movs	r2, #253	; 0xfd
 1d4:	709a      	strb	r2, [r3, #2]
    r[3]   = 0xFC;
 1d6:	22fc      	movs	r2, #252	; 0xfc
 1d8:	70da      	strb	r2, [r3, #3]
    r[4]   = 0xEE;
 1da:	22ee      	movs	r2, #238	; 0xee
 1dc:	711a      	strb	r2, [r3, #4]
    r[8]   = 0xDD;
 1de:	22dd      	movs	r2, #221	; 0xdd
 1e0:	721a      	strb	r2, [r3, #8]
 1e2:	4770      	bx	lr
 1e4:	40002800 	.word	0x40002800

For M0, every single array entry has its own address stored in the literal table.  I wouldn't be thrilled to see this even with optimisation turned off, let alone with -Os.  No one is going to convince me this isn't broken code generation for M0.  For M3, it produced nice base-plus-offset stores and DIDN'T use any instructions not available on the M0.  I will put more details in the bug report I am going to raise in the ARM GCC bug tracker, and I will post a link to it once I raise it.  I am just working on a minimal test case to demonstrate the problem for the report.

 

These arrays and values are constants.  This is not a linker issue: all of the addresses are known at compile time, and the only thing the compiler doesn't know is the address of the literal table itself, which it doesn't need to know because it uses PC-relative addressing to access it anyway.  Also, the linker can't be responsible for this difference in code generated between the M0 and M3, given the exact same input.


Looks like a solid analysis. I can't see any reason why this optimisation could not be generated. The allowed addressing modes with ldr/str are different between M0 and M3, but the M0 should allow it (ref http://infocenter.arm.com/help/i...)

 

I had a quick look through the ARM GCC backend and I couldn't find where these optimisations are generated. My guess is there is a base description for M0 instructions (v6m), and an extension for M3 (v7m) which includes extra optimisations.

Bob.


OK, I filed the bug report.  Really, the M0 code generation is terrible and it seems completely unnecessary.  My test case does not generate any M3-specific instructions when I build for M3, and I verified that by assembling the generated code with GAS as M0; it assembles fine.

 

The bug report is here.

https://bugs.launchpad.net/gcc-a...

 

The problems I identified lead to excessive and uncontrollable code size increases, and performance is also negatively affected by the excessive loads from flash.

 

I tested 6 different access patterns to memory-mapped registers (rough sketches of the union-based variants follow the list):

Test 1 - Fixed pointers to known locations
Test 2 - Accessing the registers as an array
Test 3 - Accessing the registers as a structure
Test 4 - Accessing as an array where the array element (register) is a union type
Test 5 - Accessing the registers as a structure where each register is a union type
Test 6 - Just writing contiguous bytes in memory as an array
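
For anyone reading without the attachment, the union-based variants (tests 4 and 5) are shaped roughly like this (simplified sketches; the type and function names are illustrative, the attached file has the real tests):

#include <stdint.h>

typedef union {
    uint32_t reg;       /* whole-register access */
    uint16_t half[2];   /* 16-bit halves */
    uint8_t  byte[4];   /* individual bytes */
} Reg32;

/* Test 4 style: array of union-typed registers at a fixed address. */
void test4_like(void)
{
    volatile Reg32* const r = (Reg32*)(0x40002800U);
    r[0].reg = 0x80001001u;
    r[1].reg = 0x80001010u;
    r[2].reg = 0x80001100u;
}

/* Test 5 style: structure whose members are union-typed registers. */
typedef struct {
    Reg32 first;
    Reg32 second;
    Reg32 third;
} RegBlock;

void test5_like(void)
{
    volatile RegBlock* const b = (RegBlock*)(0x40002800U);
    b->first.reg  = 0x80001001u;
    b->second.reg = 0x80001010u;
    b->third.reg  = 0x80001100u;
}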

 

All of them generate sub-optimal code, for various reasons, when compared to the code generated for the M3.

 

For example, my six-test object file compiles to 184 bytes for M3, but for M0 it compiles to 308 bytes.  And I repeat, the M3 code doesn't use any instruction or addressing mode not also available on the M0.

 

This was tested with the 4.9-2015q3 build of the tools from GCC ARM Embedded.  I am travelling and don't have good internet, so I can't download GCC HEAD and build my own toolchain to see if it's fixed there.  It would be awesome if anyone who has newer tools could run the test case (which I attached to this post) to verify whether it's fixed in a later version or not.

 

Otherwise, like I said in my first post, anyone who cares at all about code generation for Cortex M0/M0+/M1 would be doing themselves a favour by up-voting the bug so it gets appropriate attention.

 

Now I need to write some assembler macros so I can try to manually work around this in my code (at least partially), in advance of a fixed version of the compiler, which may or may not ever eventuate.

Attachment(s): 


Upvoted :)

 

Hopefully it gets a better response than "change your code"!  If ARM want the M0 to replace 8-bitters, then it seems like these easy improvements should be a priority.

 

I've been digging through the backend code https://gcc.gnu.org/viewcvs/gcc/..., somewhere in there must be a rule that applies the optimisation to M3 but not M0.

Bob.


donotdespisethesnake wrote:

Upvoted :)

Appreciate it, thanks.

 

Quote:

Hopefully it gets a better response than "change your code"! If ARM want M0 to replace 8 bitters, then it seems like these easy improvements should be a priority.

I would think so too, and given that M0 SoCs tend to be more memory-constrained than their M3 cousins, throwing away code space and performance like this doesn't help the image of the M0.

 

Quote:

I've been digging through the backend code https://gcc.gnu.org/viewcvs/gcc/..., somewhere in there must be a rule that applies the optimisation to M3 but not M0.

I hope you find something....  I would join you but my internet is way too slow at the moment.

 

 


Ok,

 

Now I am convinced that the GCC optimiser for Cortex-M0 is not only very bad, it is buggy.  And if you don't want to read all this, skip to the end for a potential workaround...

 

I have a test which just writes consecutive locations in memory, using an array (actually a pointer and offset, but it's the same thing).  It looks like this:

 

/* Write 8 bit values to known register locations - using an array */
void test6(void)
{
    volatile uint8_t* const r = (uint8_t*)(0x40002800U); // Register Array

    r[0]   = 0xFF;
    r[1]   = 0xFE;
    r[2]   = 0xFD;
    r[3]   = 0xFC;
    r[4]   = 0xEE;
    r[8]   = 0xDD;
    r[12]  = 0xCC;
}

 

Now, in my opinion the compiler should generate the address of r[0] once and then, where it can, use base-plus-offset addressing to access the elements of the array.  Instead the compiler generates a unique entry in the literal table for each array access, like this:

 

000000ec <test6>:
  ec:	22ff      	movs	r2, #255	; 0xff
  ee:	4b0a      	ldr	r3, [pc, #40]	; (118 <test6+0x2c>)
  f0:	701a      	strb	r2, [r3, #0]
  f2:	4b0a      	ldr	r3, [pc, #40]	; (11c <test6+0x30>)
  f4:	3a01      	subs	r2, #1
  f6:	701a      	strb	r2, [r3, #0]
  f8:	4b09      	ldr	r3, [pc, #36]	; (120 <test6+0x34>)
  fa:	3a01      	subs	r2, #1
  fc:	701a      	strb	r2, [r3, #0]
  fe:	4b09      	ldr	r3, [pc, #36]	; (124 <test6+0x38>)
 100:	3a01      	subs	r2, #1
 102:	701a      	strb	r2, [r3, #0]
 104:	4b08      	ldr	r3, [pc, #32]	; (128 <test6+0x3c>)
 106:	3a0e      	subs	r2, #14
 108:	701a      	strb	r2, [r3, #0]
 10a:	4b08      	ldr	r3, [pc, #32]	; (12c <test6+0x40>)
 10c:	3a11      	subs	r2, #17
 10e:	701a      	strb	r2, [r3, #0]
 110:	4b07      	ldr	r3, [pc, #28]	; (130 <test6+0x44>)
 112:	3a11      	subs	r2, #17
 114:	701a      	strb	r2, [r3, #0]
 116:	4770      	bx	lr
 118:	40002800 	.word	0x40002800
 11c:	40002801 	.word	0x40002801
 120:	40002802 	.word	0x40002802
 124:	40002803 	.word	0x40002803
 128:	40002804 	.word	0x40002804
 12c:	40002808 	.word	0x40002808
 130:	4000280c 	.word	0x4000280c

Which is just massively wasteful of code space.  In this example GCC knows EVERYTHING there is to know about the array: it knows its address, and it knows we are accessing it with constant offsets, but it still decides to generate a unique address entry for each array access.

 

So I made another test and moved the array to the low memory address 0x10 (forget that we can't actually write to those locations because they are in flash; GCC doesn't know that and doesn't care).  I didn't change anything except the base address of the array, which is now small enough to fit in an 8-bit immediate.
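
The modified test is just test6 with the new base address, i.e. essentially this:

/* Write 8 bit values to low memory locations - using an array */
void test7(void)
{
    volatile uint8_t* const r = (uint8_t*)(0x10U); // low base; otherwise identical to test6 above

    r[0]   = 0xFF;
    r[1]   = 0xFE;
    r[2]   = 0xFD;
    r[3]   = 0xFC;
    r[4]   = 0xEE;
    r[8]   = 0xDD;
    r[12]  = 0xCC;
}

And this is what GCC generates for it: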

 

00000134 <test7>:
 134:	2310      	movs	r3, #16
 136:	22ff      	movs	r2, #255	; 0xff
 138:	701a      	strb	r2, [r3, #0]
 13a:	3a01      	subs	r2, #1
 13c:	705a      	strb	r2, [r3, #1]
 13e:	3a01      	subs	r2, #1
 140:	709a      	strb	r2, [r3, #2]
 142:	3a01      	subs	r2, #1
 144:	70da      	strb	r2, [r3, #3]
 146:	3a0e      	subs	r2, #14
 148:	711a      	strb	r2, [r3, #4]
 14a:	3a11      	subs	r2, #17
 14c:	721a      	strb	r2, [r3, #8]
 14e:	3a11      	subs	r2, #17
 150:	731a      	strb	r2, [r3, #12]
 152:	4770      	bx	lr

In this case, the compiler actually generated optimal code.  But the only change was the start address of the array, so why GCC decided to now use indexed addressing when before it refused to is a mystery.

 

But it gets stranger.  I then tried to set the base address so it isn't so low (doesn't fit in 8 bits) but is still quite low.  So I set it to 0x200 and this is what we get:

 

00000154 <test8>:
 154:	2380      	movs	r3, #128	; 0x80
 156:	22ff      	movs	r2, #255	; 0xff
 158:	009b      	lsls	r3, r3, #2
 15a:	701a      	strb	r2, [r3, #0]
 15c:	4b07      	ldr	r3, [pc, #28]	; (17c <test8+0x28>)
 15e:	3a01      	subs	r2, #1
 160:	701a      	strb	r2, [r3, #0]
 162:	4b07      	ldr	r3, [pc, #28]	; (180 <test8+0x2c>)
 164:	3a01      	subs	r2, #1
 166:	701a      	strb	r2, [r3, #0]
 168:	4b06      	ldr	r3, [pc, #24]	; (184 <test8+0x30>)
 16a:	3a01      	subs	r2, #1
 16c:	701a      	strb	r2, [r3, #0]
 16e:	3a0e      	subs	r2, #14
 170:	705a      	strb	r2, [r3, #1]
 172:	3a11      	subs	r2, #17
 174:	715a      	strb	r2, [r3, #5]
 176:	3a11      	subs	r2, #17
 178:	725a      	strb	r2, [r3, #9]
 17a:	4770      	bx	lr
 17c:	00000201 	.word	0x00000201
 180:	00000202 	.word	0x00000202
 184:	00000203 	.word	0x00000203

So in this case GCC CAN generate some of the addresses using offsets from an existing base, but not others, for no apparent logical reason.  I am always fearful when a tool produces illogical results.

 

Albert Einstein is quoted as saying "the definition of insanity is doing something over and over again and expecting a different result."  Well, in this case we are doing the same thing over and over again, and we can not only expect different results, we are guaranteed to get different results.  Insanity???

 

So I tried one last test: this time I declared the array as extern, so the compiler doesn't know its base address.  It could be low, it could be high, it could be anywhere.  And what does the compiler generate?

00000188 <test9>:
 188:	22ff      	movs	r2, #255	; 0xff
 18a:	4b08      	ldr	r3, [pc, #32]	; (1ac <test9+0x24>)
 18c:	681b      	ldr	r3, [r3, #0]
 18e:	701a      	strb	r2, [r3, #0]
 190:	3a01      	subs	r2, #1
 192:	705a      	strb	r2, [r3, #1]
 194:	3a01      	subs	r2, #1
 196:	709a      	strb	r2, [r3, #2]
 198:	3a01      	subs	r2, #1
 19a:	70da      	strb	r2, [r3, #3]
 19c:	3a0e      	subs	r2, #14
 19e:	711a      	strb	r2, [r3, #4]
 1a0:	3a11      	subs	r2, #17
 1a2:	721a      	strb	r2, [r3, #8]
 1a4:	3a11      	subs	r2, #17
 1a6:	731a      	strb	r2, [r3, #12]
 1a8:	4770      	bx	lr
 1aa:	46c0      	nop			; (mov r8, r8)
 1ac:	00000000 	.word	0x00000000
			1ac: R_ARM_ABS32	rx

The code it should have generated for each and every one of these test cases, that is what.  This is a bug; it isn't just poor optimisation, it's buggy and it's bad.  Cortex-M0 code is being needlessly and excessively bloated by this code generation, and every developer is paying the price in reduced performance and larger applications.  I actually suspect part of the reason ASF seems so bloated on M0 is precisely this code generation.  The reason...

 

ASF declares ALL its registers as constants at known addresses.  The presumption of the programmer, I assume, is that it's better that way because the compiler can optimise the code.  That would be a reasonable assumption if the compiler were actually behaving properly (sanely?), when the reality, as I have demonstrated, is that it is significantly worse: every access to the memory-mapped registers is going to be slower and consume more code space than if the registers were defined in the linker.  It would be better to define the registers as an external base address (not known at compile time) plus a known offset for each individual register.  But that would require a re-write of ASF.
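
Something along the lines of this sketch (the symbol and macro names are hypothetical, and I haven't proven this shape against the compiler for every access pattern):

#include <stdint.h>

/* Defined elsewhere (another translation unit, assembly file or linker), e.g.
   volatile uint8_t* const periph0_base = (uint8_t*)0x40002800U;
   The point is that the compiler cannot see the value at compile time. */
extern volatile uint8_t* const periph0_base;

/* Each register is the unknown base plus a known constant offset. */
#define PERIPH0_REG32(offset)  (*(volatile uint32_t*)(periph0_base + (offset)))

void periph0_setup(void)    /* hypothetical usage */
{
    PERIPH0_REG32(0x00) = 0x80001001u;
    PERIPH0_REG32(0x04) = 0x80001010u;
    PERIPH0_REG32(0x08) = 0x80001100u;
}

The cost is one extra load of the base pointer per function (as in the test9 output above), but after that the accesses become base-plus-offset stores instead of one literal-table entry per register.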

 

Now that I know this, for my own low-level driver library I will be declaring registers exactly like that.  I don't think it will always work; I suspect the buggy optimiser will still play havoc with the code generation, but it's the only workaround I can find.


I am still not very familiar with the gcc backend (the machine description is a language all by itself), but I did notice a parameter "pool_range".  Without being sure, I think it is used to specify how far away the constant pool can be from the current instruction.  From the symptoms you describe, it is as if it is calculating the distance from 0x0000 instead of from the current PC, so it optimises for low addresses but not high addresses.

 

That does sound like a bug, I can't think why it would be intentional behaviour.

Bob.


The ARM GCC maintainers are looking into the bug now, which is good.  But there is definitely something weird going on, and it does seem to have something to do with a base of 0x00, or at least with 8-bit values.

For example, it's more efficient to generate a constant that is close to another constant by just adding or subtracting an 8-bit value from a pre-existing constant, and GCC will do this on Cortex-M0, but only if the constant is a very low number (fits in 8 bits).

If you have a constant 0x10 and the code needs a constant 0x2C, it will just add 0x1C to the 0x10 constant and it's got the new constant, using two bytes of code space and one cycle of CPU time.

But if you have the constant 0x80000010 in a register and need 0x8000002C, it won't add 0x1C to it; instead it creates a literal-table entry for it and loads it, which needs 6 bytes of code space plus the memory load time to execute (2 cycles plus memory wait states).  That's three times the code space and at least twice the processing time.
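
If anyone wants to reproduce the comparison, a minimal pair of functions like these should show it (the names are made up); build with -Os for cortex-m0 and cortex-m3 and compare the two:

#include <stdint.h>

volatile uint32_t sink;     /* hypothetical volatile destination */

void low_pair(void)         /* constants 0x10 and 0x2C, 0x1C apart */
{
    sink = 0x10u;
    sink = 0x2Cu;
}

void high_pair(void)        /* constants 0x80000010 and 0x8000002C, also 0x1C apart */
{
    sink = 0x80000010u;
    sink = 0x8000002Cu;
}

Per the behaviour described above, the low pair should get the cheap movs/adds treatment on M0, while the high pair ends up as two separate literal-pool loads.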

 

Hopefully the bug report sees this sort of thing addressed, because I believe it will yield big decreases in code size for M0 programs, and also speed improvements.

 

Fingers Crossed.


One technique I've observed with IAR is that it will load an immediate 8-bit value and shift it if the number is suitable.


Yes, I have seen that optimisation as well with Cortex-M3, but with M0 it is not consistently applied when it can be.  M3 code generation is much better at applying these optimisations consistently.

 

Realistically, M0 code generation should be identical to M3, except where the 32-bit Thumb-2 instructions that the M0+ doesn't have are required.  That should be the only difference between the two.


Realistically, M0 code generation should be identical to M3

I dunno.  They seem to be much more different than you would expect.

    ldr r1, [r2, #offset]
// Legal values for "offset" on CM3: -255 to 4095
//                           on CM0:  (0 to 31) * argsize

    adds r1, r1, #imm
// Legal values for "imm" on CM3: 0 to 4095
//                        on CM0: 0 to 255 (different for "add" and
//                                other special cases)

(This is looking at instruction set documents from two different vendors, so I COULD just be reading it wrong or something.  I found this really surprising; like most people, I thought that CM3 was pretty much a strict superset of the CM0+...)  (I suppose it means that on CM3 you "unexpectedly" end up with 32-bit Thumb-2 instructions (that M0 doesn't support) more often than you thought you would...)

 


I actually said:

strontiumdog wrote:

Realistically, M0 code generation should be identical to M3, except when the 32 bit thumb2 instructions are required that the M0+ doesn't have.

 

M0 IS a strict subset of M3:

from http://community.arm.com/docs/DOC-7034 :

Quote:

"each architecture is a superset of those below it in the range."

 

The test case I produced clearly showed that GCC CAN generate much better code for M0/M0+ than it currently does.  I compiled for M3, changed the CPU type in the generated assembly to M0 and then assembled the result; the assembler does not complain.  This proves that the generated code uses ONLY the M0 subset of the M3 instruction set.  Accordingly, there is no reason why, when I compile that code for M0, I shouldn't get the same result.  If I change my tests, I can make that assembly verification fail, because the compiler will emit M3-specific instructions when that is necessary to generate better code.  But for each of my tests the generated code uses ONLY the strict M0 subset.  My contention is that GCC shouldn't generate WORSE code for M0 than for M3 UNLESS it is required to use the M3 superset to be optimal.

 

Specific example: M0 and M3 can both load an 8-bit constant with a 16-bit instruction like this:

MOVS <Rd>,#<imm8>

The M3 can also load an 8-bit constant using

MOVW<c> <Rd>,#<imm16>

That is a 32-bit instruction.  I suspect people would be screaming blue murder if GCC emitted this instruction to load 8 bits on the M3: it's a 32-bit instruction when the 16-bit instruction can do the same thing.  Well, the situation on M0 is MUCH, MUCH worse, as my test case demonstrates.

 

There is even a nice picture showing the strict subsets of the Cortex-M family:

 

 


I guess I wasn't expecting certain immediate values or offsets to push me into thumb-2 territory.  It's certainly not apparent on the pretty picture...

 

(I don't suppose there is a nice table somewhere that DOES break out this sort of information across cortex-m lines?)

 

(Blah: SO many special cases.  ADDS can have an 8-bit immediate value anywhere, but ADD (as a special case of ADD<c>) can only do so inside an IT block.)

 

Last Edited: Fri. Oct 9, 2015 - 05:02 AM

This piqued my interest, so here's what I found so far (all work done on gcc HEAD).

 

The reorg pass places constant addresses in the literal pool and modifies instructions to load from them (arm_reorg in gcc/config/arm/arm.c). The code there sees if there is a constant operand involving memory access in an insn, and if yes, marks it for pooling later. This is why you're not seeing the problem if the address is not known at compile time (extern, in your example) - there is no constant operand.

 

So why doesn't it always happen? It turns out the postreload pass that runs before it adjusts some, but not all, insns to reuse the existing value (address) in the register and just bump up the offset. I still haven't figured out why it isn't done consistently; I'll get back once I do.

Also, cortex-m3 is armv7-m, whereas cortex-m0 is armv6-m, and armv6-m allows only plain 8-bit immediate loads - no rotated or repeated-byte encodings, etc. Maybe that causes a cascading effect somewhere down the line?

Regards

Senthil

 

blog | website


Thanks for finding this out.

 

I am back in my office next week, and it's high on my to-do list to build an ARM Embedded GCC from HEAD, see if it behaves the same, and then look into this code.

 

 

 

 

 


Found out why postreload does the insn transform for certain insns only.

 

thumb1_size_rtx_costs sets the size to 8 if the immediate value being loaded satisfies the J or K constraint (the constant is in the range -1 to -255, or is in the range 0 to 255 multiplied by any power of 2). This makes postreload prefer the alternative rtx of adding the offset to the existing register value. For the others, the size cost (and speed cost) of both alternatives works out to be the same, i.e. 4, and postreload keeps the original rtx that loads the immediate constant. This explains why, in your example, some constant addresses were kept in the literal pool whereas others were computed.

 

Not sure what the best way is to fix this, though. I don't know why costs are higher (8) for immediate constants that can actually be loaded easily, and lower for those that require some computation to load, although that turned out to help improve code in this case. Best left to the ARM maintainers, I guess.

Regards

Senthil

 

blog | website