Using GCC ARM Embedded from https://launchpad.net/gcc-arm-em... (which is what Arduino and everyone else uses) currently produces very poor code density for Cortex-M0 and M0+:
https://bugs.launchpad.net/gcc-a...
It emits redundant entries in the literal tables, which wastes code space, and then emits otherwise unnecessary instructions to load those values from the table, which reduces performance and increases register pressure.
Here is an example of what it's doing:
#include <stdbool.h>
#include <stdint.h>

int main (void)
{
    const uint16_t p1 = 0x1234;
    const uint16_t p2 = 0x9876;
    const uint32_t p3 = 0x12349876;
    const uint32_t p4 = 0x21;
    const uint32_t p5 = 0x33;

    volatile uint16_t* const first16 = (uint16_t*)(0x40002800U);
    volatile uint32_t* const first   = (uint32_t*)(0x40002800U);
    volatile uint32_t* const second  = (uint32_t*)(0x40002804U);
    volatile uint32_t* const third   = (uint32_t*)(0x40002808U);
    volatile uint32_t* const fourth  = (uint32_t*)(0x4000280CU);

    *first16 = p1;
    *first   = p1;
    *first16 = p2;
    *second  = p3;
    *third   = p4;
    *fourth  = p5;

    while (true) {}
}
At -Os, for Cortex-M0+, this generates:
00000118 <main>:
    volatile uint32_t* const first = (uint32_t*)(0x40002800U);
    volatile uint32_t* const second = (uint32_t*)(0x40002804U);
    volatile uint32_t* const third = (uint32_t*)(0x40002808U);
    volatile uint32_t* const fourth = (uint32_t*)(0x4000280CU);
    *first16 = p1;
 118: 4b07      ldr   r3, [pc, #28] ; (138 <main+0x20>)
 11a: 4a08      ldr   r2, [pc, #32] ; (13c <main+0x24>)
 11c: 801a      strh  r2, [r3, #0]
    *first = p1;
 11e: 601a      str   r2, [r3, #0]
    *first16 = p2;
 120: 4a07      ldr   r2, [pc, #28] ; (140 <main+0x28>)
 122: 801a      strh  r2, [r3, #0]
    *second = p3;
 124: 4a07      ldr   r2, [pc, #28] ; (144 <main+0x2c>)
 126: 4b08      ldr   r3, [pc, #32] ; (148 <main+0x30>)
 128: 601a      str   r2, [r3, #0]
    *third = p4;
 12a: 2221      movs  r2, #33 ; 0x21
 12c: 4b07      ldr   r3, [pc, #28] ; (14c <main+0x34>)
 12e: 601a      str   r2, [r3, #0]
    *fourth = p5;
 130: 4b07      ldr   r3, [pc, #28] ; (150 <main+0x38>)
 132: 3212      adds  r2, #18
 134: 601a      str   r2, [r3, #0]
 136: e7fe      b.n   136 <main+0x1e>
 138: 40002800  .word 0x40002800
 13c: 00001234  .word 0x00001234
 140: ffff9876  .word 0xffff9876
 144: 12349876  .word 0x12349876
 148: 40002804  .word 0x40002804
 14c: 40002808  .word 0x40002808
 150: 4000280c  .word 0x4000280c
None of the addresses in the table at the end of the function, apart from the first, are required. They can all be reached using immediate offsets from the first, which would save 12 bytes in the table itself and at least the 3 load instructions at 126, 12c and 130, for a further 6 bytes. The whole routine is only 60 bytes, so a saving of 18 bytes is pretty huge (30%).
This is a pretty trivial example, but the output is FAR from optimal, and the same behaviour will result in larger and slower programs for M0+ cores like the D21/L21, etc. The problem does not occur with the latest GCC ARM Embedded for the M3 and M4 cores, only for M0, M0+ and M1.
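To illustrate what "immediate offsets from the first" means at the source level, here is a minimal sketch. The helper name write_regs is hypothetical and not from the bug report, and the two 16-bit stores are left out for brevity. It makes the same four 32-bit stores through a single base pointer, so the registers sit at byte offsets 0, 4, 8 and 12 from the base, all within the 0 to 124 byte range that a Thumb STR with an immediate word offset can encode. This only shows the addressing pattern; it is not a confirmed workaround, and if the call is inlined with a constant base, GCC may well split the addresses into separate literals again.

```c
#include <stdint.h>

/* Hypothetical helper (not from the bug report): performs the four 32-bit
 * stores from the example through a single base pointer, so each register
 * is addressed as base + 0, 4, 8 or 12 bytes. */
void write_regs(volatile uint32_t *base)
{
    base[0] = 0x1234U;      /* first,  offset 0x0 */
    base[1] = 0x12349876U;  /* second, offset 0x4 */
    base[2] = 0x21U;        /* third,  offset 0x8 */
    base[3] = 0x33U;        /* fourth, offset 0xC */
}
```

On the target this would be called as write_regs((volatile uint32_t *)0x40002800U); on a host it can be pointed at an ordinary four-word array to check the stores.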
If you are using M0-type cores, like the D21 on the Arduino Zero, and would like the compiler to generate faster, denser code for them, you can do yourself a favour and up-vote the bug report so it gets more attention.