Compiler optimisation oddities

Go To Last Post
28 posts / 0 new
Author
Message
#1
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

 

I have a habit of examining the assembly output of my code in understanding what the compiler produces and I have managed to pick up a few problems / bugs in my C code. However, I have come across this bit a generated code and am curious why the compiler chose to do this: movw r24,r0 and then movw r30, r24 a couple of lines down. Please see the assembly output below compiled with the Release mode in AVR Studio 7 but the optimisation strategy set to -Os

 

 14c: 55 e0        ldi r21, 0x05 ; 5
 14e: 85 9f        mul r24, r21
 150: c0 01        movw r24, r0
 152: 11 24        eor r1, r1
 154: fc 01        movw r30, r24
 156: e0 50        subi r30, 0x00 ; 0
 158: ff 4f        sbci r31, 0xFF ; 255
 15a: 80 91 33 01  lds r24, 0x0133 ; 0x800133 <_ZL6s_tick>
 15e: 80 83        st Z, r24
 160: 61 83        std Z+1, r22 ; 0x01
 162: 42 83        std Z+2, r20 ; 0x02
 164: 34 83        std Z+4, r19 ; 0x04
 166: 23 83        std Z+3, r18 ; 0x03
 168: 08 95        ret

The compiler appears to waste 1 word of memory as the contents of r24 is not used again. Surely the compiler could have issued movw r30, r0 on the third line and removed the movw r30, r24 on the fifth line.

 

I wouldn't normally worry about such issues but given that I have to pack a lot of code into 3K of memory, I want to be able to leverage as much as I can.

 

Also I can probably remove the eor r1, r1 because the result of mul will never exceed 255 but I don't seem to know how I can guarantee this fact to the compiler. The __builtin_unreachable() compiler hint doesn't seem to work. For reference the size of the s_timerCallbacks struct is 5 bytes and c_numberCallbackTimers is 10.

 

void timer_scheduleCallback(uint8_t callbackID, uint8_t duration, uint8_t count, timerCallbackFn fn) {
   
    if (callbackID >= c_numberCallbackTimers) __builtin_unreachable();

    s_timerCallbacks[callbackID].start = s_tick;
    s_timerCallbacks[callbackID].duration = duration;
    s_timerCallbacks[callbackID].count = count;
    s_timerCallbacks[callbackID].fn = fn;
}

Am I missing something here or is this a strange oddity that I should probably fix after compilation if required.

 

Cheers

Adam Hamilton

Last Edited: Mon. Mar 27, 2017 - 01:03 AM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

adam3141 wrote:
I want to be able to leverage as much as I can.

That is one of the reason I strictly use assembly as it affords me complete control over flow of program and I can do things a compiler just can't.

 

Here I'm modifying R20 in an ISR

WDT_ISR:
    // Normally in any ISR registers that are altered should be preserved,
    //  SREG. In this simple example it doesn't matter.

        dec 	R20   		; Decrement count
        sbi 	PINB, PINB1  	; Toggle indicator LED
        reti

 and in the loop I wait for R20

Loop: // Infinite loop waiting for user to press button

 	sbi 	PINB, PINB0  		; Enable push button, LED is on
 	sei    				; Re-enable interrupts 
 	sleep    			; Wait for hit on INT0

 	or 	R20, R20  		; ISR modifies R20 so poll register
 	brne 	PC - 1   		; until count is exhausted.
 
    // Functionally equivalent to code on pg 52, except I didn't reset
    // WDRF in MCUSR because WDT isn't set for RESET.

 	cli    				; Disable interrupts
 	wdr
 	ldi 	TEMP, T_Seq  		; WDCE or WDE
 	sts 	WDTCSR, TEMP  		; Begin timed sequence
 	sts 	WDTCSR, R20  		; Turn off WDT
  
 	rjmp Loop   			; Wait for another press on button

Whether this is good practice or not, is not the point, that it can't be done in C (this exact way) is.  One way or another you have to become familiar with the instruction set, which it seems you have some grasp of anyway, so it's a matter of are you going to learn AVR or C.  To-date, my benchmarks have demonstrated a 27% saving in space using assembly vs compiler, but it must be noted, I'm not very good at C.

 

Last Edited: Mon. Mar 27, 2017 - 04:07 AM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

The compiler may need to extend the values to an int. This generates unnecessary code when dealing with bytes. I dare say the IAR compiler is more aggressive with how it generates code, but you pay a premium.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Very interesting use of the watchdog timer.

For simple projects, developing in assembly language is useful and is definitely worth doing when you're learning so you get a really solid understanding of the microcontroller. For a lot of my projects, developing in pure assembly would be some kind of hellish nightmare to say the least. I use C mainly because the compiler can optimise (99% of the time) a lot of things for me while making my code readable to others who will have to review my code. I do note that some constructs that generate the same intent will generate different assembly, I will sometimes surround certain constructs in #if #elif blocks to see which construct generates the smallest code for instance.

​I just wonder whether or not I can convince the compiler that I know more than it in particular cases.

Cheers.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

​Kartman wrote:

The compiler may need to extend the values to an int. This generates unnecessary code when dealing with bytes.

I usually spot this type of unnecessary code generation in the assembly output by the excessive use of addw, movw and the like for which I look back through my code and go, I forgot to cast that intermediate result to a uint8_t

In one of my projects I had this

return (flags & (uint8_t)(1 << flag)) != 0;

Because I forgot to add the (uint8_t) cast, it was converting the result of (1 << flag) to a 16 bit value and propagating that through, something I probably should have realised but being that flags itself was only 8 bits then bits 8-15 of the intermediate were likely to be thrown out anyway so a smart optimising compiler would never have followed through with the 16 bit calculations.

Cheers

Last Edited: Mon. Mar 27, 2017 - 05:05 AM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

adam3141 wrote:
For a lot of my projects, developing in pure assembly would be some kind of hellish nightmare to say the least.

I don't know that I would agree 100% with that statement, but, change to a different processor, collaborate or come back in a year a look at your code, now that's a different story. It is for that reason I adopt this methodology.

IRQ0_ISR:
 // Enable WDT with desired pre-scaler of 131072 cycles / tick.

 	wdr
 	ldi 	R20, T_Seq  			; WDCE | WDE 
 	ldi 	TEMP, 0b1000110  		; WDIE = 1 and WDP 2:1 = 1, 3 & 0 = 0
 	sts 	WDTCSR, R20  			; Begin timed squence
 	sts 	WDTCSR, TEMP  			; Table 11-2 pg 55

and

/*
    The push button is going to be rigged like a deadmans switch. While the 
    indicator LED is on and when the button is released, the LED will go out
    and the WDT will start blinking the LED at near 1 sec intervals
    
 	PWR <- PB0 controls indicator light on button
 	SIG -> PD2 when released will trigger INT0
 	GND -> Any of the ground pins.   
*/
 	sbi 	PORTD, PORTD2  			; Pullup INT0 for falling edge.
 	ldi 	TEMP, 0b0010  			; Table 31-2 pg 71 INT0 on falling edge
 	sts 	EICRA, TEMP  			; Set interrupt 0 sense control 69H
 	ldi 	TEMP, 1
 	out 	EIMSK, TEMP  			; Enable INT0 (13.2.2 pg 72)

Where I've essentially traded writing a book versus using C, with lots of description an especially references to datasheet. Maybe with the limited space you've got to work with inline assembly might be an alternative.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I tried the following code (as opposed to your snippet, it can be compiled without error):

extern struct
{
    char a, b, c;
    void *p;
} S[];

extern char a;

void func (unsigned char id, char b, char c, void *p)
{
    S[id].a = a;
    S[id].b = b;
    S[id].c = c;
    S[id].p = p;
}

Compiling with avr-gcc v6

> avr-gcc -mmcu=avr4 -S -Os mul.c

yields

func:
	ldi r25,lo8(5)
	mul r24,r25
	movw r30,r0
	clr __zero_reg__
	subi r30,lo8(-(S))
	sbci r31,hi8(-(S))
	lds r24,a
	st Z,r24
	std Z+1,r22
	std Z+2,r20
	std Z+4,r19
	std Z+3,r18
	ret

The clr __zero_reg__ is due to avr-gcc ABI, and you'll have to live with it (or completely rewite GCC's avr back end which is doable as you have the sources available).

 

> and am curious why the compiler chose to do this

 

The only way of finding out is to debug into avr-gcc.  Maybe reading the dumps will also give you some insight, so you may want to give -fdump-tree-all -fdump-rtl-all -fdump-ipa all a try.

 

avrfreaks does not support Opera. Profile inactive.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Useless moves to R25:24 is something I've seen before.

Don't know why.

My guess is that in avr-gcc's cost calculation, MOVW's to and from R25:24 get an excessive discount.

Quite possibly just changing one or two numbers would fix the issue.

This one does not know where to find the numbers.

 

If you want to use C and still get optimizations avr-gcc won;t do,

I see two work-arounds.

Inline assembly and hand-edited compiler output.

Though I expect it will, I won't promise the former will work.

The code the compiler issues to match C variables with inline assembly arguments might defeat you.

For the latter, compiler your C to assembly.

Make a copy of the assembly.

Edit the copy.

The result is what you assembly to produce an object file.

Keep the first two files around for documentation and possible changes.

 

"Demons after money.
Whatever happened to the still beating heart of a virgin?
No one has any standards anymore." -- Giles

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

From #7

> and am curious why the compiler chose to do this

It's a left over from when LPM only worked on R0 and the MUL instruction weren't made yet. Then it was kind of logic to place the _zero reg in r1. 

And I guess that the rest is history.

 

To OP, if you can't live with one extra cycle here and there, there is only one way and that is ASM.  

And then structure the program so it fit the job on task, don't take a C output and optimize that will never be optimal. (just at tad better than C)

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

SprinterSB wrote:

I tried the following code (as opposed to your snippet, it can be compiled without error):

extern struct
{
    char a, b, c;
    void *p;
} S[];

extern char a;

void func (unsigned char id, char b, char c, void *p)
{
    S[id].a = a;
    S[id].b = b;
    S[id].c = c;
    S[id].p = p;
}

Compiling with avr-gcc v6

> avr-gcc -mmcu=avr4 -S -Os mul.c

yields

func:
	ldi r25,lo8(5)
	mul r24,r25
	movw r30,r0
	clr __zero_reg__
	subi r30,lo8(-(S))
	sbci r31,hi8(-(S))
	lds r24,a
	st Z,r24
	std Z+1,r22
	std Z+2,r20
	std Z+4,r19
	std Z+3,r18
	ret

 

 

I tried this a few ways using your example

Using version 4.8.2 avr-gcc in linux I get the same as your output, both compiled as a c and cpp file, both using avr4 and my specfic atmega328p

Putting your code straight into Atmel Studio 7.0 in windows and making sure temps are not deleted, I get that redundant move instruction.

 

Now, I tried my Timer.cpp code using 4.8.2 using avr-gcc in linux and funnily enough, I get the same assembly as your small example shows.

 

With Atmel Studio, I created a blank config and unchecked everything that I could in terms of optimisation and just enabled optimise for size and I got the redundant move so I think this may either have something to do with Atmel Studio Packs or maybe Atmel has done something else with the configuration of the compiler. The atmel studio compiler is 4.9.2. Note: I couldn't compile my code from the windows command line using the same parameters you specified as I got some error about device-specs not being found.

 

I was more curious as to why the redundant move instruction was generated. I might grind out some assembly code if I scrape over my 3K limit.

 

Cheers all for your insights

 

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

AVR_Coder_1 wrote:

I don't know that I would agree 100% with that statement, but, change to a different processor, collaborate or come back in a year a look at your code, now that's a different story. It is for that reason I adopt this methodology.

 

 

I'm not sure I follow.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Useless moves to R25:24 is something I've seen before.

Don't know why.

Isn't R25:24 the spot for returning (int) values from a function?  Perhaps it's left over from treating the multiply like a function...

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Why the 3k limit?

Do you make a module or is the rest a bootloader or data.

 

 

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

westfw wrote:
Isn't R25:24 the spot for returning (int) values from a function? Perhaps it's left over from treating the multiply like a function...

 

You might be onto something there.

 

I would have no idea where to go digging in the source code to see why I am seeing this missed optimisation but it could be that this particular version of avr-gcc (4.9.2) uses a function to multiply with the result left in r25:r24 which is then copied into r31:r30. This happens only in AVR studio AFAIK and I am not proficient enough to write my own makefiles and compile them in Linux, I might give it a go some day...

 

Cheers

Last Edited: Tue. Mar 28, 2017 - 07:44 AM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

sparrow2 wrote:
Why the 3k limit? Do you make a module or is the rest a bootloader or data.

 

The device my firmware will ultimately go into is an ATmega88P. The boot loader will be 1K in size and because this microcontroller is going to be turning a system on, it is highly desirable to duplicate the application firmware in case of corruption, the bootloader will be able to copy the uncorrupted version across. So I guess actually have 3.5K.

 

Ultimately, if I can fit my code into less than 3.5K then that's great, I won't need to try and hand optimise anything. If it scrapes over that limit then I might try and fix a few routines up here and there. Failing that, I will need to convince the team to drop in a 168P or get the bootloader to turn on the system in case of corruption.

 

Cheers

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

adam3141 wrote:
but it could be that this particular version of avr-gcc (4.9.2)
Latest issue of AS7 has avr-gcc 5.4.0. While it may not change anything it has to be worth trying a much more up to date issue of the compiler.
adam3141 wrote:
I am not proficient enough to write my own makefiles
AS7 writes a makefile for you every time you build - why can't you not just lift that and use it in Linux? Alterantively in Linux use Mfile to get Makefiles created for you.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I don't know what kind of lib's you use, but I guess that some common variables could be placed in registers will reduce both size and speed, and use GPIO's aswell

 

With the newer compilers stay away from switches they become big and slow, (I still use 3.4 for the same reason. winavr2010). 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

clawson wrote:

Latest issue of AS7 has avr-gcc 5.4.0. While it may not change anything it has to be worth trying a much more up to date issue of the compiler.

The latest version on Atmel's site is build 1188 September 2016 release (which I have) still uses avr-gcc 4.9.2 so I checked for the separate toolchain download and it too was 4.9.2

Is there anything I am missing?

 

clawson wrote:
AS7 writes a makefile for you every time you build - why can't you not just lift that and use it in Linux? Alterantively in Linux use Mfile to get Makefiles created for you.

 

I might actually give Mfile a go. Thanks for the suggestion.

 

Cheers

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

adam3141 wrote:
Is there anything I am missing?
No. Atmel have issued AS7 with 5.4.0. They have not yet updated the standalone "toolchains" to match. So if you want to try their 5.4.0 you need to install latest AS7

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Here the other day I made the update, and it still make a very "generic" switch the take about 30 clk, when there are 6 or more cases! (the old compiler take max 1/2 of that , and about 1/2 size.)

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

sparrow2 wrote:
it still make a very "generic" switch the take about 30 clk, when there are 6 or more cases!
Try compiling with -fno-jump-tables.-fno-jump-tables-fno-jump-tables-fno-jump-tables-fno-jump-tables-fno-jump-tables-fno-jump-tables-fno-jump-tables

 

Edit: What's that? I don't see (and therefore can't remove) that repetition when editing.

Stefan Ernst

Last Edited: Tue. Mar 28, 2017 - 12:23 PM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

adam3141 wrote:
The device my firmware will ultimately go into is an ATmega88P. The boot loader will be 1K in size and ...
TinySafeBoot is 0.5KB for megaAVR including mega88P.

TinySafeBoot - A tiny, safe and flexible AVR-Bootloader for ATtinys and ATmegas

http://jtxp.org/tech/tinysafeboot_en.htm

 

"Dare to be naïve." - Buckminster Fuller

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

gchapman wrote:
TinySafeBoot is 0.5KB for megaAVR including mega88P.

 

Unfortunately, given the complexity our bootloader must have, this is just not an option.

 

 

So, I upgraded AS7 to use avr-gcc 5.4.0 and here is what I have come up with

 

"C:\Program Files (x86)\Atmel\Studio\7.0\toolchain\avr8\avr8-gnu-toolchain\bin\avr-g++.exe" -Os -mmcu=atmega328p -S Timer.cpp

/* prologue: function */
/* frame size = 0 */
/* stack size = 0 */
.L__stack_usage = 0
	ldi r21,lo8(5)
	mul r24,r21
	movw r24,r0
	clr __zero_reg__
	movw r30,r24
	subi r30,lo8(-(_ZL16s_timerCallbacks))
	sbci r31,hi8(-(_ZL16s_timerCallbacks))
	lds r24,_ZL6s_tick
	st Z,r24
	std Z+1,r22
	std Z+2,r20
	std Z+4,r19
	std Z+3,r18
	ret

As you can see, the compiler still inserts the redundant 'movw r24, r0' instruction

 

Now under linux (well, bash for windows ubuntu 14.04) using avr-gcc 4.8.2

avr-g++ -Os -mmcu=atmega328p -S Timer.cpp

/* prologue: function */
/* frame size = 0 */
/* stack size = 0 */
.L__stack_usage = 0
	ldi r25,lo8(5)
	mul r24,r25
	movw r30,r0
	clr __zero_reg__
	subi r30,lo8(-(_ZL16s_timerCallbacks))
	sbci r31,hi8(-(_ZL16s_timerCallbacks))
	lds r24,_ZL6s_tick
	st Z,r24
	std Z+1,r22
	std Z+2,r20
	std Z+4,r19
	std Z+3,r18
	ret

The result of the mul instruction is moved into r31:r30 directly without going through r25:r24

 

I am at a loss to explain it other than Atmel modifies the source files in some way, or their device packs do something to alter the generated assembly.

 

It would appear that this issue is solved by using inline assembly after it is determined that the generated code is unacceptable or by using the Linux avr-gcc versions.

 

Cheers everyone for your help and comments

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

If one single, superfluous instruction renders you code unacceptable, then you'll have to use (inline) assembly -- or a compiler brand that guarantees that it will never generate any such superfluous intruction and always will come up with optimal code.

 

For the record, I tested my test case from above with avr-gcc v4.7 ... v7 and amongst these 6 revisions, only v4.9 and v5 came up with the additional MOVW.

 

It's not unlikely there are other spots in you code where you can save some instructions, e.g. by inlining, LTO or similar, more explicit optimization hints.

avrfreaks does not support Opera. Profile inactive.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

As I said from the start I am not too worried about it, rather pointing out an oddity I found. If my code compiles and I fit within my code size limits then great, I need not bother with any more size optimisation then the default -Os provides me. I was just curious if this was a known issue, or there was some other compiler switch that I had not tried.

 

A different compiler brand is out of the question, sadly, it would be nice if we could use the IAR suite of compilers.

 

I have tried LTO, it obliterated my .initn sections so I will need to research a workaround, or indeed how to use the -flto compiler switch more effectively.

 

Cheers

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

What you mean with "obliterate initn sections"?  Presumably, the respective code is just missing attribute __used__. Or you have some problem in your linker description file.

avrfreaks does not support Opera. Profile inactive.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I have some code in section .init3 and .init8. When I compile with flags O3 or -Os, the code within those sections are present in the final object file. However, if I build my project with -flto also, the code within my .init3 and .init8 are no longer present.

 

I am by no means an expert at this. I mainly just rely on Atmel Studio to do most things for me. Actually, I need to crawl out of my shell so to speak and get down and dirty with makefiles and linker scripts...

 

Thanks, I will certainly give the attribute __used__ ago.

 

Cheers

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Even without LTO, .initN functions will be removed provided they are static (local to C module) and optimization is on.  This is because these functions have no usage in C/C++ code and are never called explicitly, and may be thrown away thusly. This is why __used__ is needed.

 

As an alternative, you can make symbols __externally_visible__, but there is no reason to clutter global name space: each module could come, for example, with it's own private static functon called "initN" in section .initN provided they are static.

 

This is a bit different to __constructor__ where GCC knows about implicit usage and hence implies __used__, IIRC.  Drawback of __constructor__ is that it requires some context and might come with more overhead, but __constructor__ are more like a functions (as opposed to .initN which must not have a frame).

 

I could add similar knowledge to avr-gcc, but it won't enter mainline before v8... Thus for the time being, just attribute respective functions as __used__.

 

avrfreaks does not support Opera. Profile inactive.