code optimization/minimization

#1

Hello:

Using studio 6.2, GCC compiler

Up until now there was plenty of space & I had a good handful of delays being used here & there:

 

_delay_ms(10);

 

I didn't much care how efficiently the compiler did its thing with this macro. The compiler probably doesn't notice that all of mine use the same argument value (10).

Since all of my delays are of the same length, it seemed reasonable to attempt  reducing code size:

 

void delay_10ms()
{
    _delay_ms(10);
}

I expected calling my delay_10ms repeatedly would produce one fixed subroutine that would be called whenever the 10ms delay was needed. However, looking at the .lss assembly file, this does not appear to be the case...each delay appears to be an individual collection of delay asm commands...no sign of any calls or rcalls. How do I get the improvement I'm seeking?

When in the dark remember-the future looks brighter than ever.   I look forward to being able to predict the future!

Last Edited: Sat. Jun 27, 2015 - 11:24 PM
#2

What optimisation settings are you using? Try -Os
If you want to optimise for speed overall, you can still use the function attribute:
__attribute__ ((__optimize__("Os")))
to optimise a specific function for size.
You can also specify:
__attribute__ ((__noinline__))

"Experience is what enables you to recognise a mistake the second time you make it."

"Good judgement comes from experience.  Experience comes from bad judgement."

"Wisdom is always wont to arrive late, and to be a little approximate on first possession."

"When you hear hoofbeats, think horses, not unicorns."

"Fast.  Cheap.  Good.  Pick two."

"We see a lot of arses on handlebars around here." - [J Ekdahl]

 

Last Edited: Sun. Jun 28, 2015 - 12:02 AM
#3

Thanks, well I don't want to optimize for speed (Os?), I want to optimize for size  (not have the code duplicated for each 10ms delay).

So perhaps your suggestion of __attribute__ ((__noinline__)) is needed.  [I'm not sure exactly where or how you apply this...in the code?]...I'll take a look

 

I actually played around with the optimization tab, but didn't see the desired effect. If that's all it takes, I'll go back and twiddle with the optimization level some more.

When in the dark remember-the future looks brighter than ever.   I look forward to being able to predict the future!

#4

avrcandies wrote:
Thanks, well I don't want to optimize for speed (Os?), I want to optimize for size  (not have the code duplicated for each 10ms delay)
You have misunderstood.

 

To optimise for size, use -Os.

 

Doing so enables all kinds of optimisation strategies to reduce code size, including not inlining code in some cases.  You can give the compiler a hint with __attribute__ ((__noinline__)).

 

To optimise for speed, use -O1, or -O2, or -O3.  Each of these does also optimise for size somewhat, but that is not the main focus.

 

You should have a look at the GCC docs.  There is lots to be read on the subject of optimisation.  Start here:

https://gcc.gnu.org/onlinedocs/gcc-4.2.0/gcc/Optimize-Options.html

 

You can selectively specify optimisation options on a per-function basis with __attribute__ ((__optimize__(<string>)))

 

I would try:

void __attribute__ ((__noinline__, __optimize__("Os"))) delay_10ms() {
  _delay_ms(10);
}

"Experience is what enables you to recognise a mistake the second time you make it."

"Good judgement comes from experience.  Experience comes from bad judgement."

"Wisdom is always wont to arrive late, and to be a little approximate on first possession."

"When you hear hoofbeats, think horses, not unicorns."

"Fast.  Cheap.  Good.  Pick two."

"We see a lot of arses on handlebars around here." - [J Ekdahl]

 

Last Edited: Sun. Jun 28, 2015 - 03:11 AM
#5

Apart from the fact that _delay_ms() plays little part in real MCU apps (use a timer!), the fact is that it simply calls the intrinsic __builtin_avr_delay_cycles(), which actually uses some of the most perfectly honed hand-crafted assembler code, and there is NOTHING on earth you could do to make the code any more space efficient.

 

If you really have a lot of calls to the delay I would be investigating a "better" solution (use a timer!). 
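
For illustration only, a rough sketch of a timer-based 10ms wait (assuming an ATmega328P-class part at 16MHz, Timer0 with a /1024 prescaler; check the numbers and register names against your own part):

#include <avr/io.h>

/* Sketch: poll Timer0 for roughly 10ms at 16MHz.
   16MHz / 1024 = 15625 ticks/s, so ~156 ticks = ~10ms. */
void timer0_delay_10ms(void) {
  TCNT0  = 0;                          /* reset the count      */
  TCCR0A = 0;                          /* normal mode          */
  TCCR0B = (1 << CS02) | (1 << CS00);  /* clk/1024 prescaler   */
  while (TCNT0 < 156)                  /* wait ~10ms           */
    ;
  TCCR0B = 0;                          /* stop the timer again */
}

It is still a wait, but it no longer depends on cycle counting, so interrupts firing in the middle don't stretch the delay the way they do with _delay_ms().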

#6

Shirley,   an inline delay is LDI, LDI, SBIW, BRNE or 4 instructions.    A subroutine call is LDI, LDI, CALL or 4 words of flash.   (3 if you can use RCALL)

Admittedly,   if your subroutine has a void argument,   it ends up as a simple CALL.

 

Mind you,   how many words of flash do you intend to "reduce" ?

 

If you start your message with "I am using a Tiny2313 and it is using 2050 bytes",   it becomes a realistic question.

 

David.

Last Edited: Sun. Jun 28, 2015 - 12:44 PM
#7

Remember that if it's a 2313 (up to 8K flash), the call only needs to be an rcall, so only one word, and so 3 words saved for each call.
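
To put rough numbers on that (purely illustrative, using the 4-words-per-inline-delay figure above and a 5-word out-of-line body): 20 inline delays is about 80 words, versus 20 rcalls plus one shared body at about 25 words - roughly 55 words (110 bytes) saved. With only a handful of delays the saving shrinks accordingly.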

 

But my guess is that there are better places to save some flash.

#8

If it's really that important,

the brute force solution is to call your function from inline assembly.

Write your function in assembly.

The function can be eight words.

An rcall is only one word.

 

Hiding the call within inline assembly will keep the compiler

from preserving and restoring the call-used registers.

You must ensure that doing so is not necessary.
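
Something along these lines (hypothetical names; the routine itself lives in a .S file, and as said above you must be sure no live call-used registers need to survive across it):

/* delay_10ms_asm is hand-written assembler, a few words ending in ret */
extern void delay_10ms_asm(void);

/* the rcall is hidden from the compiler, so it will not bother
   saving/restoring the call-used registers (r18-r27, r30/r31) around it */
__asm__ __volatile__ ("rcall delay_10ms_asm");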

 

A fundamental problem with cycle-counting delays is that they do not take into account interrupts.

A fundamental problem with busy waits is that they keep the processor busy when it might be doing other things.

 

Cycle-counting busy waits are rarely necessary and often would not work.

As noted, using a timer is usually the better choice.

I can see where the ease of writing _delay_ms(10) might be seductive.

If you are scrounging for flash, ease of coding might not be your highest priority right now.

As noted, there are likely to be better places to conserve flash.

If you removed all of your _delay_ms code, how much flash would you save?

Would it be enough?
 

Iluvatar is the better part of Valar.

#9

Thanks, this example worked like a champ:

void __attribute__ ((__noinline__, __optimize__("Os"))) delay_10ms() {
  _delay_ms(10);
}

I was really interested in how to "force" a subroutine/function call to be generated, rather than just letting the compiler "decide".  Your suggestion quickly took me to the desired settings in a sea of opposing optimization options.

I wonder if the compiler is smart enough to switch to this non-inline coding rather than simply running out of space---depending on the size & occurrences of the repeated code, there could be a large savings.

 

For some reason (highly code dependent), the global -Os setting did not generate the smallest code size; -O2 & -O3 did. After adding the noinline attribute noted above, -O2 & -O3 became even bigger winners in terms of space (but probably not speed).

   

When in the dark remember-the future looks brighter than ever.   I look forward to being able to predict the future!

Last Edited: Mon. Jun 29, 2015 - 05:05 AM
#10

We need to know a bit more, at least:

AVR model.

Code size.

Why all the 10ms delays.

What kind of ISRs you have.

Perhaps show all your code; if you can't show it here, perhaps PM the code, or show a typical part of your code.

#11

This was just a test of the concept, so all is well. I was wondering why multiple references to the same functions were being individually fully coded, rather than coded as calls to a common block of code. This was not happening even when selecting the -Os compile option for minimizing code size.

Once the suggestion was implemented on the function, asm calls to a common code block were produced and the code size became the lowest yet (lower than just setting -Os). So apparently the -Os option does not necessarily try every trick it can to produce the absolute smallest code size. You can manually guide it a little lower.

 

When in the dark remember-the future looks brighter than ever.   I look forward to being able to predict the future!

Last Edited: Mon. Jun 29, 2015 - 03:40 PM
#12

 I was wondering why multiple references to the same functions were being individually fully coded,

_delay_ms() is not a function - it's a macro. Yes, ultimately it makes a call to the intrinsic __builtin_avr_delay_cycles(), but that's simply the outcome of the macro invocation.

So apparently Os option does not necessarily try every trick it can to produce the absolute smallest code size. 

Because it's a macro! 

 

Now I know there is an argument to say that macro names should be in upper case so you can recognise a macro when you see one (PORTB, TCCR1A, etc.), and perhaps _delay_ms() should really be _DELAY_MS()? But you need to recognise when you are invoking macros, not functions. A macro is simply a string substitution. In this case the macro expands out to be "CALL __builtin_avr_delay_cycles(some_fixed_integer_value)".
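
A trivial illustration of the substitution idea (nothing to do with delay.h itself):

#define SQUARE(x) ((x) * (x))

int y = SQUARE(3 + 1);  /* rewritten by the preprocessor, before compilation,
                           into: int y = ((3 + 1) * (3 + 1)); */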

#13

clawson wrote:
Because it's a macro!
But the OP was asking about the following bit of code:

void delay_10ms()
{
    _delay_ms(10);
}

Seems that GCC was still inlining multiple calls to the wrapper.

 

I haven't tested that myself.  I'll do so and report back.

 

EDIT:

 

I cannot reproduce the results reported by the OP, at least not when using my preferred build options.

 

I did a bit of fiddling and managed it with a more basic build command.

#include <avr/io.h>
#include <util/delay.h>



void delay_10ms(void) {
  _delay_ms(10);
}



int main(void) {

  while(1) {

    delay_10ms();
    GPIOR0 = GPIOR0;

    delay_10ms();
    GPIOR0 = GPIOR0;

    delay_10ms();
    GPIOR0 = GPIOR0;

    delay_10ms();
    GPIOR0 = GPIOR0;

  }

}

 

The result when building with -Os using the more basic command is that the wrapper is inlined:

$ avr-gcc -Os -mmcu=atmega328p -DF_CPU=16000000UL -g -fdata-sections -ffunction-sections -Wl,--gc-sections osize_test.c -o osize_test.elf
$ avr-objdump -S osize_test.elf > osize_test.lss
00000080 <main>:
	#else
		//round up by default
		__ticks_dc = (uint32_t)(ceil(fabs(__tmp)));
	#endif

	__builtin_avr_delay_cycles(__ticks_dc);
  80:	8f e3       	ldi	r24, 0x3F	; 63
  82:	9c e9       	ldi	r25, 0x9C	; 156
  84:	01 97       	sbiw	r24, 0x01	; 1
  86:	f1 f7       	brne	.-4      	; 0x84 <main+0x4>
  88:	00 c0       	rjmp	.+0      	; 0x8a <main+0xa>
  8a:	00 00       	nop
int main(void) {

  while(1) {

    delay_10ms();
    GPIOR0 = GPIOR0;
  8c:	8e b3       	in	r24, 0x1e	; 30
  8e:	8e bb       	out	0x1e, r24	; 30
  90:	8f e3       	ldi	r24, 0x3F	; 63
  92:	9c e9       	ldi	r25, 0x9C	; 156
  94:	01 97       	sbiw	r24, 0x01	; 1
  96:	f1 f7       	brne	.-4      	; 0x94 <main+0x14>
  98:	00 c0       	rjmp	.+0      	; 0x9a <main+0x1a>
  9a:	00 00       	nop

    delay_10ms();
    GPIOR0 = GPIOR0;
  9c:	8e b3       	in	r24, 0x1e	; 30
  9e:	8e bb       	out	0x1e, r24	; 30
  a0:	8f e3       	ldi	r24, 0x3F	; 63
  a2:	9c e9       	ldi	r25, 0x9C	; 156
  a4:	01 97       	sbiw	r24, 0x01	; 1
  a6:	f1 f7       	brne	.-4      	; 0xa4 <main+0x24>
  a8:	00 c0       	rjmp	.+0      	; 0xaa <main+0x2a>
  aa:	00 00       	nop

    delay_10ms();
    GPIOR0 = GPIOR0;
  ac:	8e b3       	in	r24, 0x1e	; 30
  ae:	8e bb       	out	0x1e, r24	; 30
  b0:	8f e3       	ldi	r24, 0x3F	; 63
  b2:	9c e9       	ldi	r25, 0x9C	; 156
  b4:	01 97       	sbiw	r24, 0x01	; 1
  b6:	f1 f7       	brne	.-4      	; 0xb4 <main+0x34>
  b8:	00 c0       	rjmp	.+0      	; 0xba <main+0x3a>
  ba:	00 00       	nop

    delay_10ms();
    GPIOR0 = GPIOR0;
  bc:	8e b3       	in	r24, 0x1e	; 30
  be:	8e bb       	out	0x1e, r24	; 30
  c0:	df cf       	rjmp	.-66     	; 0x80 <main>

 

A bit of digging has pointed the finger at one of my preferred build options, -ffreestanding.  With it, the wrapper is not inlined:

$ avr-gcc -Os -mmcu=atmega328p -DF_CPU=16000000UL -g -fdata-sections -ffreestanding -ffunction-sections -Wl,--gc-sections osize_test.c -o osize_test.elf
$ avr-objdump -S osize_test.elf > osize_test.lss
00000080 <delay_10ms>:
    milliseconds can be achieved.
 */
void
_delay_loop_2(uint16_t __count)
{
	__asm__ volatile (
  80:	80 e4       	ldi	r24, 0x40	; 64
  82:	9c e9       	ldi	r25, 0x9C	; 156
  84:	01 97       	sbiw	r24, 0x01	; 1
  86:	f1 f7       	brne	.-4      	; 0x84 <delay_10ms+0x4>
  88:	08 95       	ret

0000008a <main>:

int main(void) {

  while(1) {

    delay_10ms();
  8a:	0e 94 40 00 	call	0x80	; 0x80 <delay_10ms>
    GPIOR0 = GPIOR0;
  8e:	8e b3       	in	r24, 0x1e	; 30
  90:	8e bb       	out	0x1e, r24	; 30

    delay_10ms();
  92:	0e 94 40 00 	call	0x80	; 0x80 <delay_10ms>
    GPIOR0 = GPIOR0;
  96:	8e b3       	in	r24, 0x1e	; 30
  98:	8e bb       	out	0x1e, r24	; 30

    delay_10ms();
  9a:	0e 94 40 00 	call	0x80	; 0x80 <delay_10ms>
    GPIOR0 = GPIOR0;
  9e:	8e b3       	in	r24, 0x1e	; 30
  a0:	8e bb       	out	0x1e, r24	; 30

    delay_10ms();
  a2:	0e 94 40 00 	call	0x80	; 0x80 <delay_10ms>
    GPIOR0 = GPIOR0;
  a6:	8e b3       	in	r24, 0x1e	; 30
  a8:	8e bb       	out	0x1e, r24	; 30
  aa:	ef cf       	rjmp	.-34     	; 0x8a <main>

 

"Experience is what enables you to recognise a mistake the second time you make it."

"Good judgement comes from experience.  Experience comes from bad judgement."

"Wisdom is always wont to arrive late, and to be a little approximate on first possession."

"When you hear hoofbeats, think horses, not unicorns."

"Fast.  Cheap.  Good.  Pick two."

"We see a lot of arses on handlebars around here." - [J Ekdahl]

 

Last Edited: Mon. Jun 29, 2015 - 04:51 PM
#14

Surely the compiler will do a cost benefit analysis. Inline code is hardly expensive for the delay sequence. But it becomes a pain when debugging.
Codevision factorises common sequences. Effective for reducing size with the penalties of extra rcall/ret time.
As others have said, if size is a problem there are other things to try first.

Last Edited: Mon. Jun 29, 2015 - 05:16 PM
#15

But the OP was asking about the following bit of code:

 

void delay_10ms()
{
    _delay_ms(10);
}

Yes, that is true. It didn't compile as I'd assumed, using -Os.

 

However, adding in another variable operation DOES cause -Os to work as I expected & use a common block for this code:

 

void delay_10ms()
{
    _delay_ms(10);
    junk++;
}

...interesting

Apparently I unluckily picked the wrong example to investigate!

 

When in the dark remember-the future looks brighter than ever.   I look forward to being able to predict the future!

#16

By the way, I'd also like to know how to tell the compiler to try to use AVR registers rather than RAM for variables... even with -O3 this program

#include <avr/io.h>

volatile uint8_t c = 100;

int main(void) {
    c--;
}

compiles as

 194: 80 91 00 20  lds r24, 0x2000
 198: 81 50        subi r24, 0x01 ; 1
 19a: 80 93 00 20  sts 0x2000, r24
}
 19e: 80 e0        ldi r24, 0x00 ; 0
 1a0: 90 e0        ldi r25, 0x00 ; 0
 1a2: 08 95        ret

Using lds & sts is grossly inefficient. If variable c were assigned to, say, register r13, then a simple:

 

dec r13

 

is all that needs to be generated.

 

Is there a way to ask the compiler to try (and possibly give up if the program is too large or complex, or leaves no registers available)?

When in the dark remember-the future looks brighter than ever.   I look forward to being able to predict the future!

#17

But what do you think "volatile" means?!?

 

If you want it in registers, don't use the volatile.

 

(Of course, without the volatile, no code whatsoever will be generated, because 'c' is pointless.)
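
For example (just a sketch), drop the volatile and give 'c' an observable use and the compiler is free to keep it in a register for the whole loop:

#include <avr/io.h>

uint8_t c = 100;

int main(void) {
  while (c) {     /* no volatile: 'c' can live in a register here */
    c--;
    GPIOR0 = c;   /* something externally visible, so the loop isn't thrown away */
  }
}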

#18

But what do you think "volatile" means?!?

Good point, I just hate to see the efficiency of the AVR architecture wasted (by using lds, subi, sts when a simple inc or dec directly on a plentiful register could do the job -- that's an 80% reduction in code bytes).

Maybe it's best just to keep my eyes closed & let the compiler maintain its occasional inefficient ways. As long as things run "fast enough" and fit into the chip we can get some sleep.

When in the dark remember-the future looks brighter than ever.   I look forward to being able to predict the future!

Last Edited: Mon. Jun 29, 2015 - 07:11 PM
#19

Good point, I just hate to see the efficiency of the AVR architecture wasted (by using lds, subi, sts when a simple inc or dec directly on a plentiful register could do the job -- that's an 80% reduction in code bytes).

Maybe it's best just to keep my eyes closed & let the compiler maintain its occasional inefficient ways.

Sheesh--as Cliff said, YOU told the compiler to do it that way, with the 'volatile'.

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.

#20

 

 YOU told the compiler to do it that way

My apologies, I was just exploring/investigating under what circumstances the GCC compiler would simply "see", or could be guided to "see", that it was able to use a register rather than RAM to store a variable. If the variable is used quite a lot, such as mode_of_the_system, it could make a large speed/size difference, due to much speedier access/manipulations, bit comparisons, etc.

When in the dark remember-the future looks brighter than ever.   I look forward to being able to predict the future!

#21

The only circumstance under which GCC will store a variable in a register is if that variable is an automatic variable, or if it is declared as a register variable.

 

http://www.nongnu.org/avr-libc/user-manual/FAQ.html#faq_regbind

There are serious restrictions on doing so however, and it should not be done unless you fully understand those restrictions.
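
The binding itself looks like this (sketch only; r13 is picked just because it was mentioned above, and per the FAQ the chosen register has to be kept away from the compiler in every compilation unit, library code included):

#include <stdint.h>

/* globally bind a variable to a CPU register - see the FAQ for the caveats */
register uint8_t mode_of_the_system asm("r13");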

 

Also, not all automatic variables will be stored in registers. Which ones are will depend upon the demand placed upon the register allocation algorithm.

 

No high-level language is designed to give you this kind of control, not even C.   If you need to control register allocation, you should write your apps in assembler.

"Experience is what enables you to recognise a mistake the second time you make it."

"Good judgement comes from experience.  Experience comes from bad judgement."

"Wisdom is always wont to arrive late, and to be a little approximate on first possession."

"When you hear hoofbeats, think horses, not unicorns."

"Fast.  Cheap.  Good.  Pick two."

"We see a lot of arses on handlebars around here." - [J Ekdahl]

 

#22

All that Joey says is true but I think the proof of the pudding is in the eating. Actually study the .lss for some functions and I think you'll be suitably impressed at what the optimizer does achieve - especially at -Os and -O3. The fact is that it's not often there are so many (ACTIVE!) autos that it has to spill out to RAM storage. Even with globals - while it obviously must STS if they are updated, you will often see it keeping them cached in local registers throughout a routine, with just one final store to update the globally visible copy at the end.

 

The obvious technique of keeping variables in as limited a scope as possible also helps the compiler's ability to optimize access. So don't use a global when it can be a local, and use static as much as you can.
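
A sketch of the kind of thing meant (hypothetical names): work on a local copy so the optimiser can hold it in a register, with one store back at the end.

#include <stdint.h>

uint8_t mode;             /* global, lives in RAM                   */

void update_mode(void) {
  uint8_t m = mode;       /* local copy, free to sit in a register  */
  /* ...lots of tests and updates on m... */
  m++;
  mode = m;               /* one final store to the visible copy    */
}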

 

Of course the GCC optimiser (after the 64 or 128 passes or whatever it is that it does during optimization) is only applying fixed rules, and there are combinations in your code that manage to miss its best efforts to optimize - so-called "missed optimizations". For those it's often the case that you can help the optimizer along a bit - perhaps manually reordering events or applying a cast here and there. Ultimately, if you have a piece of very time-critical code and cannot persuade the compiler to do the best job possible, then you may have to consider recoding that bit in Asm. If you plan to do this, use -save-temps, then use the contents of the .s file as your starting point and just fix the missed optimizations in that.
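
For example (paths and options are just placeholders), something like:

$ avr-gcc -Os -mmcu=atmega328p -DF_CPU=16000000UL -save-temps -c osize_test.c

leaves osize_test.i (the preprocessed source) and osize_test.s (the generated assembler) next to osize_test.o.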

 

BTW if you are still in the development stage and you are hitting the flash limits of your chip and worrying about optimization to squeeze it in then you are probably developing for the wrong chip. Trade up to the next binary step in size within the family. Once you deploy there will be improvements and bug fixes you want to make and if you are already at the limit there may/will not be room for these.

#23

clawson wrote:
BTW if you are still in the development stage and you are hitting the flash limits of your chip and worrying about optimization to squeeze it in then you are probably developing for the wrong chip.

Absolutely!!

Top Tips:

  1. How to properly post source code - see: https://www.avrfreaks.net/comment... - also how to properly include images/pictures
  2. "Garbage" characters on a serial terminal are (almost?) invariably due to wrong baud rate - see: https://learn.sparkfun.com/tutorials/serial-communication
  3. Wrong baud rate is usually due to not running at the speed you thought; check by blinking a LED to see if you get the speed you expected
  4. Difference between a crystal, and a crystal oscillator: https://www.avrfreaks.net/comment...
  5. When your question is resolved, mark the solution: https://www.avrfreaks.net/comment...
  6. Beginner's "Getting Started" tips: https://www.avrfreaks.net/comment...
#24

I will say it really depends; all my extra test/debug code makes my prototype bigger than the product! (And yes, if I have plenty of space it stays in forever.)

 

The product I work on at the moment started as a mega16, but moved to a 324 because I needed an extra UART and it was cheaper than a 164, so the code is about 20k at the moment but could easily be made to fit 16k (8k would be pushing it unless I rewrote it in ASM).

Last Edited: Tue. Jun 30, 2015 - 11:06 AM