[TUT][C]Optimization and the importance of volatile in GCC


GCC is an optimizing C compiler. The advantage of this is that, used right, it can (like the other good AVR C compilers) generate code that is perhaps only 10% larger/slower than hand-crafted assembler. There is, however, a penalty to be paid for it trying to generate the most efficient code possible.

Those beginning to program the AVR usually first encounter this in one of a number of places:

1) you build a C program then try to run it in AVR Studio's debugger or simulator. As you single step the code the yellow arrow that shows the point of execution appears to jump about and not follow the sequential path of execution you may have been expecting to see.

2) a program uses one or more interrupts and there are one or more variables used in both the interrupt handlers (ISR) and the main() code that appear to be ignored by the code in main. The ISR changes the variable but the main code does not "see" the change.

3) you try to insert a delay into the C code by simply using an empty for() loop counting from 0 to N but it doesn't delay at all.

4) you try to use one of the features of the AVR where Atmel dictates that two operations must happen within a fixed number of cycles (often 4 cycles) and you cannot make the sequence work (such as writing twice to the JTD bit to disable the JTAG interface).

5) You attempt to debug a program in the Studio debugger/simulator but when you try to add a variable to the "watch window" it always shows "variable not in scope"

6) You use the delay routines in <util/delay.h> but suddenly your program grows in size by 1..3K (this is actually an effect of not optimising, not of using the routines)

Looking at each of these in turn:

1) When GCC's optimiser is switched on (using a -O option other than -O0, such as -Os or -O3) it will optimize the result: discarding code sections that apparently do nothing useful, re-using the same sequence of opcodes if it appears several times, inlining small functions, reordering code so that registers involved in one C statement may be initialized long before they are actually used, and many other techniques to reduce the size and increase the speed of the resulting code.

Without optimization there's usually a one-to-many relationship between each C source statement and the opcodes generated to implement it. When a disassembly listing (such as the .s or .lss file) is studied, a single C statement may be found to have generated 5..10 or more AVR opcodes, but each block of opcodes is distinct and identifiable. This example program:

int main(void) {
	PORTB = 0x55;
	PORTD = 0x55;
	while(1);
}

generates:

	PORTB = 0x55;
  74:	e8 e3       	ldi	r30, 0x38	; 56
  76:	f0 e0       	ldi	r31, 0x00	; 0
  78:	85 e5       	ldi	r24, 0x55	; 85
  7a:	80 83       	st	Z, r24
	PORTD = 0x55;
  7c:	e2 e3       	ldi	r30, 0x32	; 50
  7e:	f0 e0       	ldi	r31, 0x00	; 0
  80:	85 e5       	ldi	r24, 0x55	; 85
  82:	80 83       	st	Z, r24

You don't need to know what all that means (it's from the .lss file that was output) - though a C programmer should have an understanding of the underlying Asm. But note, simplistically, that because the value 0x55 was used to set two different AVR registers (PORTB and PORTD), it has been loaded into R24 twice. There is a separate block of opcodes for each C statement, but it's very wasteful.
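If you want to reproduce listings like the one above from your own builds, this is a sketch assuming the standard avr-gcc toolchain is on your PATH (the file names and MCU here are illustrative, not from the original post):

```shell
# Compile without optimization for a mega16 (adjust names/MCU to suit)
avr-gcc -mmcu=atmega16 -O0 -o demo.elf demo.c

# Produce a mixed C/asm listing like the .lss files quoted here
avr-objdump -h -S demo.elf > demo.lss
```

The `-S` option interleaves the C source with the disassembly, which is exactly the format shown throughout this tutorial.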

If the same code is built with optimization the result is much more compact and efficient:

	PORTB = 0x55;
  6c:	85 e5       	ldi	r24, 0x55	; 85
  6e:	88 bb       	out	0x18, r24	; 24
	PORTD = 0x55;
  70:	82 bb       	out	0x12, r24	; 18

In this new version the value 0x55 is only loaded into R24 once. On the one hand this is good as it's (in part) what makes the code more compact (the other being the switch from ST to the much more efficient OUT). But hopefully it's now obvious that the code used to implement "PORTB = 0x55" and "PORTD = 0x55" are effectively sharing an AVR opcode.

So there's no longer a simple connection between one C statement and one block of AVR opcodes. Now, in this simplest of examples, one opcode is being shared by two C statements. This is simply because the optimizer has recognized "why would I bother loading 0x55 into R24 again when I know it already contains that value?"

But it's more complex versions of this that can lead to the "yellow arrow" jumping about. Even just a small modification to the example:

int main(void) {
	PORTB = 0x55;
	PORTC = 0xAA; // this line added
	PORTD = 0x55;
	while(1);
}

leads to the code being:

	PORTB = 0x55;
  6c:	95 e5       	ldi	r25, 0x55	; 85
  6e:	98 bb       	out	0x18, r25	; 24
	PORTC = 0xAA;
  70:	8a ea       	ldi	r24, 0xAA	; 170
  72:	85 bb       	out	0x15, r24	; 21
	PORTD = 0x55;
  74:	92 bb       	out	0x12, r25	; 18

The first and third lines of the source now share "ldi r25, 0x55". If you run this code in the simulator (it will build/simulate for mega16 and other processors) you will find that the yellow arrow starts on the opening brace of main(), but on the next step it goes straight to the PORTC=0xAA line, having skipped the PORTB=0x55 line; it then moves to the PORTD=0x55 line and finally disappears into the while(1) at the end, never to be seen again.

If you were just stepping the C this might lead you to believe that the PORTB=0x55 line had never been executed. However, the IO view in the simulator/debugger will show that all the statements were executed.

A very useful technique for debugging optimized C code is to start the debugger/simulator which will show the yellow arrow on the opening brace of main(). Now select "Disassembler" on the View menu. This opens a new window where the yellow arrow is now positioned on the first opcode of:

5:        	PORTB = 0x55;
+00000036:   E595        LDI       R25,0x55       Load immediate
+00000037:   BB98        OUT       0x18,R25       Out to I/O location
6:        	PORTC = 0xAA;
+00000038:   EA8A        LDI       R24,0xAA       Load immediate
+00000039:   BB85        OUT       0x15,R24       Out to I/O location
7:        	PORTD = 0x55;
+0000003A:   BB92        OUT       0x12,R25       Out to I/O location

While this window (and not the C source window) has the focus, pressing the [Step Into] icon (or pressing F11) will execute just a single AVR opcode, not an entire C statement, on each click/press. If you press it once the LDI is executed and the yellow arrow halts on the first OUT instruction. If you keep stepping you will be happy to see that ALL these statements are executed in turn and nothing is really missed out in the execution of the program at all.

It's just that Studio's job is made a bit tricky when it can no longer just execute a single block of opcodes for each C statement. Hence the "yellow arrow jumps around" effect. If you are puzzled always switch to the mixed C/Asm view and follow the opcodes.

For points (2) and (3) above here is a program that demonstrates the two points:

#include <avr/io.h>
#include <avr/interrupt.h>

char count;

int main(void) {
   char i;
   
   count = 0;
   TIMSK = (1<<TOIE0); // overflow interrupts
   TCCR0 |= (1<<CS01); // start timer0 no prescale
   sei();
   for (i=0; i<100; i++) {
      // just delay
   }
   while (count < 10) {
      PORTB = 0xAA;
   }
   PORTB = 0x55;
   while(1);
}

ISR(TIMER0_OVF_vect) {
  count++;
}

When built for a mega16 with -Os this shows the two commonest optimization "gotchas" that beginners may not be aware of. The code generated is:

0000007c <main>:
int main(void) {
   char i;

   count = 0;
  7c:	10 92 61 00 	sts	0x0061, r1
   TIMSK = (1<<TOIE0); // overflow interrupts
  80:	81 e0       	ldi	r24, 0x01	; 1
  82:	89 bf       	out	0x39, r24	; 57
   TCCR0 |= (1<<CS01); // start timer0 no prescale
  84:	83 b7       	in	r24, 0x33	; 51
  86:	82 60       	ori	r24, 0x02	; 2
  88:	83 bf       	out	0x33, r24	; 51
   sei();
  8a:	78 94       	sei
   for (i=0; i<100; i++) {
      // just delay
   }
   while (count < 10) {
      PORTB = 0xAA;
  8c:	8a ea       	ldi	r24, 0xAA	; 170
  8e:	88 bb       	out	0x18, r24	; 24
  90:	fe cf       	rjmp	.-4      	; 0x8e

Again you don't have to be an AVR Asm expert and try to understand all of this (though if you can it's a really useful skill to have) but it's hopefully obvious that a lot of the program appears to have "gone missing"!

The for() loop does not seem to have generated any code, there's no sign of code using the value 0x55, and the program appears to be infinitely stuck in the first while() loop. This is because:

a) The compiler, with optimization, will discard pointless code. After starting the timer there is a "delay" using a count from 0 to 99. That for() loop has generated no code whatsoever (the disassembly shows the source lines but no opcodes generated for them). This is because the delay has no inputs and no outputs so, to the compiler, which is trying to make the code as small and fast as possible, it just seems pointless - so it is discarded.

A simple solution is presented below but, to be honest, if using AVR-LibC it makes far more sense not to try and code your own delay loops but, instead, to use the _delay_ms(), _delay_us(), _delay_loop_1() and _delay_loop_2() functions found in <util/delay.h> and <util/delay_basic.h>, which are explained on these two pages:

http://www.nongnu.org/avr-libc/user-manual/group__util__delay.html
http://www.nongnu.org/avr-libc/user-manual/group__util__delay__basic.html
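To be concrete, here is a minimal sketch of using _delay_ms() instead of a hand-rolled empty loop. The F_CPU value is an assumption (it must match your actual clock frequency) and the pin usage is purely illustrative:

```c
#define F_CPU 1000000UL        /* assumed 1 MHz clock - set to yours */
#include <avr/io.h>
#include <util/delay.h>

int main(void) {
    DDRB = 0xFF;               /* all PORTB pins as outputs */
    for (;;) {
        PORTB ^= 0xFF;         /* toggle all the pins */
        _delay_ms(500);        /* constant argument, so the timing math
                                  is done at compile time (with -O on) */
    }
}
```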

b) the real "gotcha" in the program example (and in the use of the optimizer in general) is the use of the variable 'count' in this program. The idea had been that the timer with interrupts would be started, the timer would interrupt for 10 overflows, incrementing 'count' each time - at least it does this bit OK:

ISR(TIMER0_OVF_vect) {
  92:	1f 92       	push	r1
  94:	0f 92       	push	r0
  96:	0f b6       	in	r0, 0x3f	; 63
  98:	0f 92       	push	r0
  9a:	11 24       	eor	r1, r1
  9c:	8f 93       	push	r24
  count++;
  9e:	80 91 61 00 	lds	r24, 0x0061
  a2:	8f 5f       	subi	r24, 0xFF	; 255
  a4:	80 93 61 00 	sts	0x0061, r24
}
  a8:	8f 91       	pop	r24
  aa:	0f 90       	pop	r0
  ac:	0f be       	out	0x3f, r0	; 63
  ae:	0f 90       	pop	r0
  b0:	1f 90       	pop	r1
  b2:	18 95       	reti

The "problem" is that the while loop in main() was supposed to keep an eye on 'count': while it was less than 10 you may have been expecting to see some code that output 0xAA to PORTB, then, when main() saw that 'count' had reached 10, it should have finished the while(count<10) loop and output 0x55 instead, followed by the infinite while(1) loop. Yet there's no sign in what's been generated of any code that would ever output 0x55 or fall into the final while(1) loop.

The reason is that, as far as the compiler is concerned when it's compiling main(), execution enters with count=0 (all globals default to 0) and then there is no way (as far as main() and the compiler are concerned) that count can ever change value. The compiler cannot "see" or know that the value of 'count' may be changed in the separate ISR() function, so it compiles the program as if it had been written as:

   sei();
   for (i=0; i<100; i++) {
      // just delay
   }
   while (1) { // count would ALWAYS be less than 10
      PORTB = 0xAA;
   }
   PORTB = 0x55;
   while(1);

In this version execution could never escape that first while() loop, so the PORTB=0x55 and the final while(1) can never be reached. The program therefore ends at:

  8c:	8a ea       	ldi	r24, 0xAA	; 170
  8e:	88 bb       	out	0x18, r24	; 24
  90:	fe cf       	rjmp	.-4      	; 0x8e 

which repeatedly outputs 0xAA to PORTB just as the compiler believes it was asked to do.

If it's required that this program behave as originally written, it's possible to tell the compiler that the variables 'i' and 'count' must not be ignored by using the keyword 'volatile', which means "this variable is possibly subject to use elsewhere, so you must always read/write it when told to". With the modification as follows:

volatile char count;

int main(void) {
   volatile char i;

the code generated becomes:

   sei();
  94:	78 94       	sei
   for (i=0; i<100; i++) {
  96:	19 82       	std	Y+1, r1	; 0x01
  98:	03 c0       	rjmp	.+6      	; 0xa0 
  9a:	89 81       	ldd	r24, Y+1	; 0x01
  9c:	8f 5f       	subi	r24, 0xFF	; 255
  9e:	89 83       	std	Y+1, r24	; 0x01
  a0:	89 81       	ldd	r24, Y+1	; 0x01
  a2:	84 36       	cpi	r24, 0x64	; 100
  a4:	d0 f3       	brcs	.-12     	; 0x9a 
  a6:	02 c0       	rjmp	.+4      	; 0xac 
      // just delay
   }
   while (count < 10) {
      PORTB = 0xAA;
  a8:	98 bb       	out	0x18, r25	; 24
  aa:	01 c0       	rjmp	.+2      	; 0xae 
  ac:	9a ea       	ldi	r25, 0xAA	; 170
   TCCR0 |= (1<<CS01); // start timer0 no prescale
   sei();
   for (i=0; i<100; i++) {
      // just delay
   }
   while (count < 10) {
  ae:	80 91 60 00 	lds	r24, 0x0060
  b2:	8a 30       	cpi	r24, 0x0A	; 10
  b4:	c8 f3       	brcs	.-14     	; 0xa8 
      PORTB = 0xAA;
   }
   PORTB = 0x55;
  b6:	85 e5       	ldi	r24, 0x55	; 85
  b8:	88 bb       	out	0x18, r24	; 24
  ba:	ff cf       	rjmp	.-2      	; 0xba 

in which the for() loop has generated some delaying code and a check is repeatedly made on the 'count' variable; because the ISR will eventually increase it to 10, the code then goes on to output the value 0x55 and enters the final "rjmp .-2", which is the while(1) loop.

4) The problem with timed sequences is that Atmel dictates that certain registers must be written twice within 4 cycles. For example a typical sequence to disable JTAG is:

MCUCSR = (1<<JTD);
MCUCSR = (1<<JTD);

and to set a new value into CLKPR it's typically:

CLKPR = (1<<CLKPCE);
CLKPR = 0;

The MCUCSR or CLKPR will be written using either OUT or STS depending on where they are located in memory. The requirement is that the two writing instructions happen within 4 cycles.

When this code is built with optimization enabled (-Os in this case) the sequences generated are:

	MCUCSR = (1<<JTD);
  6c:	80 e8       	ldi	r24, 0x80	; 128
  6e:	84 bf       	out	0x34, r24	; 52
	MCUCSR = (1<<JTD);
  70:	84 bf       	out	0x34, r24	; 52

and

	CLKPR = (1<<CLKPCE);
  80:	80 e8       	ldi	r24, 0x80	; 128
  82:	80 93 61 00 	sts	0x0061, r24
	CLKPR = 0;
  86:	10 92 61 00 	sts	0x0061, r1

In both of these the writes (OUT, STS) are so close together that there's no worry about them meeting the four cycle requirement.

If the same codes are built using -O0 to turn off optimization then the generated code is far more long winded and it's far more likely that the code will not meet the 4 cycle timing requirement:

  74:	e4 e5       	ldi	r30, 0x54	; 84
  76:	f0 e0       	ldi	r31, 0x00	; 0
  78:	80 e8       	ldi	r24, 0x80	; 128
  7a:	80 83       	st	Z, r24
	MCUCSR = (1<<JTD);
  7c:	e4 e5       	ldi	r30, 0x54	; 84
  7e:	f0 e0       	ldi	r31, 0x00	; 0
  80:	80 e8       	ldi	r24, 0x80	; 128
  82:	80 83       	st	Z, r24

and

	CLKPR = (1<<CLKPCE);
  88:	e1 e6       	ldi	r30, 0x61	; 97
  8a:	f0 e0       	ldi	r31, 0x00	; 0
  8c:	80 e8       	ldi	r24, 0x80	; 128
  8e:	80 83       	st	Z, r24
	CLKPR = 0;
  90:	e1 e6       	ldi	r30, 0x61	; 97
  92:	f0 e0       	ldi	r31, 0x00	; 0
  94:	10 82       	st	Z, r1
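One related point worth noting: even with optimization on, an interrupt arriving between the two writes would also break the 4-cycle window, so timed sequences are usually wrapped in an interrupt-disabled section. A sketch along those lines (the function name is made up; the MCUCSR/JTD names are as in the example above and are device-dependent):

```c
#include <avr/io.h>
#include <avr/interrupt.h>

static inline void jtag_disable(void)
{
    uint8_t sreg = SREG;   /* remember current interrupt state */
    cli();                 /* an ISR between the writes would miss the window */
    MCUCSR = (1 << JTD);   /* first write... */
    MCUCSR = (1 << JTD);   /* ...second write within 4 cycles */
    SREG = sreg;           /* restore interrupt state */
}
```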

Finally, another way to "fix" the counting program would have been to build the entire program with -O0, which would have generated:

0000007c <main>:
#include <avr/io.h>
#include <avr/interrupt.h>

char count;

int main(void) {
  7c:	df 93       	push	r29
  7e:	cf 93       	push	r28
  80:	0f 92       	push	r0
  82:	cd b7       	in	r28, 0x3d	; 61
  84:	de b7       	in	r29, 0x3e	; 62
   char i;

   count = 0;
  86:	10 92 60 00 	sts	0x0060, r1
   TIMSK = (1<<TOIE0); // overflow interrupts
  8a:	e9 e5       	ldi	r30, 0x59	; 89
  8c:	f0 e0       	ldi	r31, 0x00	; 0
  8e:	81 e0       	ldi	r24, 0x01	; 1
  90:	80 83       	st	Z, r24
   TCCR0 |= (1<<CS01); // start timer0 no prescale
  92:	a3 e5       	ldi	r26, 0x53	; 83
  94:	b0 e0       	ldi	r27, 0x00	; 0
  96:	e3 e5       	ldi	r30, 0x53	; 83
  98:	f0 e0       	ldi	r31, 0x00	; 0
  9a:	80 81       	ld	r24, Z
  9c:	82 60       	ori	r24, 0x02	; 2
  9e:	8c 93       	st	X, r24
   sei();
  a0:	78 94       	sei
   for (i=0; i<100; i++) {
  a2:	19 82       	std	Y+1, r1	; 0x01
  a4:	03 c0       	rjmp	.+6      	; 0xac
  a6:	89 81       	ldd	r24, Y+1	; 0x01
  a8:	8f 5f       	subi	r24, 0xFF	; 255
  aa:	89 83       	std	Y+1, r24	; 0x01
  ac:	89 81       	ldd	r24, Y+1	; 0x01
  ae:	84 36       	cpi	r24, 0x64	; 100
  b0:	d0 f3       	brcs	.-12     	; 0xa6
  b2:	04 c0       	rjmp	.+8      	; 0xbc
      // just delay
   }
   while (count < 10) {
      PORTB = 0xAA;
  b4:	e8 e3       	ldi	r30, 0x38	; 56
  b6:	f0 e0       	ldi	r31, 0x00	; 0
  b8:	8a ea       	ldi	r24, 0xAA	; 170
  ba:	80 83       	st	Z, r24
   while (count < 10) {
  bc:	80 91 60 00 	lds	r24, 0x0060
  c0:	8a 30       	cpi	r24, 0x0A	; 10
  c2:	c0 f3       	brcs	.-16     	; 0xb4
      PORTB = 0xAA;
   }
   PORTB = 0x55;
  c4:	e8 e3       	ldi	r30, 0x38	; 56
  c6:	f0 e0       	ldi	r31, 0x00	; 0
  c8:	85 e5       	ldi	r24, 0x55	; 85
  ca:	80 83       	st	Z, r24
  cc:	ff cf       	rjmp	.-2      	; 0xcc

000000ce <__vector_9>:
   while(1);
}

ISR(TIMER0_OVF_vect) {
  ce:	1f 92       	push	r1
  d0:	0f 92       	push	r0
  d2:	0f b6       	in	r0, 0x3f	; 63
  d4:	0f 92       	push	r0
  d6:	11 24       	eor	r1, r1
  d8:	8f 93       	push	r24
  da:	df 93       	push	r29
  dc:	cf 93       	push	r28
  de:	cd b7       	in	r28, 0x3d	; 61
  e0:	de b7       	in	r29, 0x3e	; 62
  count++;
  e2:	80 91 60 00 	lds	r24, 0x0060
  e6:	8f 5f       	subi	r24, 0xFF	; 255
  e8:	80 93 60 00 	sts	0x0060, r24
}
  ec:	cf 91       	pop	r28
  ee:	df 91       	pop	r29
  f0:	8f 91       	pop	r24
  f2:	0f 90       	pop	r0
  f4:	0f be       	out	0x3f, r0	; 63
  f6:	0f 90       	pop	r0
  f8:	1f 90       	pop	r1
  fa:	18 95       	reti

I'll leave you to contemplate whether you really want your C compiler to be generating such long winded, slow and bloated code just so you don't have to think about using 'volatile' or so that the yellow arrow doesn't "jump about". I know what I'd want from a C compiler!

5) The watch window in AVR Studio is (usually) very simplistic. Each time code stops executing, after either a single step or when it hits a breakpoint, Studio redraws the contents of the watch window. It knows which locations in SRAM the variables are located at and it just reads what is in those locations and uses that to display each variable's current value.

This is all well and good as long as the code is updating the SRAM locations for a variable every time it is written (as happens with -O0 or 'volatile'). But one of the functions of the optimizer is to recognise when it can simply hold a local copy of the variable in a machine register and not bother to update the copy in SRAM. Also, sometimes a variable may never actually exist at all - in which case there'd never be a chance of watching it.

Here is a simple program to demonstrate some of this:

#include <avr/io.h>

int main(void) {
   uint8_t a, b, c;

   a = 5;
   b = 7;
   c = a * b;
   PORTB = c;   
   while(1);
}

When this is built without optimization (-O0) the generated code is:

   uint8_t a, b, c;

   a = 5;
  78:	85 e0       	ldi	r24, 0x05	; 5
  7a:	8b 83       	std	Y+3, r24	; 0x03
   b = 7;
  7c:	87 e0       	ldi	r24, 0x07	; 7
  7e:	8a 83       	std	Y+2, r24	; 0x02
   c = a * b;
  80:	9b 81       	ldd	r25, Y+3	; 0x03
  82:	8a 81       	ldd	r24, Y+2	; 0x02
  84:	98 9f       	mul	r25, r24
  86:	80 2d       	mov	r24, r0
  88:	11 24       	eor	r1, r1
  8a:	89 83       	std	Y+1, r24	; 0x01
   PORTB = c;   
  8c:	e8 e3       	ldi	r30, 0x38	; 56
  8e:	f0 e0       	ldi	r31, 0x00	; 0
  90:	89 81       	ldd	r24, Y+1	; 0x01
  92:	80 83       	st	Z, r24

The compiler creates the variables on the stack and uses the Y register to access them. 'a' is at RAM location 'Y+3', 'b' is at 'Y+2' and 'c' is at 'Y+1'. They are in RAM and they are updated each time they are written to by the "STD Y+n, Rn". This means that the Studio "watcher" has no problem showing you their current values as you step through this code.

Now consider the same program built with -Os, but first consider what the intention of this entire program is. Its final output is to write a value to PORTB. The input values are 5 and 7 and, sitting in your armchair, you can already tell that 5*7 is 35 (or 0x23). So now look at what the optimizing compiler actually generates:

0000006c <main>:
   uint8_t a, b, c;

   a = 5;
   b = 7;
   c = a * b;
   PORTB = c;
  6c:	83 e2       	ldi	r24, 0x23	; 35
  6e:	88 bb       	out	0x18, r24	; 24

Well that certainly does what the programmer intended and outputs 35/0x23 to the PORTB I/O location (0x18 for a mega16). But where are 'a', 'b' and 'c'? What RAM locations are they in? The answer is that they never existed, and as such no SRAM was ever set aside to hold their values. Why should the compiler waste time and space on them when all the program really does is output 35 to PORTB? So that's what the generated code does. Notionally a/b/c existed during compilation, but the compiler could see that their only use was to be assigned 5 and 7 and then the result of multiplying these. It can see that a, b and c are never used for anything else in the program, so it might as well do the multiplication (5*7=35) at compile time rather than leaving it to be done by the AVR at runtime, as was seen in the non-optimized code. An 80x86 processor running at several gigahertz is much quicker at multiplying 5 by 7 while it's compiling your AVR program than the AVR is at doing it. The final result (35) is known at compile time and there's no chance of it changing while the AVR is running, so why leave the math to the AVR?

But the bottom line is that you won't be able to watch a, b or c in the optimized version of that program and any attempt to do so will just show "Location not valid". Note that this is even true of 'c'. You might say that in the "LDI r24, 0x23" R24 was effectively the 'c' variable ('c' holds 35 and so does R24), but (unlike some other debuggers) Studio is not quite smart enough to make the association between the AVR's register number 24 and 'c'. It (usually) needs 'c' to be an actual location in SRAM for the debugger watch window to be able to "see" it.

While we're here I'll just mention another example of point (3) - the pointless for() loop - using this example:

#include <avr/io.h>

int main(void) {
   uint8_t a;

   for (a=0; a<10; a++) {
   }
   PORTB = a;   
   while(1);
}

You might think this will create a variable 'a' in RAM, have it count from 0 to 10 and then put 10 into PORTB. You'd be right about the very last part of that, but not about 'a' being created in RAM or there being a counting loop:

0000006c <main>:
int main(void) {
   uint8_t a;

   for (a=0; a<10; a++) {
   }
   PORTB = a;
  6c:	8a e0       	ldi	r24, 0x0A	; 10
  6e:	88 bb       	out	0x18, r24	; 24

All this program really does is output 10 to PORTB so that's all the compiler has generated.

Now at this stage you may be wondering (a) why didn't it discard that bit too, and (b) why do I keep using PORTB/C/D in these examples. It comes down to this: what you know of as PORTB is what the C compiler really sees as:

(*(volatile uint8_t *)((0x18) + 0x20))

Don't worry about how complex this looks except to note our old friend 'volatile' in there. While PORTB is really just the label for IO location 0x18 (SRAM location 0x38 - hence the +0x20 above) this construct tells the compiler to treat that location as if it were volatile - so code must always be generated to write to that location.

Again this all comes down to the core function of the optimizer - it will discard code that does not do anything useful. All computer programs have inputs, outputs or both. If they have neither then they might as well not exist. By using 'volatile' you are saying "this thing is an 'output' of this program, so you must generate code to write to it". If the program started by reading PINA as an input value then, guess what, PINA is defined as being 'volatile' too, so the compiler MUST go and read it before using the value it finds there.
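To illustrate, the same construct can be written out by hand. The MY_PORTB/MY_PINA names below are made up, and 0x38/0x39 are the mega16 SRAM addresses of PORTB and PINA (I/O addresses 0x18/0x19 plus the 0x20 offset):

```c
#include <stdint.h>

/* Hand-rolled equivalents of the <avr/io.h> macros - illustration only */
#define MY_PORTB (*(volatile uint8_t *)0x38)  /* I/O 0x18 + 0x20 */
#define MY_PINA  (*(volatile uint8_t *)0x39)  /* I/O 0x19 + 0x20 */

int main(void) {
    /* volatile forces the compiler to emit both the read and the write,
       even though it cannot "see" any use for them */
    MY_PORTB = MY_PINA;
    for (;;);
}
```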

6) The delay functions in <util/delay.h> have already been mentioned (as a better alternative to trying to use empty for() loops to create delays). They do have a "gotcha" of their own though. The functions _delay_ms(N) and _delay_us(N) are defined to take their input 'N' as a floating point value. So if you say _delay_ms(2) it's really _delay_ms(2.0), and the code in _delay_ms() then does a floating point calculation using this "2.0" to work out how many times _delay_loop_2() must be called to achieve the number of milliseconds requested.

If you use _delay_ms(2) and build with optimization switched on then, just like the c=a*b, c=35 example above, the 80x86 processor in your PC (which is better at maths than your AVR!) will do all the calculations necessary while compiling, because 2 (or rather 2.0) is a constant that is known at compile time. If however you use the function with -O0 (optimizer off) then you are saying "don't precalculate this during compilation but generate code so that the AVR will calculate it at run time". Unfortunately the floating point library code the AVR needs to do this is about 1K in size (and if you aren't linking libm.a a far worse version that is 3K is used). So your AVR program "bloats" by 1K..3K if you use the <util/delay.h> functions and don't have optimization switched on.

This also explains why, if instead of _delay_ms(2) you use _delay_ms(some_variable_that_changes), the program drags in 1K..3K of floating point library code even with optimization switched on: the compiler can no longer do the sums at compile time and, instead, the AVR must do them as it runs.
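A common workaround, if a run-time variable delay is needed without pulling in the floating point code, is to loop over a compile-time-constant base delay. A sketch (the function name delay_ms_var is made up; F_CPU is an assumption that must match your clock):

```c
#define F_CPU 8000000UL        /* assumed 8 MHz clock - set to yours */
#include <stdint.h>
#include <util/delay.h>

/* Delay for a run-time number of milliseconds by repeating a constant
   1 ms delay - the constant argument still folds at compile time, so
   no floating point code is dragged into the AVR image */
static void delay_ms_var(uint16_t ms)
{
    while (ms--)
        _delay_ms(1);
}
```

The trade-off is slight extra overhead per iteration from the loop itself, which is usually negligible for millisecond-scale delays.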

.... more to come in the next edit ...

 


Oh how I wish you'd written (and I'd subsequently read) this months ago... I spent hours trying to figure out what was wrong with my ISRs when I started coding in C :evil:

As for the optimizing, is there any way, short of writing out the asm, of getting gcc to leave certain blocks of code alone?

Come to that am I right in thinking that asm is left undisturbed?

Thank you for taking the time to write this up :D


Quote:

As for the optimizing, is there any way, short of writing out the asm, of getting gcc to leave certain blocks of code alone?

Come to that am I right in thinking that asm is left undisturbed?


Phil,

Depends what you mean. If you are talking about inline Asm (basically anything that the C compiler gets its hands on) there are no absolute guarantees about ordering. If, however, the Asm code is in separate .S files only presented to the assembler, then you have no worries about it being "mangled". Obviously to get from a .c to a .S involves a CALL and a RET - this is the price you pay for getting the Asm advantage.

You may want to search out a thread in the GCC forum from the last month where the use of the word volatile in the context of "asm volatile ("...")" was explored at length. The bottom line was that volatile in this context does not mean what the above article might lead you to believe it means - there are no guarantees about code ordering. This even goes as far as the library routines sei() and cli() and whether you can guarantee the actual SEI and CLI opcodes appearing at the exact point you positioned them in a .c file.

But this is really the subject for a different tutorial, as the use of volatile in that context (a bit like the multiple meanings of the 'static' keyword) is different from its use to guarantee read/write access to variables.

Cliff

PS I've had the following signature graphic for a long time - there's a reason why FAQ#1 is where it appears ;-)

 


Quote:
bit like the multiple meanings of the 'static' keyword
Indeed, I'm still coming to grips with this one :)

As for the "asm volatile" stuff, I've seen it around and sort of guessed it was something of the sort but not yet found myself in a situation where I needed to find out more... The thought just popped into my head while reading your tutorial. Shall read the thread now:)

Quote:
Asm code is in separate .S
and this I've not yet seen, shall have to file this nugget away for the right occasion.

Quote:
I've had the following signature graphic for a long time
Atmel should look into putting this on their packaging, especially 1, 3 & 5.

Thanks for the response Cliff.

Phil


Quote:

and this I've not yet seen, shall have to file this nugget away for the right occasion.

Start here:

http://www.nongnu.org/avr-libc/user-manual/group__asmdemo.html

 


clawson wrote:
You may want to search out a thread in the GCC forum from the last month where the use of the word volatile in the context of "asm volatile ("...")" was explored at length. The bottom line was that volatile in this context does not mean what the above article might have you believe it would mean - there's no guarantees about code ordering. This even goes as far as the library routines sei() and cli() and whether you can guarantee the actual SEI and CLI opcodes appearing at the exact point you may have positioned them in a .c file.
In a way, asm volatile is a complement to the other volatile.
One says something might be done to me behind your back.
The other says I might do something behind your back.
Quote:
But this is really the subject for a different tutorial as the use of volatile in that context (a bit like the multiple meanings of the 'static' keyword) is different from it's use to guarantee read/write access to variables.

Is it racist to discriminate against someone who changes species?


It's volatile not volaltile :)

John Samperi

Ampertronics Pty. Ltd.

www.ampertronics.com.au

* Electronic Design * Custom Products * Contract Assembly


John,

The irony of that is that it's my 'l' key that normally doesn't work right yet this time it's done over-time! :lol:

Cliff

(I'll correct it when I make the next major edit above - I've already thought of (5) and (6) to add above ;-) and I got some useful feedback from Jan)

 


clawson wrote:
You might say that in the "LDI r24, 0x23" that R24 was effectively the 'c' variable ('c' holds 35 and so does R24) but (unlike some other debuggers) Studio is not quite this smart to make the association between the AVR's register number 24 and 'c'. It (usually) needs 'c' to be an actual location in SRAM for the debugger watch window to be able to "see" it.
From this I infer that at least some compilers emit
the kind of debugging information necessary to follow
a variable when it is not stored in main memory.
Do they also emit the information necessary
when the variable changes locations?

In the case at hand, R24 shares an address space with SRAM.
Is the problem that c is local and memory location 24 is global?

Is it racist to discriminate against someone who changes species?


Quote:

From this I infer that at least some compilers emit
the kind of debugging information necessary to follow
a variable when it is not stored in main memory.
Do they also emit the information necessary
when the variable changes locations?

In the case at hand, R24 shares an address space with SRAM.
Is the problem that c is local and memory location 24 is global?


Michael, I think you'd have to ask the AVR Studio developers about that. You and I know that R24 is SRAM address 0x0018 so it should be possible to watch it but I can only assume that either the ELF doesn't contain the info to say 'c' is just R24 or Atmel's programmers don't parse this out of the ELF data and make the association.

 


Quote:
a thread in the GCC forum from the last month where the use of the word volatile in the context of "asm volatile ("...")" was explored at length
Should anyone be interested, I think this is the thread Cliff was talking about. http://www.avrfreaks.net/index.php?name=PNphpBB2&file=viewtopic&t=94571 Well worth a read:)


Thanks for this, great stuff!


Also, thank you from here too! I ran into the jumping arrow/weird debug/no variable watch problems exactly as described above. It all makes complete sense after reading your excellent tutorial on the topic.
Thank you!


Hi Cliff.
I just signed up as a member, and I did so simply because I admired your posting on this topic.
It is very eloquent, yet concise.
Well done !
When I grow up, I want to be like you.
Thanks,
-Karl


Quote:

When I grow up, I want to be like you.

Me too - but the chances of me ever growing up at this stage seem very remote.

Cliff (48 years, 18 days)

 


I wonder what my chances are...

Karl = 56Y 269D


Grow old?

I was not born... I was downloaded!!!


Thanks for taking your time to write this tutorial! It was very helpful!


Thanks Clawson, this is really helpful. Without this knowledge, I was going nuts trying to understand the debugger results.


clawson wrote:
This also explains why, if instead of using _delay_ms(2) you use _delay_ms(some_variable_that_changes) then even with optimization switched on the program drags in 1K..3K of floating point library code.

Cliff, I remember you mentioned in another post that _delay_ms() is meant to be used as a base delay, fixed at compile time, from which variable delays are built. If I try to pass a variable to this function, the compiler complains:

AVR Studio 5 wrote:
Error 2 __builtin_avr_delay_cycles expects an integer constant. util/delay.h 152 28

Great tutorial. Because AS5 defaults to -O1, I had been wondering why the sample code that uses an empty loop for a delay declares it volatile. I'll switch to -Os now.

Other than the "deterministic empty loop" that an optimiser drops, is there a rule of thumb for deciding when to use volatile to defeat the optimiser, and when to encourage optimisation instead?


Quote:

the compiler complains

Yes, you are using the later version of delay.h which has been fixed to emit an error when it is used wrongly - this is a distinct improvement, as it forces users to actually read the manual rather than simply guessing at how the function probably works.

As for when to optimise: always and often - why would you want a compiler to generate sub-optimal code?

(debugging is the one exception but you don't deploy the debug code - you deploy the release code and you want it as efficient as possible).

 


clawson wrote:
(debugging is the one exception but you don't deploy the debug code - you deploy the release code and you want it as efficient as possible).

(BTW, AS5 only defaults to -O1 in the Debug configuration; in the Release configuration the default is -Os. Debug, in turn, is the default configuration for new project templates as well as for all the sample applications I downloaded from Atmel.) When I wrote my first program from scratch (using a blank template), I didn't use volatile. Tests worked as expected. Had I switched to the Release configuration, I would have been unpleasantly surprised to see all my elaborate arithmetic result in nothing but a steady blink. (After reading this, I deliberately removed volatile and switched to Release. Sure enough, all the variation is gone.)

The tutorial seems to suggest that I shouldn't be applying volatile to all variables, or the compiler will not optimise as much as it should. How do I know when I should force volatile even though I do not expect another program or peripheral to alter a variable? I can see that an interrupt handler should be considered another program. But there's another example in which a non-empty loop also needs a volatile declaration to maintain the "expected", or designed, behaviour. Should every "loop variable" be volatile?


Atmel, you have to realise, are like kids with a new toy who haven't quite worked out how it actually works. Any regular user of avr-gcc knows that the -O0 they use for "Debug" is not acceptable in any circumstance. Some idiot at Atmel thought this was the right choice for a "Debug" build. It is not. Its only purpose is for the compiler writers to check the unoptimized code that is initially generated. While it does make programs that can be easily debugged, because of the straight one-to-many relation between C source and generated Asm, it is utterly and totally pointless to debug that code as it in no way represents the code you are finally going to be running. Try adjusting the watchdog, changing the CLKPR register or setting the JTD bit using a program built -O0 and it will never work. At the very least Atmel should have used -O1 for "Debug", but even then it's going to behave differently to the final -Os/-O3 program. What Atmel should do is improve the debugger so it can more easily track locals that are cached in registers, and encourage users to debug -Os code.

As I say, the best solution is to use -Os and simply not use 'volatile'. Instead, debug in the mixed C+Asm view (by which you also quickly learn AVR assembler) and work out which machine registers locals are cached in, and watch those rather than using the "Watch window". If you have a problem with this then, only while debugging a localised section of the code, temporarily make any locals you think you need to watch there 'volatile'. When you are happy that that code section works, remove the volatile.

That debugging use of volatile has nothing to do with FAQ#1. Any variable that is accessed from two threads of execution must ALWAYS be volatile, whether you want to watch it in a debugger or not.

But keep this thought in mind - every time you make a variable volatile you make your program bigger and slower. So do it with care and don't hand out the 'volatile's like they were sweeties.

 


We are using the -Os -mcall-prologues optimization.
The avr-libc user manual says that this is the most universally useful optimization level. Can you explain this optimization, and what considerations need to be taken care of when using it?

P.Ashok Kumar


 


Quote:

We are using the -Os -mcall-prologues optimization.
The avr-libc user manual says that this is the most universally useful optimization level. Can you explain this optimization, and what considerations need to be taken care of when using it?

Normally each ISR has a unique "prologue" and "epilogue" - a few housekeeping instructions to push and pop various portions of the AVR's register set which the ISR needs for its own use, to prevent the existing values from being lost. This is great for low-latency ISRs, as only the registers used are saved and restored.

If you have a large set of ISRs each with a lengthy prologue and epilogue, there can be a lot of flash memory wasted storing the individual ISR prologue/epilogue code. To combat this, you can use the -mcall-prologues switch. This will make every ISR call a single unified ISR prologue and epilogue routine, which will in turn save and restore the entire AVR register set regardless of which registers are actually required in the ISR.

The downside is that every ISR now has a lot more latency, due to the extra CALL/RET instruction pairs to jump to the unified prologue/epilogue functions, and because you now have to wait for all the registers to be saved and restored regardless of which are used. The upside is a space saving if the one unified prologue/epilogue function pair takes up less space than the sum of the individual ISR prologue/epilogue sequences.

TL;DR: it increases ISR latency, but will reduce overall flash memory consumption if you have a lot of complex ISR handlers in your application.
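If you want to see the trade-off on your own project, one way (a sketch - main.c and atmega328p are placeholders, and this assumes avr-gcc and avr-size are on your PATH) is to build the same source both ways and compare the flash usage:

```shell
# Build identical source with and without -mcall-prologues,
# then compare the .text (flash) columns reported by avr-size.
avr-gcc -mmcu=atmega328p -Os main.c -o plain.elf
avr-gcc -mmcu=atmega328p -Os -mcall-prologues main.c -o prologues.elf
avr-size plain.elf prologues.elf
```

On tiny programs the unified prologue/epilogue routines can actually cost space; the switch only pays off once there are enough functions and ISRs with non-trivial register usage.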

- Dean :twisted:

Make Atmel Studio better with my free extensions. Open source and feedback welcome!


Thanks Cliff and Dean :D


This really is a great help! Thanks!