GCC is an optimizing C compiler. Used well it can, like the other good AVR C compilers, generate code that is perhaps only 10% larger/slower than hand-crafted assembler. There is, however, a penalty to be paid for it trying to generate the most efficient code possible.
Those beginning to program the AVR usually first encounter this in one of a number of places:
1) you build a C program then try to run it in AVR Studio's debugger or simulator. As you single step the code the yellow arrow that shows the point of execution appears to jump about and not follow the sequential path of execution you may have been expecting to see.
2) a program uses one or more interrupts and there are one or more variables used in both the interrupt handlers (ISR) and the main() code that appear to be ignored by the code in main. The ISR changes the variable but the main code does not "see" the change.
3) you try to insert a delay into the C code by simply using an empty for() loop counting from 0 to N but it doesn't delay at all.
4) you try to use one of the features of the AVR where Atmel dictate that two operations must happen within a fixed number of cycles (often 4 cycles) and you cannot make the sequence work (such as writing twice to the JTD bit to disable the JTAG interface)
5) You attempt to debug a program in the Studio debugger/simulator but when you try to add a variable to the "watch window" it always shows "variable not in scope"
6) You use the delay routines in <util/delay.h> and either the delays are wrong or the program size balloons by a kilobyte or more.
Looking at each of these in turn:
1) When GCC's optimiser is switched on (using a -O option other than -O0, such as -Os or -O3) it will optimize the result: discarding code sections that apparently do nothing useful, re-using the same sequence of opcodes if it appears several times, inlining small functions, reordering code so that registers involved in one C statement may be initialized long before they are actually used, and many other techniques to reduce the size and increase the speed of the resulting code.
Without optimization there's usually a one-to-many relationship between each C source statement and the opcodes generated to implement it. When a disassembly listing (such as the .s or .lss file) is studied, a single C statement may be found to have generated 5..10 or more AVR opcodes, but each block of opcodes is distinct and identifiable. This example program:
int main(void) {
PORTB = 0x55;
PORTD = 0x55;
while(1);
}
generates:
PORTB = 0x55;
  74: e8 e3        ldi r30, 0x38  ; 56
  76: f0 e0        ldi r31, 0x00  ; 0
  78: 85 e5        ldi r24, 0x55  ; 85
  7a: 80 83        st  Z, r24
PORTD = 0x55;
  7c: e2 e3        ldi r30, 0x32  ; 50
  7e: f0 e0        ldi r31, 0x00  ; 0
  80: 85 e5        ldi r24, 0x55  ; 85
  82: 80 83        st  Z, r24
You don't need to know what all that (from the .lss file that was output) means - though a C programmer should have an understanding of the underlying asm. But simplistically note that because the value 0x55 was used to set two different AVR registers (PORTB and PORTD) this has led to the value 0x55 being loaded into R24 twice. There is indeed a separate block of opcodes for each C statement, but it's very wasteful.
If the same code is built with optimization the result is much more compact and efficient:
PORTB = 0x55;
  6c: 85 e5        ldi r24, 0x55  ; 85
  6e: 88 bb        out 0x18, r24  ; 24
PORTD = 0x55;
  70: 82 bb        out 0x12, r24  ; 18
In this new version the value 0x55 is only loaded into R24 once. On the one hand this is good, as it's (in part) what makes the code more compact (the other part being the switch from ST to the much more efficient OUT). But hopefully it's now obvious that the code used to implement "PORTB = 0x55" and "PORTD = 0x55" is effectively sharing an AVR opcode.
So there's no longer a simple connection between one C statement and one block of AVR opcodes. Now, in this simplest of examples, one opcode is being shared by two C statements. This is simply because the optimizer has recognized "why would I bother loading 0x55 into R24 again when I know it already contains that value?"
But it's more complex versions of this that can lead to the "yellow arrow" jumping about. Even just a small modification to the example:
int main(void) {
PORTB = 0x55;
PORTC = 0xAA; // this line added
PORTD = 0x55;
while(1);
}
leads to the code being:
PORTB = 0x55;
  6c: 95 e5        ldi r25, 0x55  ; 85
  6e: 98 bb        out 0x18, r25  ; 24
PORTC = 0xAA;
  70: 8a ea        ldi r24, 0xAA  ; 170
  72: 85 bb        out 0x15, r24  ; 21
PORTD = 0x55;
  74: 92 bb        out 0x12, r25  ; 18
The first and third lines of the source now share "ldi r25, 0x55", and if you run this code in the simulator (it will build/simulate for mega16 and other processors) you will find that the yellow arrow starts on the opening brace of main(), but on the next step it goes straight to the PORTC=0xAA line having skipped the PORTB=0x55 line; then it moves to the PORTD=0x55 line, then it disappears into the while(1) at the end, never to be seen again.
If you were just stepping the C this might lead you to believe that the PORTB=0x55 line had never been executed. However, the IO view in the simulator/debugger will show that all the statements were executed.
A very useful technique for debugging optimized C code is to start the debugger/simulator which will show the yellow arrow on the opening brace of main(). Now select "Disassembler" on the View menu. This opens a new window where the yellow arrow is now positioned on the first opcode of:
+00000036: E595   LDI R25,0x55   Load immediate
+00000037: BB98   OUT 0x18,R25   Out to I/O location
6:         PORTC = 0xAA;
+00000038: EA8A   LDI R24,0xAA   Load immediate
+00000039: BB85   OUT 0x15,R24   Out to I/O location
7:         PORTD = 0x55;
+0000003A: BB92   OUT 0x12,R25   Out to I/O location
While this window (and not the C source window) has the focus pressing the [Step Into] icon (or pressing F11) will execute just a single AVR opcode, not an entire C statement on each click/press. If you press it once the LDI is executed and the yellow arrow halts on the first OUT instruction. If you keep stepping you will be happy to see that ALL these statements are executed in turn and nothing is really being missed out in the execution of the program at all.
It's just that Studio's job is made a bit tricky when it can no longer just execute a single block of opcodes for each C statement. Hence the "yellow arrow jumps around" effect. If you are puzzled always switch to the mixed C/Asm view and follow the opcodes.
For points (2) and (3) above here is a program that demonstrates the two points:
#include <avr/io.h>
#include <avr/interrupt.h>

char count;

int main(void) {
    char i;
    count = 0;
    TIMSK = (1<<TOIE0);    // overflow interrupts
    TCCR0 |= (1<<CS01);    // start timer0 no prescale
    sei();
    for (i=0; i<100; i++) {
        // just delay
    }
    while (count < 10) {
        PORTB = 0xAA;
    }
    PORTB = 0x55;
    while(1);
}

ISR(TIMER0_OVF_vect) {
    count++;
}
When built for mega16 with -Os this shows the two commonest optimization "gotchas" that beginners may not be aware of. The code generated is:
0000007c:
int main(void) {
  char i;
  count = 0;
    7c: 10 92 61 00  sts 0x0061, r1
  TIMSK = (1<<TOIE0);    // overflow interrupts
    80: 81 e0        ldi r24, 0x01  ; 1
    82: 89 bf        out 0x39, r24  ; 57
  TCCR0 |= (1<<CS01);    // start timer0 no prescale
    84: 83 b7        in  r24, 0x33  ; 51
    86: 82 60        ori r24, 0x02  ; 2
    88: 83 bf        out 0x33, r24  ; 51
  sei();
    8a: 78 94        sei
  for (i=0; i<100; i++) {
    // just delay
  }
  while (count < 10) {
    PORTB = 0xAA;
    8c: 8a ea        ldi r24, 0xAA  ; 170
    8e: 88 bb        out 0x18, r24  ; 24
    90: fe cf        rjmp .-4       ; 0x8e
Again you don't have to be an AVR Asm expert and try to understand all of this (though if you can it's a really useful skill to have) but it's hopefully obvious that a lot of the program appears to have "gone missing"!
The for() loop does not seem to have generated any code, there's no sign of code using the value 0x55, and execution appears to be stuck forever in the first while() loop. This is because:
a) The compiler, with optimization, will discard pointless code. After starting the timer there is a "delay" using a count from 0 to 99. That for() loop has generated no code whatsoever (the disassembly shows the source lines but no opcodes generated for them). This is because the delay has no inputs and no outputs, so to the compiler, which is trying to make the code as small and fast as possible, it just seems pointless - so it is discarded.
A simple solution is presented below but, to be honest, if using AVR-LibC it makes far more sense to not try and code your own delay loops but, instead, use the _delay_ms(), _delay_us(), _delay_loop_1() and _delay_loop_2() found in <util/delay.h> and <util/delay_basic.h>:
http://www.nongnu.org/avr-libc/u...
http://www.nongnu.org/avr-libc/u...
b) the real "gotcha" in the program example (and in the use of the optimizer in general) is the use of the variable 'count' in this program. The idea had been that the timer with interrupts would be started, the timer would interrupt for 10 overflows, incrementing 'count' each time - at least it does this bit OK:
ISR(TIMER0_OVF_vect) {
92: 1f 92 push r1
94: 0f 92 push r0
96: 0f b6 in r0, 0x3f ; 63
98: 0f 92 push r0
9a: 11 24 eor r1, r1
9c: 8f 93 push r24
count++;
9e: 80 91 61 00 lds r24, 0x0061
a2: 8f 5f subi r24, 0xFF ; 255
a4: 80 93 61 00 sts 0x0061, r24
}
a8: 8f 91 pop r24
aa: 0f 90 pop r0
ac: 0f be out 0x3f, r0 ; 63
ae: 0f 90 pop r0
b0: 1f 90 pop r1
b2: 18 95 reti
The "problem" is that the while loop in main() was supposed to keep an eye on 'count': while it was less than 10 you may have been expecting to see some code that output 0xAA to PORTB, then when main() saw that 'count' had reached 10 it should have left the while(count<10) loop and output 0x55 instead, followed by an infinite while(1) loop. Yet in what's been generated there's no sign of any code that would ever output 0x55 or fall into the final "while(1)" loop.
The reason is that as far as the compiler is concerned when it's compiling main() it enters with count=0 (all globals default to 0) and then there is no way (as far as main() and the compiler is concerned) that count can ever change value. The compiler cannot "see" or know that the value of 'count' may be changed in the separate ISR() function so it compiles the program as if it had been written as:
sei();
for (i=0; i<100; i++) {
// just delay
}
while (1) { // count would ALWAYS be less than 10
PORTB = 0xAA;
}
PORTB = 0x55;
while(1);
In which execution could never escape that first while() loop and the PORTB=0x55 and the final while(1) can never be reached. So the program ends at:
  8c: 8a ea        ldi r24, 0xAA  ; 170
  8e: 88 bb        out 0x18, r24  ; 24
  90: fe cf        rjmp .-4       ; 0x8e
which repeatedly outputs 0xAA to PORTB just as the compiler believes it was asked to do.
If it's required that this program behave as originally written, it's possible to tell the compiler that the variables 'i' and 'count' must not be ignored by using the keyword 'volatile', which means "this variable is possibly subject to use elsewhere so you must always read/write it when told to". With the modification as follows:
volatile char count;
int main(void) {
volatile char i;
the code generated becomes:
sei();
94: 78 94 sei
for (i=0; i<100; i++) {
96: 19 82 std Y+1, r1 ; 0x01
98: 03 c0 rjmp .+6 ; 0xa0
9a: 89 81 ldd r24, Y+1 ; 0x01
9c: 8f 5f subi r24, 0xFF ; 255
9e: 89 83 std Y+1, r24 ; 0x01
a0: 89 81 ldd r24, Y+1 ; 0x01
a2: 84 36 cpi r24, 0x64 ; 100
a4: d0 f3 brcs .-12 ; 0x9a
a6: 02 c0 rjmp .+4 ; 0xac
// just delay
}
while (count < 10) {
PORTB = 0xAA;
a8: 98 bb out 0x18, r25 ; 24
aa: 01 c0 rjmp .+2 ; 0xae
ac: 9a ea ldi r25, 0xAA ; 170
TCCR0 |= (1<<CS01); // start timer0 no prescale
sei();
for (i=0; i<100; i++) {
// just delay
}
while (count < 10) {
ae: 80 91 60 00 lds r24, 0x0060
b2: 8a 30 cpi r24, 0x0A ; 10
b4: c8 f3 brcs .-14 ; 0xa8
PORTB = 0xAA;
}
PORTB = 0x55;
b6: 85 e5 ldi r24, 0x55 ; 85
b8: 88 bb out 0x18, r24 ; 24
ba: ff cf rjmp .-2 ; 0xba
in which the for() loop has generated some delaying code, a check is repeatedly made on the 'count' variable, and because the ISR will eventually increase it to 10 the code will then go on to output the value 0x55 and enter the final "rjmp .-2", which is the while(1) loop.
4) the problem with timed sequences is that Atmel dictates that certain registers must be written twice within 4 cycles. For example a typical sequence to disable JTAG is:
MCUCSR = (1<<JTD);
MCUCSR = (1<<JTD);
and to set a new value into CLKPR it's typically:
CLKPR = (1<<CLKPCE);
CLKPR = 0;
The MCUCSR or CLKPR will be written using either OUT or STS depending on where they are located in memory. The requirement is that the two writing instructions happen within 4 cycles.
When this code is built with optimization enabled (-Os in this case) the sequences generated are:
MCUCSR = (1<<JTD);
  6c: 80 e8        ldi r24, 0x80  ; 128
  6e: 84 bf        out 0x34, r24  ; 52
MCUCSR = (1<<JTD);
  70: 84 bf        out 0x34, r24  ; 52
and
CLKPR = (1<<CLKPCE);
  80: 80 e8        ldi r24, 0x80  ; 128
  82: 80 93 61 00  sts 0x0061, r24
CLKPR = 0;
  86: 10 92 61 00  sts 0x0061, r1
In both of these the writes (OUT, STS) are so close together that there is no worry about them meeting the four cycle requirement.
If the same code is built using -O0 to turn off optimization then the generated code is far more long-winded and it's far more likely that it will not meet the 4 cycle timing requirement:
  74: e4 e5        ldi r30, 0x54  ; 84
  76: f0 e0        ldi r31, 0x00  ; 0
  78: 80 e8        ldi r24, 0x80  ; 128
  7a: 80 83        st  Z, r24
MCUCSR = (1<<JTD);
  7c: e4 e5        ldi r30, 0x54  ; 84
  7e: f0 e0        ldi r31, 0x00  ; 0
  80: 80 e8        ldi r24, 0x80  ; 128
  82: 80 83        st  Z, r24
and
CLKPR = (1<<CLKPCE);
  88: e1 e6        ldi r30, 0x61  ; 97
  8a: f0 e0        ldi r31, 0x00  ; 0
  8c: 80 e8        ldi r24, 0x80  ; 128
  8e: 80 83        st  Z, r24
CLKPR = 0;
  90: e1 e6        ldi r30, 0x61  ; 97
  92: f0 e0        ldi r31, 0x00  ; 0
  94: 10 82        st  Z, r1
Finally, another way to "fix" the counting program would have been to build the entire program with -O0, which would have generated:
0000007c:
#include <avr/io.h>
#include <avr/interrupt.h>

char count;

int main(void) {
    7c: df 93        push r29
    7e: cf 93        push r28
    80: 0f 92        push r0
    82: cd b7        in  r28, 0x3d  ; 61
    84: de b7        in  r29, 0x3e  ; 62
  char i;
  count = 0;
    86: 10 92 60 00  sts 0x0060, r1
  TIMSK = (1<<TOIE0);    // overflow interrupts
    8a: e9 e5        ldi r30, 0x59  ; 89
    8c: f0 e0        ldi r31, 0x00  ; 0
    8e: 81 e0        ldi r24, 0x01  ; 1
    90: 80 83        st  Z, r24
  TCCR0 |= (1<<CS01);    // start timer0 no prescale
    92: a3 e5        ldi r26, 0x53  ; 83
    94: b0 e0        ldi r27, 0x00  ; 0
    96: e3 e5        ldi r30, 0x53  ; 83
    98: f0 e0        ldi r31, 0x00  ; 0
    9a: 80 81        ld  r24, Z
    9c: 82 60        ori r24, 0x02  ; 2
    9e: 8c 93        st  X, r24
  sei();
    a0: 78 94        sei
  for (i=0; i<100; i++) {
    a2: 19 82        std Y+1, r1    ; 0x01
    a4: 03 c0        rjmp .+6       ; 0xac
    a6: 89 81        ldd r24, Y+1   ; 0x01
    a8: 8f 5f        subi r24, 0xFF ; 255
    aa: 89 83        std Y+1, r24   ; 0x01
    ac: 89 81        ldd r24, Y+1   ; 0x01
    ae: 84 36        cpi r24, 0x64  ; 100
    b0: d0 f3        brcs .-12      ; 0xa6
    b2: 04 c0        rjmp .+8       ; 0xbc
    // just delay
  }
  while (count < 10) {
    PORTB = 0xAA;
    b4: e8 e3        ldi r30, 0x38  ; 56
    b6: f0 e0        ldi r31, 0x00  ; 0
    b8: 8a ea        ldi r24, 0xAA  ; 170
    ba: 80 83        st  Z, r24
  while (count < 10) {
    bc: 80 91 60 00  lds r24, 0x0060
    c0: 8a 30        cpi r24, 0x0A  ; 10
    c2: c0 f3        brcs .-16      ; 0xb4
    PORTB = 0xAA;
  }
  PORTB = 0x55;
    c4: e8 e3        ldi r30, 0x38  ; 56
    c6: f0 e0        ldi r31, 0x00  ; 0
    c8: 85 e5        ldi r24, 0x55  ; 85
    ca: 80 83        st  Z, r24
    cc: ff cf        rjmp .-2       ; 0xcc

000000ce <__vector_9>:
  while(1);
}
ISR(TIMER0_OVF_vect) {
    ce: 1f 92        push r1
    d0: 0f 92        push r0
    d2: 0f b6        in  r0, 0x3f   ; 63
    d4: 0f 92        push r0
    d6: 11 24        eor r1, r1
    d8: 8f 93        push r24
    da: df 93        push r29
    dc: cf 93        push r28
    de: cd b7        in  r28, 0x3d  ; 61
    e0: de b7        in  r29, 0x3e  ; 62
  count++;
    e2: 80 91 60 00  lds r24, 0x0060
    e6: 8f 5f        subi r24, 0xFF ; 255
    e8: 80 93 60 00  sts 0x0060, r24
}
    ec: cf 91        pop r28
    ee: df 91        pop r29
    f0: 8f 91        pop r24
    f2: 0f 90        pop r0
    f4: 0f be        out 0x3f, r0   ; 63
    f6: 0f 90        pop r0
    f8: 1f 90        pop r1
    fa: 18 95        reti
I'll leave you to contemplate whether you really want your C compiler to be generating such long winded, slow and bloated code just so you don't have to think about using 'volatile' or so that the yellow arrow doesn't "jump about". I know what I'd want from a C compiler!
5) The watch window in AVR Studio is (usually) very simplistic. Each time code stops executing, after either a single step or when it hits a breakpoint, Studio redraws the contents of the watch window. It knows which locations in SRAM the variables are located at and it just reads what is in those locations and uses that to display each variable's current value.
This is all well and good as long as the code is updating the SRAM locations for a variable every time they are written (as happens with -O0 or 'volatile'). But one of the functions of the optimizer is to recognise when it can simply hold a local copy of the variable in a machine register and not bother to update the copy in SRAM. Also, sometimes a variable may never actually exist at all - in which case there'd never be a chance of watching it.
Here is a simple program to demonstrate some of this:
#include <avr/io.h>

int main(void) {
    uint8_t a, b, c;
    a = 5;
    b = 7;
    c = a * b;
    PORTB = c;
    while(1);
}
When this is built without optimization (-O0) the generated code is:
uint8_t a, b, c;
a = 5;
  78: 85 e0        ldi r24, 0x05  ; 5
  7a: 8b 83        std Y+3, r24   ; 0x03
b = 7;
  7c: 87 e0        ldi r24, 0x07  ; 7
  7e: 8a 83        std Y+2, r24   ; 0x02
c = a * b;
  80: 9b 81        ldd r25, Y+3   ; 0x03
  82: 8a 81        ldd r24, Y+2   ; 0x02
  84: 98 9f        mul r25, r24
  86: 80 2d        mov r24, r0
  88: 11 24        eor r1, r1
  8a: 89 83        std Y+1, r24   ; 0x01
PORTB = c;
  8c: e8 e3        ldi r30, 0x38  ; 56
  8e: f0 e0        ldi r31, 0x00  ; 0
  90: 89 81        ldd r24, Y+1   ; 0x01
  92: 80 83        st  Z, r24
The compiler creates the variables on the stack and uses the Y register to access them: 'a' is at RAM location 'Y+3', 'b' is at 'Y+2' and 'c' is at 'Y+1'. They are in RAM and they are updated each time they are written to by the "STD Y+n, Rn". This means that the Studio "watcher" has no problem showing you their current values as you step through this code.
Now consider the same program built with -Os, but first consider what the intention of this entire program is. Its final output is to write a value to PORTB. The input values are 5 and 7, and sitting in your armchair you can already tell that 5*7 is 35 (or 0x23). So now look at what the optimizing compiler actually generates:
0000006c:
uint8_t a, b, c;
a = 5;
b = 7;
c = a * b;
PORTB = c;
  6c: 83 e2        ldi r24, 0x23  ; 35
  6e: 88 bb        out 0x18, r24  ; 24
Well, that certainly does what the programmer intended and outputs 35/0x23 to the PORTB I/O location (0x18 for a mega16). But where are 'a', 'b' and 'c'? What RAM locations are they in? The answer is that they never existed and as such there was never any SRAM set aside to hold their values. Why should the compiler waste time and space on that when all the program really does is output 35 to PORTB? So that's what the generated code does. Notionally a/b/c existed during compilation, but the compiler could see that their only use was to be assigned 5 and 7 and then the result of multiplying these. It can see that a, b and c are never used for anything else in the program, so it might as well do the multiplication (5*7=35) at compile time rather than leaving it to be done by the AVR at runtime, as was seen in the non-optimized code. An 80x86 processor running at several gigahertz is much quicker at multiplying 5 by 7 while it's compiling your AVR program than the AVR is at doing it. The final result (35) is known at compile time and there's no chance of it changing while the AVR is running, so why bother leaving the math to the AVR?
But the bottom line is that you won't be able to watch a, b or c in the optimized version of that program and any attempt to do so will just show "Location not valid". Note that this is even true of 'c'. You might say that in the "LDI r24, 0x23" R24 was effectively the 'c' variable ('c' holds 35 and so does R24), but (unlike some other debuggers) Studio is not quite smart enough to make the association between the AVR's register number 24 and 'c'. It (usually) needs 'c' to be an actual location in SRAM for the debugger watch window to be able to "see" it.
While we're here I'll just mention another example of point (3) - the pointless for() loop - using this example:
#include <avr/io.h>

int main(void) {
    uint8_t a;
    for (a=0; a<10; a++) {
    }
    PORTB = a;
    while(1);
}
So you might think this will create a variable 'a' in RAM and have it count from 0 to 10 and then put 10 into PORTB. You'd be right about the very last part of that but not about 'a' being created in RAM or there being a counting loop:
0000006c:
int main(void) {
  uint8_t a;
  for (a=0; a<10; a++) {
  }
  PORTB = a;
  6c: 8a e0        ldi r24, 0x0A  ; 10
  6e: 88 bb        out 0x18, r24  ; 24
All this program really does is output 10 to PORTB so that's all the compiler has generated.
Now at this stage you may be wondering (a) why didn't it discard that bit too, and (b) why do I keep using PORTB/C/D in these examples. It comes down to this: what you know of as PORTB is what the C compiler really sees as:
(*(volatile uint8_t *)((0x18) + 0x20))
Don't worry about how complex this looks except to note our old friend 'volatile' in there. While PORTB is really just the label for IO location 0x18 (SRAM location 0x38 - hence the +0x20 above) this construct tells the compiler to treat that location as if it were volatile - so code must always be generated to write to that location.
Again this all comes down to the core function of the optimizer - it will discard code that does not do anything useful. All computer programs have inputs, outputs or both. If they have neither then they might as well not exist. By using 'volatile' you are saying "this thing is the "output" of this program so you must generate code to write to the output". If the program started by reading PINA as an input value then guess what, PINA is defined as being 'volatile' too so the compiler MUST go and read it before using the value it finds there.
6) The delay functions in <util/delay.h>:
If you use _delay_ms(2) and build with optimization switched on then, just like the c=a*b/c=35 example above, the 80x86 processor in your PC (which is better at maths than your AVR!) will do all the calculations necessary while compiling, because 2 (or rather 2.0) is a constant that is known at compile time. If, however, you use the function with -O0 (optimizer off) then you are saying "don't precalculate this during compilation but generate code so that the AVR will calculate it at run time". Unfortunately the floating point library code the AVR needs to do this is about 1K in size (and if you aren't using libm.a a far worse version that is 3K is used). So your AVR program "bloats" by 1K..3K if you use these delay routines without optimization enabled.
This also explains why, if instead of using _delay_ms(2) you use _delay_ms(some_variable_that_changes) then even with optimization switched on the program drags in 1K..3K of floating point library code. This is because even with optimization the 80x86 can no longer do the sums at compile time and, instead, the AVR must do them as it runs.
.... more to come in the next edit ...