Newby In Need of Debugging Tips.


I am a software engineer but my background is in Windows application development. My first exposure to the AVR has turned into a trial-by-fire scenario. I have taken over development of a system with an ATMega32 that provides a serial user interface to a PC, does some basic control and monitoring functions, and translates I/O from two legacy HC11 controllers through a pair of external UARTs. This equates to about 2500 lines of C code. A contractor wrote the ATMega32 code so I am not completely familiar with it.

The system is functional but has a few small bugs (and one BIG one). After running for several days the AVR will simply stop responding to the internal serial port. I have only seen this happen if the AVR is in a mode where it is continually polling both HC11s through the external UARTs. It also stops sending status requests to the HC11s. I’m guessing it’s stuck in a loop somewhere. Without some type of trace functionality I’m having a hard time locating it. I’m also wondering if a stack overrun or memory leak might cause this type of behavior.

Can someone suggest a good primer or reference for debugging? I need to know how to introduce serial data, generate timer interrupts, and step through interrupt routines. The only tools I have so far are CodeVisionAVR, AVR Studio 4 and an STK500. There appear to be a lot of ICE options. Any suggestions on which is the best bang for the buck?

Terry Slocum


There is a recent thread here on how to do stack depth checking. Basically, declare char lastvar, init it to 0xaa, run the program for a while, and check to see if the 0xaa got written over. Does your app use interrupts? Leaving an address on the stack accidentally after some sort of reset would slowly grow the stack, but 2 days? I mean, there's only a couple of K to fill up. There are several CV experts here. I'm more familiar with the ImageCraft compiler.
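
A minimal sketch of the fill-pattern idea, written as host-runnable C rather than CodeVision code. The helper names `paint_region` and `count_untouched` are hypothetical, and the sketch assumes a stack that grows downward into the painted region; where that region actually lies on the Mega32 depends on the compiler's memory layout.

```c
#include <stddef.h>
#include <stdint.h>

#define FILL_PATTERN 0xAAu

/* Fill a region with the marker pattern. On a target this would run once at
   startup over the unused RAM between the last global and the stack. */
void paint_region(uint8_t *region, size_t len)
{
    size_t i;
    for (i = 0; i < len; i++) {
        region[i] = FILL_PATTERN;
    }
}

/* Count how many bytes at the low end of the region still hold the pattern.
   Assumes the stack grows downward from the high end, so the first
   overwritten byte marks the deepest point the stack has reached. */
size_t count_untouched(const uint8_t *region, size_t len)
{
    size_t n = 0;
    while (n < len && region[n] == FILL_PATTERN) {
        n++;
    }
    return n;
}
```

The count could be reported over the existing serial link while the system runs. A result near zero suggests the stack is reaching the globals; note that a stack byte that happens to equal 0xAA would make the margin look slightly larger than it is.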

Imagecraft compiler user


Wow. :roll: A 2 day bug could be tough to find, especially in code you're not familiar with. Is it a consistent time period or is it an average, like "Likely to stop in about 2 days?"

With only 2k of memory, I'd expect a memory leak type bug would show up lots faster than that, though if the leak is a byte every 10 minutes or so... If the time period is consistent, I'd look for an event counter wrapping around and goofing up a comparison because it's no longer greater than it was.
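
To illustrate the wraparound hazard, here is a small host-runnable sketch comparing two ways of checking whether an interval has elapsed on a free-running 16-bit counter. The function names are hypothetical and nothing here is taken from the actual project code; it only shows the general technique of comparing elapsed ticks instead of an absolute deadline.

```c
#include <stdint.h>

/* Returns nonzero once at least `interval` ticks have passed since `start`.
   The subtraction is taken modulo 2^16, so the result stays correct when the
   counter wraps from 0xFFFF to 0, provided fewer than 65536 ticks elapse. */
int ticks_elapsed(uint16_t start, uint16_t now, uint16_t interval)
{
    return (uint16_t)(now - start) >= interval;
}

/* The fragile form: compute an absolute deadline, then compare against it.
   If start + interval wraps, the deadline becomes a small number and the
   comparison is true immediately. */
int deadline_passed_naive(uint16_t start, uint16_t now, uint16_t interval)
{
    uint16_t deadline = (uint16_t)(start + interval);
    return now >= deadline;
}
```

With `start = 0xFFF0` and `interval = 0x20`, the naive form reports the deadline as passed after only 8 ticks, while the elapsed-ticks form waits the full interval. A loop that instead waits for the counter to exceed a deadline of 0xFFFF can never finish, since a 16-bit counter never exceeds that value.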

If the timing of the failure is random, and just averages to about 2 days, I'd suspect a timing problem where it goofs up if event a happens before event b but almost always happens after b.

---
Formerly Torby. Stitch626 just seemed a more descriptive nickname.


bobgardner wrote:
There is a recent thread here on how to do stack depth checking. Basically, declare char lastvar, init it to 0xaa, run the program for a while, and check to see if the 0xaa got written over. Does your app use interrupts? Leaving an address on the stack accidentally after some sort of reset would slowly grow the stack, but 2 days? I mean, there's only a couple of K to fill up. There are several CV experts here. I'm more familiar with the ImageCraft compiler.

I think I found the thread. Thanks. Still looking through it. I still don't understand how I check the memory with the program running in the circuit?

Yes, it uses external interrupts to get data from the external UARTs. There is also a free-running global counter that increments on a timer interrupt and is used for time delays. This is written in C, so as far as I know the compiler takes care of the stack. Am I missing something?

Thanks for your help.

Terry


Stitch626 wrote:
Wow. :roll: A 2 day bug could be tough to find, especially in code you're not familiar with. Is it a consistent time period or is it an average, like "Likely to stop in about 2 days?"

With only 2k of memory, I'd expect a memory leak type bug would show up lots faster than that, though if the leak is a byte every 10 minutes or so... If the time period is consistent, I'd look for an event counter wrapping around and goofing up a comparison because it's no longer greater than it was.

If the timing of the failure is random, and just averages to about 2 days, I'd suspect a timing problem where it goofs up if event a happens before event b but almost always happens after b.

Actually I said several days, as in 2 to 4. :cry: It seems pretty random. The reason I started thinking stack overrun is that I added some function calls to write an ID# into EEPROM at certain points in the code, as a kind of poor man's trace. When I ran it, after a half hour some of the global variables (the last ones declared) changed their values. I couldn't continue the test.

What I really need is a way to emulate this failure.

Terry


You have to turn off interrupts during eeprom write. Maybe that threw a monkey wrench into the works?

Imagecraft compiler user


In your instance, I think the best course of action is an in-depth code review. Hopefully this may identify suspicious points in the design of the code. From what you've described, your bug may be one of the following:
1/ stack overflow
2/ critical variables - have shared variables (between main code and interrupt code) been handled correctly? As in the interrupts being disabled when reading them. These can cause irregular problems that are hard to track down.
3/ program logic.

During the review, identify critical points in the code and add in simple tests to try to narrow down the problem, like turning on an LED or inverting it.

As for a debugger - I tend to use the above method!


Hi,

I would strongly recommend a JTAGICE (Mk I or Mk II) for debugging your Mega32. However, it needs a few dedicated IO pins which might have already been allocated in the design.

Good Luck

Pete


bobgardner wrote:
You have to turn off interrupts during eeprom write. Maybe that threw a monkey wrench into the works?

Yikes! I just copied the way other data was written to EEPROM. I’ve been looking and can’t find any place where the interrupts were turned off before writing to EEPROM.

Terry


Hi,

In your shoes I would buy a JTAGICE mkII as suggested above (provided you can access the necessary I/Os), as this is a very good tool for debugging. If you use IAR's Embedded Workbench you can use it directly from there; otherwise the free AVR Studio is a good tool. After the fault occurs you can stop the system, check variable values, etc., and try to figure out what happened.

Otherwise you may have to go back to old-fashioned LED debugging, using an O-scope and some pattern generated at different places in the code on any available GPIO...

Check the application notes on Atmel's homepage for correct access to EEPROM.

I have used AVRs a lot and the JTAGICE is a tool I would not want to be without; definitely worth the money it costs.

Hope you find the bug.

BR Jimmy


Droidling wrote:
bobgardner wrote:
You have to turn off interrupts during eeprom write. Maybe that threw a monkey wrench into the works?

Yikes! I just copied the way other data was written to EEPROM. I’ve been looking and can’t find any place where the interrupts were turned off before writing to EEPROM.

Terry

Well, the sky may not be falling yet. EEPROM writes take several milliseconds/byte; interrupts do not need to be turned off during that entire time. See the datasheet:

Quote:

...The following procedure should be followed
when writing the EEPROM (the order of steps 3 and 4 is not essential):
1. Wait until EEWE becomes zero.
2. Wait until SPMEN in SPMCSR becomes zero.
3. Write new EEPROM address to EEAR (optional).
4. Write new EEPROM data to EEDR (optional).
5. Write a logical one to the EEMWE bit while writing a zero to EEWE in EECR.
6. Within four clock cycles after setting EEMWE, write a logical one to EEWE.
...
Caution: An interrupt between step 5 and step 6 will make the write cycle fail, since the
EEPROM Master Write Enable will time-out. If an interrupt routine accessing the
EEPROM is interrupting another EEPROM access, the EEAR or EEDR Register will be
modified, causing the interrupted EEPROM access to fail. It is recommended to have
the global interrupt flag cleared during the four last steps to avoid these problems. ...
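
The datasheet steps quoted above can be sketched in C roughly as follows. This is a host-compilable sketch, not CodeVision's own EEPROM code: the register variables are plain stand-ins declared here so it builds on a PC, `eeprom_write_byte_sketch` is a hypothetical name, and the bit numbers should be checked against the device header. Whether step 6 really lands within four clock cycles of step 5 depends on the code the compiler generates, which this sketch cannot guarantee.

```c
#include <stdint.h>

/* Host-side stand-ins for the AVR registers so the sketch compiles anywhere.
   On the target these names come from the compiler's device header. */
static volatile uint8_t  EECR, EEDR, SPMCSR, SREG;
static volatile uint16_t EEAR;

/* Bit numbers assumed from the datasheet; verify against the device header. */
#define EEWE   1
#define EEMWE  2
#define SPMEN  0
#define SREG_I 7

/* Note: with these stand-ins no hardware ever clears EEWE, so a second call
   on the host would spin forever in the first wait loop. */
void eeprom_write_byte_sketch(uint16_t addr, uint8_t data)
{
    uint8_t saved_sreg;

    while (EECR & (1u << EEWE)) { }      /* 1. previous write has finished */
    while (SPMCSR & (1u << SPMEN)) { }   /* 2. no self-programming pending */

    saved_sreg = SREG;                   /* remember the interrupt state   */
    SREG &= (uint8_t)~(1u << SREG_I);    /* stand-in for cli               */

    EEAR = addr;                         /* 3. address                     */
    EEDR = data;                         /* 4. data                        */
    EECR = (uint8_t)((EECR | (1u << EEMWE)) & ~(1u << EEWE)); /* 5. arm    */
    EECR |= (1u << EEWE);                /* 6. start the write             */

    SREG = saved_sreg;                   /* restore instead of a blind sei */
}
```

Restoring the saved status register, rather than unconditionally re-enabling interrupts, keeps the routine safe to call from code that already runs with interrupts off.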

Anyway, remember that EEPROM write cycles are limited. And IIRC you have a serial link in your app. I'd be dumping out the periodic info and capturing it on a PC in a big mother log file for days on end. Then start analyzing when the intermittent symptom(s) finally show up.

Lee

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.


Do you have the brown-out detector on? Murphy says you'll get a brown-out while waiting for the EEPROM to burn... curtains.....

Imagecraft compiler user


Perhaps it is possible to increase the speed of the system. If there is a programming error it may show up earlier/more often (at least double the speed for visible effect).

Martin


Kartman wrote:
In your instance, I think the best course of action is an in depth code review. Hopefully this may identify suspicious points in the design of the code. From what you've described - your bug may be one of the following:
1/ stack overflow
2/ critical variables - have shared variables (between main code and interrupt code) been handled correctly? As in the interrupts being disabled when reading them. These can cause irregular problems that are hard to track down.

3/ program logic.

During the review - identify critical points in the code and add in simple tests to try to narrow down the problem - like turn on a led or invert the led.

As for a debugger - I tend to use the above method!

Since two of you mentioned interrupts I thought I would look at them first. When I started working on this project I was told that CV would automatically shut off the interrupts, store registers, etc. when you enter an interrupt routine. If this isn’t the case I wonder how this code functions at all. Here is an ASM code snippet containing the simplest ISR.

; // Timer 0 overflow interrupt service routine
; interrupt [TIM0_OVF] void timer0_ovf_isr(void)
; {
_timer0_ovf_isr:
    ST -Y,R30
    ST -Y,R31
    IN R30,SREG
    ST -Y,R30
; free_run++;
    LDI R30,LOW(1)
    LDI R31,HIGH(1)
    __ADDWRR 5,6,30,31
; }
    LD R30,Y+
    OUT SREG,R30
    LD R31,Y+
    LD R30,Y+
    RETI

I don’t see where the interrupts are being disabled. Is this done in hardware when the interrupt occurs? I will add an #asm("cli") and #asm("sei") to the beginning and end of each C code ISR. Are you saying that I need to do the same before and after each statement that uses a variable changed in an ISR?

It’s looking like I need to go through the I/O and interrupts tutorial.

Thanks.

Terry


MartinNL wrote:
Perhaps it is possible to increase the speed of the system. If there is a programming error it may show up earlier/more often (at least double the speed for visible effect).

Martin

It's already running at 16 MHz

Terry


pstansfeld wrote:
Hi,

I would strongly recommend a JTAGICE (Mk I or Mk II) for debugging your Mega32. However, it needs a few dedicated IO pins which might have already been allocated in the design.

Good Luck

Pete

Those I/O pins are used as digital outputs. I could disconnect them for testing, but I assume I would also have to disable the code that sets them. If only this chip supported the debugWIRE interface. Is the ICE50 the only other choice for the ATMega32?

Terry


Triggering an interrupt will clear the I-Flag automatically (hardwired logic), equivalent to a 'cli'. The 'reti' - hardwired, too - will enable global ints again, equivalent to a 'sei'. This behaviour is common to all AVRs.

Andreas


Terry,
I dare say CV does the right thing with handling of the interrupt routines, but many programmers forget about the problem of shared variables - especially multi-byte variables between the isr code and the main line code. The other killer is testing then modifying variables shared with isr code - the interrupt can jump in between the two operations thus causing bizarre and very random behaviour that can happen very infrequently. These are the nastiest types of bugs!
So look for any variables shared between an isr and the main line code and check to see they're accessed correctly.

main code:

cli();
copy_of_isr_var = isr_var;
sei();
use copy of isr_var.....

or

cli();
do stuff with isr var
sei();

Obviously the idea is to disable the interrupts for the smallest amount of time - so don't go doing printfs etc with the interrupts turned off!!!


Droidling wrote:

Those I/O pins are used as digital outputs. I could disconnect them for testing, but I assume I would also have to disable the code that sets them. If only this chip supported the debugWIRE interface. Is the ICE50 the only other choice for the ATMega32?

Terry

Terry,

It does look like the ICE50 is the only other emulator choice for the Mega32. You might still be able to use the JTAGICE if you moved your digital outputs to other pins. Alternatively you could use a serial I/O port expander IC if the outputs are not too speed critical. Obviously this would need careful modification of the software and could introduce a bug or two but once you've got the JTAGICE working, debugging would be so much quicker and easier.

Pete


Kartman wrote:
Terry,
I dare say CV does the right thing with handling of the interrupt routines, but many programmers forget about the problem of shared variables - especially multi-byte variables between the isr code and the main line code. The other killer is testing then modifying variables shared with isr code - the interrupt can jump in between the two operations thus causing bizarre and very random behaviour that can happen very infrequently. These are the nastiest types of bugs!
So look for any variables shared between an isr and the main line code and check to see they're accessed correctly.

main code:

cli();
copy_of_isr_var = isr_var;
sei();
use copy of isr_var.....

or

cli();
do stuff with isr var
sei();

Obviously the idea is to disable the interrupts for the smallest amount of time - so don't go doing printfs etc with the interrupts turned off!!!

The consultant that wrote the code is arguing that it is not necessary to switch off interrupts unless you change a variable used in an interrupt routine. We were using this code as an example:

// Global Variables
//
register uint free_run;

void tdelay (uint msec)
{
uint iTemp;
iTemp= free_run + msec;
while (iTemp <= free_run);
while (iTemp >= free_run);
return;
}
// Timer 0 overflow interrupt service routine
interrupt [TIM0_OVF] void timer0_ovf_isr(void)
{
free_run++;
}

He says he wrote this so he wouldn't have to turn off interrupts during a time delay. I'm afraid I can't figure out the flaw in his logic. Is there an app note or section in the datasheet that explains what happens when you test or get the value of a variable at the same time an ISR is changing it?

I hope this isn't too much trouble. I really would like to know what is happening inside the processor.

Terry


If the vars written in the int handlers are 8 bits, then you don't need to 'enter a critical section' with ints off to read a copy in main. All those vars that get written in the int handlers are 'volatile' to the main program... they can change between any 2 assembler instructions! You must declare them as such to help the compiler.

Imagecraft compiler user


Quote:
I will add an #asm("cli") and #asm("sei") to the beginning and end of each C code ISR. Are you saying that I need to do the same before and after each statement that uses a variable changed in an ISR?

Terry,
Anytime you share variables across threads or interrupts you must protect them. Especially if they are greater than 8 bits!! Consider trying to read a 16-bit timer/counter that is updated asynchronously. Since the AVR can only read a byte at a time, anytime you do a comparison on the 16 bit number you have to do two read operations to get the entire value into your working registers. If the value is updated between the two reads you will get a bogus number. For example:

counter = 0x00FF;
copy = 0x00??; //your program starts to read it, MSB first....
counter -> 0x0100; //the interrupt occurs and counter increments
copy = 0x0000; //your program reads the LSB and now has a bogus value

Every read or write should be an atomic action. But if the variable is ONLY used in one thread or if it is ONLY used in threads that act cooperatively then this will not be an issue. But as soon as an interrupt can read or write a variable that any other piece of code has access to you must mask that interrupt when performing a read or write of the variable.
-Will
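
The torn read described above can be reproduced on a PC with a small simulation. This is a sketch under stated assumptions: the counter is modeled as two separate bytes, `fake_isr` is a hypothetical stand-in for the timer interrupt, and the `isr_fires` argument simply decides whether that "interrupt" lands between the two byte reads. On the target, the protected version corresponds to wrapping the copy in a cli/sei pair.

```c
#include <stdint.h>

/* Simulated 16-bit counter held as two bytes, the way an 8-bit CPU sees it. */
static uint8_t counter_hi, counter_lo;

/* Stand-in for the timer ISR: increments the 16-bit counter. */
void fake_isr(void)
{
    if (++counter_lo == 0) {
        counter_hi++;
    }
}

/* Unprotected read, high byte first. `isr_fires` models an interrupt
   landing between the two byte reads. */
uint16_t read_unprotected(int isr_fires)
{
    uint8_t hi = counter_hi;
    if (isr_fires) {
        fake_isr();
    }
    return (uint16_t)((hi << 8) | counter_lo);
}

/* Protected read: the interrupt is held off until both bytes are copied,
   which is what disabling interrupts around the copy achieves. */
uint16_t read_protected(int isr_pending)
{
    uint16_t snapshot = (uint16_t)((counter_hi << 8) | counter_lo);
    if (isr_pending) {
        fake_isr();   /* runs only after the copy is complete */
    }
    return snapshot;
}
```

Starting from 0x00FF, the unprotected read returns 0x0000 when the interrupt lands mid-read, while the protected read returns 0x00FF and the counter then advances to 0x0100 as expected.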


Further to what the other guys have said: it is a waste of time trying to declare a global variable as 'register'. I don't know of any compiler that will attempt to do this. I think the official C book says register only works within a subroutine and only if the compiler decides it can do it. Anyway, it is declared as an unsigned int, which we assume is a 16-bit variable. With interrupts, remember the CPU only executes one instruction at a time; the interrupt is sampled when a new instruction is to be fetched and, if an interrupt is active, the CPU will call that vector. With shared variables, think of it as being in the middle of something, going off to lunch, and your workmate coming into your office and changing things without you knowing. Using cli/sei is like locking your door so that no one else can fiddle. Also, apart from multibyte variables, the other killer is the following:

if (shared_interrupt_var)
{
shared_interrupt_var++;
}

Even if shared_interrupt_var is a byte, the interrupt can sneak in between the if test and the ++.
This can cause weird stuff to happen, just like you're describing.
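
A host-runnable sketch of that test-then-modify race, assuming a hypothetical ISR that clears the shared byte. The `irq` hook is an illustration device: it is called at the point where a real interrupt could arrive, so the interleaving can be reproduced deterministically on a PC.

```c
#include <stddef.h>
#include <stdint.h>

static volatile uint8_t pending;   /* shared between the "ISR" and main code */

/* Stand-in for an ISR that consumes the flag by clearing it. */
void fake_isr_clear(void)
{
    pending = 0;
}

/* Unprotected: the hook models an interrupt arriving after the test has
   passed but before the increment happens. */
void bump_if_set_unprotected(void (*irq)(void))
{
    if (pending) {
        if (irq != NULL) {
            irq();
        }
        pending++;
    }
}

/* Protected: test and increment form one critical section. */
void bump_if_set_protected(void (*irq)(void))
{
    /* cli would go here on the target */
    if (pending) {
        pending++;
    }
    /* sei would go here; a pending interrupt is serviced only now */
    if (irq != NULL) {
        irq();
    }
}
```

Starting with `pending = 1`, the unprotected version ends with `pending == 1`: the ISR cleared the flag, and the main code then incremented a value it believed was still set, undoing the clear. The protected version ends with `pending == 0`, since the ISR runs only after the test and increment are complete.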

This is a common problem with most computers, not specific to the AVR. Even on the PC you have critical sections and mutexes - same thing.


Quote:

it is a waste of time trying to declare a global variable as 'register' - I don't know of any compiler that will attempt to do this.

I cannot agree. CodeVision has well-defined rules for allocating global register variables, either automatically (first declared that fit) or manually ( first "register" variables that fit). I suspect the other compiler brands have similar rules.

Quote:

I think the official 'c' book says register only works within a subroutine and only if the compiler decides it can do it.

My semi-official C book says, in part: "register declares that the identifier is to be accessed as quickly as possible. [and] indicates that the identifier should be kept in a machine register. The translator [compiler], however, is not required to do this. It is a hint by the programmer to the translator, in the hope of obtaining more efficient code." AFAIK this follows the actual standard.

CodeVision uses the 'automatic' process for local variables--first declared that fit are implicitly register variables.

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.


There is a neat and tricky way to enregister a global var at the beginning of a subroutine... declare a local var 'register' and load the global at the top of the subroutine, save it back before exit. All references will be to the register copy.
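
A minimal sketch of that trick in plain C. The global `checksum` and the function `add_to_checksum` are hypothetical names chosen for illustration, and whether the local really ends up in a register is still up to the compiler.

```c
#include <stddef.h>
#include <stdint.h>

uint16_t checksum;   /* global, normally kept in RAM */

void add_to_checksum(const uint8_t *buf, size_t len)
{
    register uint16_t sum = checksum;    /* load the global once */
    size_t i;

    for (i = 0; i < len; i++) {
        sum = (uint16_t)(sum + buf[i]);  /* all work uses the local copy */
    }
    checksum = sum;                      /* save it back before returning */
}
```

This is not suitable for a variable that an ISR also writes: any update the ISR makes while the function runs would be overwritten by the save-back at the end.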

Imagecraft compiler user