debugging a crash without a hardware debugger - options?

I've found a 100% reproducible crash bug in my ATMEGA1284P project, where by "crash" I mean my custom AVR board either locks up, or appears to jump back to the beginning of main() and starts the program anew. It's a timing-sensitive issue while communicating with another device, so debugging in an emulator isn't an option. I'm also using ALL the pins, so connecting a hardware debugger isn't an option either. I do have an LCD display that I can print diagnostic info to, if I can catch an error in some kind of exception handler, maybe even write code to unwind the stack so I can try to see how it got there.

My best guess is that I've got a bug that overwrites RAM, corrupting the stack, and causing the AVR to jump to a bogus address when it pops a return address off the stack. I was hoping to add a handler for some kind of "invalid instruction" exception to help catch this, but from what I can see, the 8-bit AVR has no such concept.

Any good suggestions on ways to go about debugging something like this?

Other possible explanations:

1. Electrical problems, bad voltages, shorts, etc. Possible, but I strongly doubt it. I've been using this board for months, and except for this one case, I've never seen any flakiness.

2. Brown-out or watchdog timer. These are disabled (unless I've done it wrong). And sometimes the AVR doesn't actually reset, it just hangs, which is not what I'd expect from the BOD or watchdog.

3. Some other interrupt firing that I didn't think was enabled, and that there's no handler written for.

4. Heap colliding with the stack. avr-size says I'm using 14032 bytes of RAM in my data segment (of 16K total), there's no dynamic memory allocation, and the stack shouldn't use more than a couple of hundred bytes.

Any ideas or suggestions of things to try will be appreciated. Thanks!
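On possibility 3: with avr-gcc and avr-libc, an enabled interrupt that has no handler ends up in the default `__bad_interrupt` stub, which jumps to the reset vector. From the outside that looks just like the program starting over from main(). Below is a minimal sketch of a catch-all handler, assuming avr-libc. The names `note_bad_interrupt` and `bad_irq_seen` are made up for illustration, and the LCD output is left as a comment because the display routines are not shown in this thread.

```c
#include <stdint.h>

/* Set when an interrupt fires that has no dedicated handler. */
static volatile uint8_t bad_irq_seen = 0;

/* Hypothetical helper: record that an unexpected interrupt occurred. */
static void note_bad_interrupt(void)
{
    bad_irq_seen = 1;
}

#ifdef __AVR__
#include <avr/interrupt.h>

/* avr-libc routes every vector without its own ISR() to this one. */
ISR(BADISR_vect)
{
    note_bad_interrupt();
    /* Print a marker to the LCD here, then park so it stays visible. */
    for (;;) { }
}
#endif
```

If the board now stops in this handler instead of restarting, the next step is to work out which interrupt enable bit is set without a matching ISR.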

Quote:

, but from what I can see, the 8-bit AVR has no such concept.

You are right.

I'd maybe implement a fairly fast ticker interrupt (how fast depends on the sampling rate you want to try for) that simply sniffs SP each time and checks whether it gets down anywhere near the end of .bss (or the highest thing your compiler places in RAM, if it isn't GCC).

EDIT: Sorry, I see the avr-size reference, so the end of .bss is the limit for SP.
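The SP-sniffing ticker could be sketched roughly as follows, assuming avr-gcc with avr-libc. The names `stack_watch_sample`, `sp_min`, `sp_alarm` and the 32-byte margin are illustrative choices, the timer setup is not shown, and `__bss_end` is the symbol the avr-gcc linker scripts provide for the end of .bss. The sampling logic is kept in a plain function so it can be checked off-target.

```c
#include <stdint.h>

#define STACK_MARGIN 32u   /* arbitrary safety margin (assumption) */

/* Lowest stack pointer value seen so far; AVR stacks grow downward. */
static volatile uint16_t sp_min = 0xFFFF;
/* Set once SP has come within STACK_MARGIN bytes of the limit. */
static volatile uint8_t sp_alarm = 0;

/* Record one SP sample against the lowest address the stack may reach. */
static void stack_watch_sample(uint16_t sp, uint16_t limit)
{
    if (sp < sp_min) {
        sp_min = sp;
    }
    if (sp < limit + STACK_MARGIN) {
        sp_alarm = 1;
    }
}

#ifdef __AVR__
#include <avr/io.h>
#include <avr/interrupt.h>

extern uint8_t __bss_end;   /* linker symbol: first address past .bss */

/* Any fast periodic interrupt will do; Timer0 compare A is one choice. */
ISR(TIMER0_COMPA_vect)
{
    stack_watch_sample(SP, (uint16_t)&__bss_end);
}
#endif
```

Sampling only sees the stack depth at the instants the interrupt fires, so a short, deep excursion between ticks can still be missed.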

Another possible time-dependent problem is a 16-bit variable that is shared between an ISR and main code.

You pick up the first 8 bits and are interrupted; the ISR changes the 16-bit value; then your main code picks up the other 8 bits and so ends up with inconsistent data.
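For example, if the ISR steps a counter from 0x00FF to 0x0100 between the two byte reads, main code can see 0x0000 or 0x01FF, depending on which byte it fetched first. A minimal sketch of a protected read, assuming avr-libc's `<util/atomic.h>`; `tick_count`, `read_tick_count` and the `CRITICAL` wrapper macro are hypothetical names:

```c
#include <stdint.h>

#ifdef __AVR__
#include <util/atomic.h>
/* Disable interrupts for the block, then restore the previous state. */
#define CRITICAL ATOMIC_BLOCK(ATOMIC_RESTORESTATE)
#else
#define CRITICAL   /* host build: there are no interrupts to mask */
#endif

/* Hypothetical counter written by an ISR and read by main code. */
static volatile uint16_t tick_count;

/* Return a copy whose two bytes come from the same update. */
static uint16_t read_tick_count(void)
{
    uint16_t copy;
    CRITICAL {
        copy = tick_count;
    }
    return copy;
}
```

Main code then calls `read_tick_count()` wherever it previously read the variable directly. Writes from main code to a variable the ISR reads need the same protection.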

regards
Greg

bigmessowires wrote:

My best guess is that I've got a bug that overwrites RAM, corrupting the stack, and causing the AVR to jump to a bogus address when it pops a return address off the stack. I was hoping to add a handler for some kind of "invalid instruction" exception to help catch this, but from what I can see, the 8-bit AVR has no such concept.

That's not so much an invalid instruction as an incorrect return address.

If you are not down to your last byte/cycle of code, one easy sanity-protector is to protect all array writes with a modulus on the size of the destination.

For arrays whose size is a power of two, this should cost just an AND, and then you cannot write outside the array.
This keeps any index errors local, where they are less damaging.
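A minimal sketch of the masked write, assuming a 16-byte buffer; `rx_buf`, `rx_store` and `rx_index_errors` are made-up names. The error counter is an addition not mentioned above: masking alone hides a bad index, so counting the out-of-range cases gives something to show on the LCD.

```c
#include <stdint.h>

#define RX_BUF_SIZE 16u                  /* must be a power of two */
#define RX_BUF_MASK (RX_BUF_SIZE - 1u)

static uint8_t rx_buf[RX_BUF_SIZE];
/* Number of writes whose index was out of range before masking. */
static uint8_t rx_index_errors = 0;

/* Store a byte; an out-of-range index wraps around inside the buffer. */
static void rx_store(uint8_t index, uint8_t value)
{
    if (index >= RX_BUF_SIZE) {
        rx_index_errors++;
    }
    rx_buf[index & RX_BUF_MASK] = value;
}
```

With this in place, a bad index corrupts only the buffer's own contents instead of a neighbouring variable or the stack.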

As Greg suggests, atomicity issues are a common problem, and they tend to appear randomly. The things to look for are shared multibyte variables, read/modify/write operations on variables of any size, and access to structures or arrays. E.g. a time structure where an ISR updates the time/date: since you might have variables like secs/mins/hours etc., you need to read or write these as an atomic operation, i.e. 'as one'. This can mean disabling interrupts for a short time or arranging some other mechanism to ensure it. Google 'Therac-25' for an example of these problems.

Ensure array indexes are within bounds, pointers as well. Avoid using RAM-based function pointers. Apply MISRA rules to your code. Perform a code review; this would be my first step, trying to identify suspect code in light of the above suggestions.
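One possible "other mechanism" for the time-structure case is to re-read until the copy is known to be consistent. This is a sketch under stated assumptions: a single-core AVR where the ISR is the only writer and always runs to completion before main code resumes. All names (`clock_time`, `now`, `now_version`, `clock_tick`, `clock_read`) are hypothetical.

```c
#include <stdint.h>

struct clock_time { uint8_t hours, minutes, seconds; };

static volatile struct clock_time now;
static volatile uint8_t now_version;   /* bumped by the ISR on every update */

/* Body of a hypothetical once-per-second timer ISR. */
static void clock_tick(void)
{
    uint8_t s = now.seconds, m = now.minutes, h = now.hours;
    if (++s == 60) {
        s = 0;
        if (++m == 60) {
            m = 0;
            if (++h == 24) {
                h = 0;
            }
        }
    }
    now.seconds = s;
    now.minutes = m;
    now.hours = h;
    now_version++;
}

/* Copy the time with interrupts enabled; retry if the ISR ran mid-copy. */
static struct clock_time clock_read(void)
{
    struct clock_time copy;
    uint8_t before;
    do {
        before = now_version;          /* single-byte read, atomic on AVR */
        copy.hours = now.hours;
        copy.minutes = now.minutes;
        copy.seconds = now.seconds;
    } while (before != now_version);
    return copy;
}
```

The retry loop avoids masking interrupts, which can matter when interrupt latency is tight. Disabling interrupts around the copy is the simpler choice when that latency is acceptable.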

Thanks for the many great suggestions! I think the atomicity or shared 16-bit variable theories could well be correct - there are several interrupt handlers in the code, operating on shared data. I've fixed a few errors of this sort already, but maybe I overlooked one.

Quote:
ways to go about debugging something like this?

Make it run on a bigger chip, so that you can use a debugger, use extra pins, etc. An Arduino Mega 2560 can be had at retail brick-and-mortar stores for less than $100.

(This has a possible added benefit: in the process of making the code general enough to run on a second processor, you may find a "duh!" problem in the existing code.)

Some tests to do.
If it's a reset, you can keep an EEPROM counter that is incremented on each boot.
To see how deep the stack gets, it's common to fill all unused RAM (the area the stack will grow into) with a pattern, so you can see how deep the stack has been (or where a stray pointer has written).
I have had a "lowest SP seen" variable in my timer interrupts (the stack grows down) that logs the stack pointer. Then I can see if the stack slowly grows.

Added:
Set up a port-pin combination that makes a reset produce a memory dump, so you can do a semi-warm reset.
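The pattern-fill idea can be sketched as two small helpers. This is an illustration only: `stack_paint`, `stack_untouched` and the 0xC5 pattern are made-up choices, and the helpers take plain pointers so they can be tried on an ordinary array. On the target, `low` would be the end of .bss and `high` would sit a little below the current stack pointer, with the paint call made early in main().

```c
#include <stddef.h>
#include <stdint.h>

#define STACK_PAINT 0xC5u   /* arbitrary fill pattern (assumption) */

/* Fill the free RAM from `low` (inclusive) up to `high` (exclusive). */
static void stack_paint(volatile uint8_t *low, volatile uint8_t *high)
{
    while (low < high) {
        *low++ = STACK_PAINT;
    }
}

/* Count the bytes, starting at `low`, that still hold the pattern. */
static size_t stack_untouched(const volatile uint8_t *low,
                              const volatile uint8_t *high)
{
    size_t n = 0;
    while (low < high && *low == STACK_PAINT) {
        low++;
        n++;
    }
    return n;
}
```

After the crash has been reproduced, the untouched count is the headroom the stack never used. A value near zero points to a stack that has reached the data section, or to a stray write at the bottom of the painted area.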

Oh, and check the MCUSR (or whatever it's called) value each time you "restart", then set it back to 0. Either output it to the LCD, or log it to EEPROM each time and analyze later.
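A rough sketch of that check, assuming the ATmega1284P, where the register is indeed named MCUSR and holds the flags PORF, EXTRF, BORF, WDRF and JTRF in bits 0 to 4. The function names are hypothetical. Several flags can be set at once (typically after power-up), and this version reports only the first match.

```c
#include <stdint.h>

/* MCUSR bit masks on the ATmega1284P. */
#define RST_POWER_ON  0x01u   /* PORF  */
#define RST_EXTERNAL  0x02u   /* EXTRF */
#define RST_BROWN_OUT 0x04u   /* BORF  */
#define RST_WATCHDOG  0x08u   /* WDRF  */
#define RST_JTAG      0x10u   /* JTRF  */

/* Turn a saved MCUSR value into a short label for the LCD. */
static const char *reset_cause(uint8_t mcusr)
{
    if (mcusr & RST_POWER_ON)  return "power-on";
    if (mcusr & RST_WATCHDOG)  return "watchdog";
    if (mcusr & RST_BROWN_OUT) return "brown-out";
    if (mcusr & RST_EXTERNAL)  return "external";
    if (mcusr & RST_JTAG)      return "JTAG";
    return "none";
}

#ifdef __AVR__
#include <avr/io.h>

/* Call first thing in main(): fetch the flags, then clear them. */
static uint8_t read_and_clear_mcusr(void)
{
    uint8_t flags = MCUSR;
    MCUSR = 0;
    return flags;
}
#endif
```

If the flags were cleared on the previous start and the result after a "restart" is "none", no hardware reset took place. The program reached the reset vector by a jump, which fits a corrupted return address or an unhandled interrupt better than the brown-out detector or the watchdog.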