What kind of bug can cause main to restart?

I have been trying to find a bug in my code, but I'm struggling to find what is going on. Now I was able to find through debugging that the software restarts from the beginning of main all the time (after initialization code, a few interrupts and main loop). The RST.STATUS shows no reset causes (after clearing the flags just after the start of debugging) thus the device is not booting, just the software restarting. Also the counters (TCC0/1) are not reset. I have checked that there is a lot of free space for the stack. There are two timer overflow interrupts (25 and 500 Hz). Sometimes the restart happens right away and sometimes after a few seconds.
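
The RST.STATUS check described above can be sketched like this. It is a minimal sketch, not the poster's actual code: had_hardware_reset and reset_cause_name are hypothetical helper names, and the fallback mask values used for off-target builds are assumptions that should be checked against the device header. On the target, the value passed in would be read from RST.STATUS early in main.

```c
#include <stdint.h>

#ifdef __AVR__
#include <avr/io.h>            /* provides the RST_*_bm masks on XMEGA parts */
#else
/* Assumed stand-in values for host builds; verify against the device header. */
#define RST_PORF_bm  0x01      /* power-on reset */
#define RST_EXTRF_bm 0x02      /* external reset pin */
#define RST_BORF_bm  0x04      /* brown-out reset */
#define RST_WDRF_bm  0x08      /* watchdog reset */
#define RST_PDIRF_bm 0x10      /* reset through the PDI (programmer/debugger) */
#define RST_SRF_bm   0x20      /* software reset */
#endif

#define ALL_RESET_FLAGS (RST_PORF_bm | RST_EXTRF_bm | RST_BORF_bm | \
                         RST_WDRF_bm | RST_PDIRF_bm | RST_SRF_bm)

/* Nonzero if any of the reset flags above is set in the status byte. */
int had_hardware_reset(uint8_t status)
{
    return (status & ALL_RESET_FLAGS) != 0;
}

/* Name of the first reset cause found, or "none" if no flag is set. */
const char *reset_cause_name(uint8_t status)
{
    if (status & RST_PORF_bm)  return "power-on";
    if (status & RST_EXTRF_bm) return "external";
    if (status & RST_BORF_bm)  return "brown-out";
    if (status & RST_WDRF_bm)  return "watchdog";
    if (status & RST_PDIRF_bm) return "PDI";
    if (status & RST_SRF_bm)   return "software";
    return "none";
}
```

If the status stays at zero after the flags were cleared and main still starts again, execution most likely got back to address 0 without a real reset, for example through a stray jump or a corrupted return address.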

 

Now I even get the device into a state where one counter (TCC0) doesn't count anymore, although the IO view shows all the TCC0 registers unchanged. TCC1 still runs. I was also able to stop the debugger just before I think it would have restarted. The program counter was at 0x63D2 and the disassembly showed "memory out of bounds" or "read error". The address at the top of the window showed _vector_20 and main shows the cursor at the end of TCC1_OVF_vect.

 

The software is in C, but while debugging, execution spends a lot of time in assembler, which I don't know that well, so I can't really follow what is happening. I use AS 6.2 and the avrgcc 3.4.1.95 that came with it, but I compile from the command line. The device is an xmega32a4 on a PCB I designed, which has just the device and a regulator. I have used it for testing different things.

Frustrating problem!

 

You already mentioned several of the common causes.

 

Stack overflow is high on the list.

 

An interrupt that doesn't have a pointer to the ISR is another common cause.

 

Recursion can easily overflow the stack.

 

A missing main loop can cause this as well.

 

Strings and arrays that are too long for the memory allocated to them can crash the system.

 

Hardware issues can also cause this:

  Vcc and AVcc must both be tied to V+.

  All Grounds tied to ground.

  By-Pass caps, 0.1 uF across each Vcc and AVcc to Ground.

  Power supply with noise or near dead battery

  EMI (transmitted (RF) or conducted) can cause erratic operation, but it doesn't sound like you have much hooked up to the PCB yet.

 

Can you run a simple flash-the-LED program without any problems?

Can you run with both ISRs disabled (obviously not a functional program)?

Can you disable the ISRs one at a time and see if the system is still erratic?

 

Have you run other projects on this PCB without problems?

 

Try first to prove that the hardware, (uC, power supply), work fine, and are stable, with smaller, (trivial) programs running.

 

Then simplify your program and get the core working properly.

 

Then add the interrupts, one at a time, and slowly grow the system.
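
For the case of an interrupt without an ISR mentioned above: avr-libc sends every vector that has no handler to a default that jumps to the reset vector, so main starts over without any reset flag being set. A minimal sketch of a catch-all handler follows, assuming avr-libc's ISR(BADISR_vect); the flag and the two helper functions are hypothetical names added for illustration.

```c
#include <stdint.h>

#ifdef __AVR__
#include <avr/interrupt.h>
#endif

/* Hypothetical flag: set when an interrupt fires that has no handler. */
static volatile uint8_t bad_interrupt_seen;

/* Kept as a plain function so the logic can also be exercised off-target. */
void note_bad_interrupt(void)
{
    bad_interrupt_seen = 1;
}

uint8_t bad_interrupt_was_seen(void)
{
    return bad_interrupt_seen;
}

#ifdef __AVR__
/* Catch-all for every enabled interrupt that has no ISR() of its own.
 * Without it, the avr-libc default jumps to the reset vector. */
ISR(BADISR_vect)
{
    note_bad_interrupt();   /* a breakpoint here catches the stray interrupt */
}
#endif
```

With a breakpoint in the handler, a stray interrupt stops in the debugger instead of silently restarting the program.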

 

JC

 

Edit: Forgot to add thermal overload of the power supply regulator to the list...

Last Edited: Fri. Apr 17, 2015 - 12:44 AM

"The device is xmega32a4 on a PCB I have designed, but it has just the device and a regulator"

Oops, you may have answered your own question ;-) No caps? Only the regulator and the chip? Are ALL VCC, GND and AVCC pins properly connected?

Care to share your circuit diagram?

John Samperi

Ampertronics Pty. Ltd.

www.ampertronics.com.au

* Electronic Design * Custom Products * Contract Assembly

I have been using this very same PCB for testing different things for 2-3 years. I have an almost identical one with quite a lot of stuff (GLCD, SD card, external ADC, several UARTs etc.) running 24/7 for about four years without a problem.

 

This one is just an almost empty PCB. It has a quite solid ground plane on the other side, all gnd pins connected, all vcc pins have their own 100 nF ceramic and Vcc is stable 3.3 V viewed with a scope. 5 V is provided to the regulator on PCB from the lab power. USB serial is reading the UART TX (no RX).

 

EMI doesn't seem likely, since I can repeat the same pattern of failures by restarting the debugging or also without debugging and just viewing from the scope what the device is doing. Using different break points changes the pattern.

 

I have another version of this software and it ran just fine on the same PCB. Then I got some weird results when I made the other version more similar to this failing one. I decided to try this now failing version, which I had marked as a properly working version for this very same PCB a year ago. That may have been with an older avrgcc version.

 

I use a routine that paints the whole RAM with 0xC5. A lot of that is still left when the failure occurs. It could still be a corruption at a random address, but it is not a plain stack overflow.
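
The RAM painting check can be sketched roughly as below. This is an assumed illustration, not the routine from the project: paint_region and untouched_from_bottom are hypothetical names, and on the target the region would be the RAM between the end of the static data and the stack. One caveat is that a byte the program legitimately wrote as 0xC5 is counted as untouched.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAINT_BYTE 0xC5u   /* paint pattern from the post */

/* Fill a RAM region with the paint pattern. */
void paint_region(uint8_t *start, size_t len)
{
    assert(start != NULL);
    for (size_t i = 0; i < len; i++)
        start[i] = PAINT_BYTE;
}

/* Count still-painted bytes from the low end of the region.
 * The stack grows down from the high end, so this run is the
 * headroom that the stack never reached. */
size_t untouched_from_bottom(const uint8_t *start, size_t len)
{
    assert(start != NULL);
    size_t n = 0;
    while (n < len && start[n] == PAINT_BYTE)
        n++;
    return n;
}
```

A large untouched run rules out a simple overflow, but a single stray write to an unrelated address would not show up in this count.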

Just checking:  AVcc is also tied to V+?

 

Also, can you measure how much current the PCB is drawing?

Doesn't need to be exact, and I'm not sure how many mA it should draw, (probably less than 2-3 mA).

If it is drawing 25 mA, then that is a problem, (short circuit, output pin tied to ground, lots of flux remaining on the PCB, or under the chip, etc.)

 

JC

 

Edit:  Does a simple program (LED flasher) run properly for hours?

Just about any software mistake can cause it to reset. Many hardware mistakes too.

 

Divide and Conquer! Try temporarily disabling one major function at a time. Like comment out a call or something. The program may not work this way, but if it stops resetting, you can suspect, though sadly, not guarantee, that the error lies in the bit of code you're no longer doing. 

 

Or: Start with a fresh, empty project and copy/paste one function at a time into the new project. When it stops working, you've found the problem and can probably fix it in the original project.

The largest known prime number: 2^82589933-1

In my humble opinion, I'm always right. 

Last Edited: Fri. Apr 17, 2015 - 01:28 AM

Yes, AVcc is also tied to Vcc and has a 100 nF cap. The PCB always draws less than 10 mA, which is the resolution of my lab power supply. Measured with a multimeter, it now took 5.5 mA. That's about what it should be (2 MHz, ADC, DAC).

 

The other software ran OK for hours.

I can remove just one function call and then everything seemingly works OK. The problem is that the function call can't be the culprit per se. It is just a very simple subroutine doing pin manipulation on one pin of one port, but it does it by calling other subroutines. Forceup(cycles) is the routine called from the main loop at 25 Hz. If I move all this code to a different position in the file, the error changes considerably. If I just remove the call, all seems to work. If I call just Forceup_2cyc() in place of Forceup(cycles), the program seems to work, BUT nothing is seen on the scope on that pin. If I just move the code inside Forceup_2cyc() to the place it's called from, it works just fine.

 

So there must be an error somewhere else? I guess moving code in a file changes memory locations and the code optimization?

void Forceup_512cyc(void)
{
  cli();
  PORTA_DIRSET=PIN3_bm;
  _delay_us(2*510e6/F_CPU);
  PORTA_DIRCLR=PIN3_bm;
  sei();
}

void Forceup_1cyc(void)
{
  cli();
  PORTA_DIRSET=PIN3_bm;
  //PORTA_DIRSET=PIN3_bm;
  //PORTA_DIRSET=PIN3_bm;
  //  PORTA.PIN3CTRL = PORT_OPC_PULLUP_gc; // 2 cycles each
  //asm("nop");
  //asm("nop");
  PORTA_DIRCLR=PIN3_bm;
  sei();  
}

void Forceup_2cyc(void)
{
  cli();
  PORTA_DIRSET=PIN3_bm;
  asm("nop");
  asm("nop");
  PORTA_DIRCLR=PIN3_bm;
  sei();
}

void Forceup_4cyc(void)
{
  cli();
  PORTA_DIRSET=PIN3_bm;
  asm("nop");
  asm("nop");
  asm("nop");
  asm("nop");
  asm("nop");
  asm("nop");
  PORTA_DIRCLR=PIN3_bm;
  sei();
}

void Forceup_8cyc(void)
{
  cli();
  PORTA_DIRSET=PIN3_bm;
  asm("nop");
  asm("nop");
  asm("nop");
  asm("nop");
  asm("nop");
  asm("nop");
  asm("nop");
  asm("nop");
  asm("nop");
  asm("nop");
  asm("nop");
  asm("nop");
  asm("nop");
  asm("nop");
  PORTA_DIRCLR=PIN3_bm;
  sei();
}

void Forceup_16cyc(void)
{
  cli();
  PORTA_DIRSET=PIN3_bm;
  asm("nop");
  asm("nop");
  asm("nop");
  asm("nop");
  asm("nop");
  asm("nop");
  asm("nop");
  asm("nop");
  asm("nop");
  asm("nop");
  asm("nop");
  asm("nop");
  asm("nop");
  asm("nop");
  asm("nop");
  asm("nop");
  asm("nop");
  asm("nop");
  asm("nop");
  asm("nop");
  asm("nop");
  asm("nop");
  asm("nop");
  asm("nop");
  asm("nop");
  asm("nop");
  asm("nop");
  asm("nop");
  asm("nop");
  asm("nop");
  PORTA_DIRCLR=PIN3_bm;
  sei();
}

void Forceup(uint16_t cycles)
{
  uint16_t i;
  if(cycles>=512)
    {
      i=cycles/512;
      cycles-=i*512;
      for(;i>0;i--)
	Forceup_512cyc();
    }
  if(cycles>=16)
    {
      i=cycles/16;
      cycles-=i*16;
      for(;i>0;i--)
	Forceup_16cyc();
    }
  if(cycles>=8)
    {
      i=cycles/8;
      cycles-=i*8;
      for(;i>0;i--)
	Forceup_8cyc();
    }
  if(cycles>=4)
    {
      i=cycles/4;
      cycles-=i*4;
      for(;i>0;i--)
	Forceup_4cyc();
    }
  if(cycles>=2)
    {
      i=cycles/2;
      cycles-=i*2;
      for(;i>0;i--)
	Forceup_2cyc();
    }
  if(cycles)
    Forceup_1cyc();
}
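
To check the arithmetic in Forceup() separately from the hardware, the same decomposition can be run on a PC. The sketch below mirrors the division and subtraction steps of the posted routine but only counts the calls instead of toggling the pin; forceup_plan and pulse_len are hypothetical names introduced for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pulse lengths handled by Forceup(), in the order it tries them. */
static const uint16_t pulse_len[6] = { 512, 16, 8, 4, 2, 1 };

/* Fill calls[k] with how many times the routine for pulse_len[k]
 * would be called for the given cycle count.
 * Returns the total number of calls. */
uint16_t forceup_plan(uint16_t cycles, uint16_t calls[6])
{
    uint16_t total = 0;
    assert(calls != NULL);
    for (int k = 0; k < 6; k++) {
        calls[k] = cycles / pulse_len[k];
        cycles  -= calls[k] * pulse_len[k];
        total   += calls[k];
    }
    return total;
}
```

For cycles = 1000 this gives one 512 call, thirty 16 calls and one 8 call. As the posted code reads, each of those calls drives the pin and releases it again, so the pin sees 32 separate pulses rather than one long one.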

 

I also tried different optimization options. The normal -Os didn't work, -O2 didn't work (but behaves differently), and -O3 seems to work just fine. I can't really see how the optimization would remove the error (unless it is a compiler bug). It probably just moves it to a less harmful memory area etc.

I looked at the memory of this working -O3 compiled version with the debugger. The painted 0xc5 is uniform from 0x22c3 to 0x2f77. I was able to break the debugging in a spot when the Stack Pointer was 0x2f77. This happens all the time in ISR(USARTC0_DRE_vect), which is a part of the uart code that has been used in several other projects without problems.

 

However, there is also a region of 0xc5 from 0x2fbc to 0x2fd3. Is this normal? An area in the middle of the stack that has not been changed? Are the local variables here?

 

This chip has 4096 bytes of RAM, so the stack starts from 0x2fff?
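
The addresses above can be turned into byte counts with a little arithmetic. The sketch below assumes that the internal SRAM starts at 0x2000, which together with 4096 bytes puts the last RAM address at 0x2FFF; stack_bytes_used and range_bytes are hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>

#define SRAM_START 0x2000u   /* assumed SRAM base address on this device */
#define SRAM_SIZE  4096u     /* RAM size quoted in the post */
#define RAM_END    (SRAM_START + SRAM_SIZE - 1u)   /* 0x2FFF */

/* Bytes in use between the stack pointer and the top of RAM.
 * The AVR stack pointer points at the next free byte, so the
 * address held in sp itself is not counted. */
uint16_t stack_bytes_used(uint16_t sp)
{
    assert(sp >= SRAM_START && sp <= RAM_END);
    return (uint16_t)(RAM_END - sp);
}

/* Size of an address range with both ends included. */
uint16_t range_bytes(uint16_t first, uint16_t last)
{
    assert(last >= first);
    return (uint16_t)(last - first + 1u);
}
```

With the numbers from the post, a stack pointer of 0x2F77 means 136 bytes of stack in use, the painted region 0x22C3 to 0x2F77 is 3253 bytes, and the untouched gap 0x2FBC to 0x2FD3 is 24 bytes. One possible explanation for such a gap is a local buffer that a function reserved on the stack but did not fully write.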

I don't read C or Asm, and this sounds like a software problem, so I should bow out.

 

But I'll just say this could also be an ISR problem, where one ISR interrupts another and changes a variable that the first ISR or the main code was in the middle of using.

 

Go through the ISRs and look carefully at EVERY variable they use.

If the variable is used anywhere else in the program, then make sure the ISR can't change it when the main program, or another ISR, is changing the value.

 

Make sure that all of the variables that should be declared volatile are declared so.
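
A minimal sketch of the shared-variable advice above, assuming avr-libc's ATOMIC_BLOCK from util/atomic.h on the target. The counter and function names are hypothetical, and the host stand-in macros only exist so the sketch compiles off-target; they do no locking. On an 8-bit AVR a 16-bit variable is read in two steps, so an interrupt between the two can hand the main code a torn value.

```c
#include <stdint.h>

#ifdef __AVR__
#include <avr/interrupt.h>
#include <util/atomic.h>
#else
/* Host stand-ins so the sketch compiles off-target; no real locking. */
#define ATOMIC_RESTORESTATE 0
#define ATOMIC_BLOCK(mode) for (int once_ = 1; once_; once_ = 0)
#endif

/* Hypothetical counter shared between a timer ISR and the main loop.
 * volatile keeps the compiler from caching it in a register. */
static volatile uint16_t tick_count;

/* Body of the hypothetical timer ISR; on the target this would be
 * called from (or placed inside) the overflow interrupt handler. */
void on_timer_overflow(void)
{
    tick_count++;
}

/* Read the counter from main code with interrupts held off, so both
 * bytes come from the same value. */
uint16_t read_tick_count(void)
{
    uint16_t copy = 0;
    ATOMIC_BLOCK(ATOMIC_RESTORESTATE) {
        copy = tick_count;
    }
    return copy;
}
```

The same pattern applies to writes from the main code and to any multi-byte structure that an ISR also touches.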

 

The somewhat random occurrence of the reset sounds like what happens when an ISR just happens to fall at exactly the wrong time.

 

Tracking down just such a bug on a GPS/GLCD project is exactly what made me finally break down and buy a logic analyzer and a DSO a few years ago...

 

These are frustrating bugs, but they can be found!

 

JC

All the interrupts are low level, thus they shouldn't interrupt each other as far as I know.

 

I had a problem with an interrupt a few years ago and it turned out to be a bug in the WinAVR compiler. That version of avrgcc turned on the interrupts one step too early (OK for standard AVR, not for xmega), which resulted in stack pointer corruption if an interrupt happened just at that time. It was a very nasty one, since the software could run OK for days and then some variable would get corrupted etc. It also didn't ruin the whole software, since the stack pointer returned to the correct value after enough subroutines returned. https://www.avrfreaks.net/forum/d...

I just updated to the latest avrgcc that AS6.2 offered (from avr-gcc --version: avr-gcc.exe (AVR_8_bit_GNU_Toolchain_3.4.4_1162) 4.8.1). With this, everything seems to work even with -Os. The older compiler was avr-gcc.exe (AVR_8_bit_GNU_Toolchain_3.4.1_830) 4.6.2.

So how can I know whether I just got lucky or this was again a compiler bug?