Software bug causes AVR hardware error?!?!?


I've been programming for around 35 years in total (a huge amount in machine code), and in all that time I haven't come across something quite as odd as what is happening with some assembler AVR code I am working on.

To cut a long story short, there are two sets of code, set A, which works fine, as normal, and then there is set B, which I developed from A, which works fine, UNTIL I remove and re-apply power from the ATMega device a few times, and then it stops working.

I have dumped out the Flash contents both before and after the device stops working, and they are both identical! (I have already made sure the brown-out detector is enabled to prevent any possible FLASH corruption.)

I was always under the impression that a device would start from power up into a known state, so if the Flash is not corrupted, why does the software fail after so many restarts when it worked fine before?

I can't find any registers or hardware which start in an unknown state which would affect the code.

I have started to painstakingly remove and replace sections of code to isolate this bug, which has me worried about the design of the AVR. Not only do I now have to test my code, but after a development phase, repeatedly remove and re-apply power to the device to ensure that it is stable.

Is anyone aware of any known hardware bugs such as the execution of a certain sequence of instructions or maybe interrupt or other hardware glitches which could cause an AVR to enter a confused state which a power off - on would not clear?

If I ever do find the culprit of course I will be delighted to post it on here so others can see if they can subject it to empirical scientific analysis!

In the meantime I will continue to beaver away using a process of elimination - just hope that I don't go mad in the meantime...

Z.


Have you tried diffing B from A? The bad code will be in the diff set somewhere.


welshzadok wrote:
UNTIL I remove and re-apply power from the ATMega device a few times, and then it stops working.

You need to enable the internal brownout reset or connect an external undervoltage reset circuit.

Peter


welshzadok wrote:
I have already made sure the brown-out detector is enabled to prevent any possible FLASH corruption.
Did you enable it correctly? You have to take the AVR clock speed vs. the Vcc level into account. If the Brown Out Detector (BOD) threshold allows Vcc to sag below the minimum voltage required at a given AVR clock speed, then the BOD doesn't provide any protection. Some AVR chips where the highest BOD threshold is lower than the published minimum Vcc (usually only at maximum AVR clock speed) have a note explaining they were tested for this condition and the BOD still works.
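
To put rough numbers on the clock-versus-Vcc point, here is a small C sketch that runs on a PC. It assumes the speed-grade curve given in the ATmega48/88/168/328 datasheets (4 MHz at 1.8 V, 10 MHz at 2.7 V, 20 MHz at 4.5 V, linear in between); some variants only go down to 2.7 V, so check the datasheet for your exact part. The function names are made up for illustration.

```c
#include <assert.h>

/* Assumed speed-grade curve (ATmega48/88/168/328 family datasheets):
   4 MHz at 1.8 V, 10 MHz at 2.7 V, 20 MHz at 4.5 V, linear in between.
   Verify against the datasheet of the part you actually use. */
double max_safe_mhz(double vcc)
{
    assert(vcc >= 0.0);
    if (vcc < 1.8) return 0.0;                                /* below operating range */
    if (vcc < 2.7) return 4.0 + (vcc - 1.8) * (6.0 / 0.9);    /* 4 MHz up to 10 MHz */
    if (vcc < 4.5) return 10.0 + (vcc - 2.7) * (10.0 / 1.8);  /* 10 MHz up to 20 MHz */
    return 20.0;
}

/* Nonzero if the BOD trips while Vcc is still within spec for this clock. */
int bod_protects(double bod_volts, double clock_mhz)
{
    return max_safe_mhz(bod_volts) >= clock_mhz;
}
```

Under these assumptions, bod_protects(4.3, 20.0) is false: a 4.3 V threshold still lets Vcc sag below the 4.5 V that the curve asks for at 20 MHz, while the same threshold does cover a 16 MHz clock.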
welshzadok wrote:
I was always under the impression that a device would start from power up into a known state, so if the Flash is not corrupted, why does the software fail after so many restarts when it worked fine before?
A really slowly rising Vcc on power up might fail to trigger a power on reset in the AVR hardware. You didn't say which ATmega chip you are using. Depending on your chip, either the older MCUCSR or newer MCUSR register will have the reset flags. The data sheet has instructions on using this register. If you check this register at startup it will identify the type of reset. Not getting any reset flag when one is expected means the reset failed. If you observe what appears to be a reset (your program starts over), but you do not get any reset flag you have a software crash.
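
As a rough illustration of the reset-flag check, the sketch below decodes a byte that has been read from MCUSR. The bit positions are the ones used on the ATmega48/88/168/328 (an assumption for other parts; older chips use MCUCSR and may have extra flags), and the enum and function names are invented for the example. It is plain C so it can be tried on a PC; on the AVR you would read MCUSR from <avr/io.h> as early as possible and then clear it.

```c
#include <stdint.h>

/* MCUSR bit positions on ATmega48/88/168/328 (assumed for other parts). */
enum { PORF_BIT = 0, EXTRF_BIT = 1, BORF_BIT = 2, WDRF_BIT = 3 };

enum reset_cause {
    RESET_NONE,       /* no flag set: the program restarted without a reset */
    RESET_POWER_ON,
    RESET_BROWN_OUT,
    RESET_EXTERNAL,
    RESET_WATCHDOG
};

/* Power-on is tested first because a normal power-up can set the
   brown-out flag as well while Vcc ramps through the BOD level. */
enum reset_cause reset_cause(uint8_t mcusr)
{
    if (mcusr & (1u << PORF_BIT))  return RESET_POWER_ON;
    if (mcusr & (1u << BORF_BIT))  return RESET_BROWN_OUT;
    if (mcusr & (1u << EXTRF_BIT)) return RESET_EXTERNAL;
    if (mcusr & (1u << WDRF_BIT))  return RESET_WATCHDOG;
    return RESET_NONE;
}
```

A result of RESET_NONE at startup corresponds to the "program starts over but no flag is set" case described above.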

You could also get unstable results if your AVR clock doesn't have enough delay time in the AVR clock fuse settings for it to stabilize.

Some AVR chips that have a full swing clock option work more reliably in noisy environments when full swing is enabled in the fuse settings.


You might be facing a faulty AVR chip. ATMEL admits that some batches of chips were produced with faults. I once had a problem with some code that worked at first, and then behaved pretty wildly when a single line was changed. Another single-line change and the code worked again. It took me a while to find out that even swapping the order of two insignificant lines made the difference between working and non-working code. Then I transferred the code to another board with the same chip but from another batch, and never faced that problem again.


Quote:

I have dumped out the Flash contents both before and after the device stops working, and they are both identical! (I have already made sure the brown-out detector is enabled to prevent any possible FLASH corruption.)

IME "flash corruption" is rare in an AVR. In fact, I'll go farther and say I've >>never<< seen flash corruption in 10 years of many production AVR apps with total AVRs probably in six figures.

However--I'm not a bootloader person, so there are no SPMs in my code. As with EEPROM "corruption":

>> When you have very heavy noise conditions (where Gnd and Vcc and other levels violate Absolute Maximum Ratings) and/or you get to an unprotected brown-out state (Gnd-Vcc relationship lower than safe operating level for your clock speed and chip spec), then >>anything can happen<<. Your AVR can/will run amok. Perhaps it is just the ones that we can notice or trap but a symptom of EEPROM (or flash) corruption ends up to be reaching the code that does the NVM write.

On the surface, I'd say you have something uninitialized somewhere. Perhaps a GP register or SRAM location that is indeed zero with a dead-cold reset, but under other conditions is not set to a known value upon reset.

Agreed, the first step is to trap (and log or display) the reset cause. (Be sure to clear after grabbing it) I construct an EEPROM buffer for this in these situations so I can get several.
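
One possible shape for such an EEPROM buffer, sketched in C so it can be tried on a PC: the EEPROM is faked with an array, and on a real AVR the two small helpers would wrap avr-libc's eeprom_read_byte() and eeprom_update_byte(). The layout (one index byte followed by eight slots) and all names are assumptions made for the example, not a description of anyone's production code.

```c
#include <stdint.h>

#define LOG_SLOTS  8u      /* hypothetical log depth */
#define SLOT_EMPTY 0xFFu   /* erased EEPROM reads back as 0xFF */

/* PC stand-in for the EEPROM. Byte 0 holds the next slot index,
   bytes 1..LOG_SLOTS hold the logged reset flags. */
static uint8_t fake_eeprom[LOG_SLOTS + 1u];
static uint8_t ee_read(uint8_t addr)             { return fake_eeprom[addr]; }
static void    ee_write(uint8_t addr, uint8_t v) { fake_eeprom[addr] = v; }

/* Put the fake EEPROM into the erased state. */
void log_init(void)
{
    for (uint8_t a = 0; a <= LOG_SLOTS; a++)
        ee_write(a, SLOT_EMPTY);
}

/* Store one set of reset flags in the next slot, wrapping when full.
   On hardware: call once at startup with the MCUSR value, then clear MCUSR. */
void log_reset_cause(uint8_t mcusr_flags)
{
    uint8_t next = ee_read(0);
    if (next >= LOG_SLOTS)
        next = 0;                                  /* erased or corrupt index */
    ee_write((uint8_t)(1u + next), mcusr_flags);
    ee_write(0, (uint8_t)((next + 1u) % LOG_SLOTS));
}

/* Read back slot 0..LOG_SLOTS-1 for display or dumping. */
uint8_t log_get(uint8_t slot)
{
    return ee_read((uint8_t)(1u + slot));
}
```

With eight slots the last eight resets survive, which is enough to see the sequence of causes across several power cycles.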

A subtle lockup condition is brown-out with a separate AVcc supply. If Vcc >>or<< AVcc drops, you get a BOD trip. To unlatch it >>both<< Vcc and AVcc have to drop below the BOD reset level. This can be tested in practice by using a long power down and restart from dead-dead.

You can put lipstick on a pig, but it is still a pig.

I've never met a pig I didn't like, as long as you have some salt and pepper.


Thanks everyone for your thoughts. I did enable brown-out at the maximum voltage, 4.3V, and yes, the longest reset time on power up.

Quote:
the first step is to trap (and log or display) the reset cause. (Be sure to clear after grabbing it) I construct an EEPROM buffer for this in these situations so I can get several.

Yes, that makes sense, good idea of logging the reset cause into EEPROM.


Which chip are you using and what clock frequency?

John Samperi

Ampertronics Pty. Ltd.

www.ampertronics.com.au

* Electronic Design * Custom Products * Contract Assembly


Quote:
Which chip are you using and what clock frequency?

20 MHz.
Tested fault with ATMega48, 88 and 328, different batches, all exhibit the same behaviour!

It's a bug which I can demonstrate, but as of yet, not find the cause.


Are you using FULL SWING mode for the oscillator? At times the low power oscillator mode can cause "issues" at high frequencies.

John Samperi



js wrote:
Are you using FULL SWING mode for the oscillator?

The current-saving crystal oscillator modes are very sensitive to ripple on VCC.

Without the full swing mode you need a very stable, clean, noise-free VCC.
Naturally all (A)VCC and GND lines must be connected and bypassed with 100nF SMD on every pair.

Also you should always select the longest reset time.
I have seen crystals that need over 10ms to run stably.

Peter


js wrote:
Are you using FULL SWING mode for the oscillator? At times the low power oscillator mode can cause "issues" at high frequencies.

Yes, full swing oscillator, longest reset time.

I have discovered that if I take out a section of code containing an IJMP and table, the bug goes away (at least with the code I am working on at the moment). Whether this is just a red herring or not I don't know yet. I'm posting a new topic on IJMP to see if anyone else has ever run into irregularities.

Z.


Once you use IJMP in Asm or function pointers in C you are in a minefield. It's all too easy for pointers to be corrupted then the code goes off into la-la-land

(I haven't checked but I'm willing to bet MISRA doesn't allow f-ptr's to be used in C)

EDIT: actually MISRA rule 104 says:

Quote:
MISRA C:1998 rule 104 reads "Non-constant pointers to functions shall not be used"

So I guess a table of f_ptrs fixed in PROGMEM would be OK - but that raises the question of what happens if the index goes out of bounds - but I guess MISRA have that covered elsewhere?
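
For what it's worth, a constant table with a range check on the index might look something like the C sketch below. The handler functions and the dispatch() name are made up for the example, and this says nothing about what MISRA itself requires. On an AVR the table would typically also be placed in flash with PROGMEM; plain const is used here so the sketch builds anywhere.

```c
#include <stddef.h>

typedef int (*handler_fn)(int);

/* Hypothetical handlers standing in for the real jump targets. */
static int h_double(int x) { return 2 * x; }
static int h_negate(int x) { return -x; }
static int h_zero(int x)   { (void)x; return 0; }

/* The pointers are const, so the table cannot be changed at run time. */
static handler_fn const handlers[] = { h_double, h_negate, h_zero };
#define HANDLER_COUNT (sizeof handlers / sizeof handlers[0])

/* Calls handlers[index](arg), or returns fallback when the index is out
   of range instead of jumping to whatever lies beyond the table. */
int dispatch(size_t index, int arg, int fallback)
{
    if (index >= HANDLER_COUNT)
        return fallback;
    return handlers[index](arg);
}
```

The check in dispatch() is the C counterpart of range-checking the offset before adding it to Z and executing IJMP in assembler.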

Last Edited: Tue. Apr 6, 2010 - 09:12 PM

Quote:

Quote:
the first step is to trap (and log or display) the reset cause. (Be sure to clear after grabbing it) I construct an EEPROM buffer for this in these situations so I can get several.

Yes, that makes sense, good idea of logging the reset cause into EEPROM.


What are the results?



Since you are running at the maximum frequency of 20MHz, you could try using the clock prescaler to run at 10MHz or 5MHz.

If it really is a CPU fault, it may disappear at the lower CPU clock.

Peter


Ahem, **embarrassed face**, it seems that the programmer must leave the AVR registers zeroed, but I checked and my offset register (added to Z) was not initialised. So after a reset / power down or whatever if the offset reg was not zero or something in the range of the table, the code crashed.

I suppose I assumed the AVR CPU regs were zeroed on reset like all the other hardware registers, I/Os and so on, but it seems not...


Quote:
but it seems not...
Definitely not. :) That's why I clear all RAM and registers at power up just after setting up the stack.

In case it helps

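;Clear all memory
;Needs SRAM_START and RAMEND from the part's include file.
;temp/temp1 must be upper registers (r16..r31); these .defs are only an example.
.def	temp	= r16
.def	temp1	= r17
clr_mem:
	ldi		yh,high(SRAM_START)
	ldi		yl,low(SRAM_START)
	clr		temp					;Value to store (zero)
	ldi		temp1,high(ramend-1)	;Leave last 2 addresses,return address
clr_mem_lp:
	st		y+,temp					;Clear byte, advance pointer
	cpi		yl,low(ramend-1)		;16 bit compare of Y with ramend-1
	cpc		yh,temp1
	brne	clr_mem_lp

;Clear all registers (register file is mapped at data addresses 0x00 to 0x1F)
clr_regs:
	clr		zh
	ldi		zl,0x1d					;Start at YH
	clr		r0
clr_reg_lp:
	st		z,r0					;Clear register
	dec		zl						;Decrement pointer
	brne	clr_reg_lp				;Done at 0, r0 and Z are already zero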

John Samperi



I think we are experiencing the same problem. See https://www.avrfreaks.net/index.p... for more details.

The solution posted on that topic worked for the first batch of boards, but for some reason the new batch of boards that we have received freezes when we clear the registers and SRAM. This forces us to look for an alternative solution.