Accidental EEprom Erasure Question

ref: xmega128A3U

 

This topic may have been beaten to death, but I wanted to query the group for some guidance. We have a fielded product (~200 units), and about 5 units came back to the factory with the EEprom completely erased. This is not simple corruption of a few bytes but all locations blanked (0xFF).  I am aware that it is important to enable the BOD fuse (which we do) to prevent possible corruption, but is it possible for the entire EEprom to become erased if the BOD fuse was not programmed?  I ask because, due to the construction of the product, gaining access to the circuitry in order to interrogate the micro for its fuse settings is not trivial.

 

So, is there any other mechanism for erasing the EEprom other than using the NVM controller?  Has anyone else in the group experienced the same problem?  At the moment we are certainly puzzled as to how this could have happened.

Thanks for your help.

Jim

rfdes wrote:

I am aware that it is important to enable the BOD fuse (which we do) to prevent possible corruption, but is it possible for the entire EEprom to become erased if the BOD fuse was not programmed? 

 

Having BOD disabled is already a mistake. Until you correct that, there is no point in starting any other verification.

kabasan wrote:

rfdes wrote:

 

I am aware that it is important to enable the BOD fuse (which we do) to prevent possible corruption, but is it possible for the entire EEprom to become erased if the BOD fuse was not programmed? 

 

Having BOD disabled is already a mistake. Until you correct that, there is no point in starting any other verification.

 

I am sorry if I was not clear, but our product DOES have the BOD fuse enabled.  My concern was that some of the units may have left the factory without the BOD fuse programmed, so we need to investigate.  My question was whether a disabled BOD might result in a complete erasure of the EEprom.

 

I'm guessing some routine you have for writing EEPROM in the application is running rogue and writing it all. Is the code confidential? 

clawson wrote:

I'm guessing some routine you have for writing EEPROM in the application is running rogue and writing it all. Is the code confidential? 

 

Hi -

I thought the same thing and will attempt to verify whether this is the case.  As for the code being confidential: yes, I am not allowed to share it.

I appreciate the advice.

I agree. Sounds like a code problem.

If you don't know my whole story, keep your mouth shut.

If you know my whole story, you're an accomplice. Keep your mouth shut. 

A static analyzer can produce a strong hint about the source code defect that leads to a program counter fault (runaway), once you have waded through the static analysis false positives.

In lieu of that:

  • some linters can produce a weak hint that can aid whoever is redoing the source code review
  • an assertion may catch the defect
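
As an illustration of the assertion idea, here is a minimal sketch of a sanity check that could sit in front of every EEPROM write or erase. Everything in it is hypothetical: the function name `ee_request_is_sane`, the key value, and the assumed 2 KB EEPROM size are placeholders, not anything from the OP's code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define EE_SIZE      2048u   /* assumed EEPROM size in bytes */
#define EE_WRITE_KEY 0xA55Au /* hypothetical key passed only by legitimate callers */

/* Returns true only when the request carries the expected key and the
 * address range lies entirely inside the EEPROM. Runaway code that jumps
 * into the write path with stale register contents is unlikely to pass. */
bool ee_request_is_sane(uint16_t key, size_t addr, size_t len)
{
    if (key != EE_WRITE_KEY) {
        return false;               /* caller did not come through the front door */
    }
    if (len == 0 || len > EE_SIZE) {
        return false;               /* empty or oversized request */
    }
    if (addr > EE_SIZE - len) {
        return false;               /* range would run past the end */
    }
    return true;
}
```

A real build might log the failure and force a reset instead of quietly returning false; either way, a tripped check points at the path the rogue code took.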

 

"Dare to be naïve." - Buckminster Fuller

How do you know the entire EEPROM  has been erased?  Did you disassemble a unit to verify?

Do you execute CLI (disable interrupts) before attempting any EEPROM writes?  This is important (at least with the classic AVRs).

 

where about 5 units came back to the factory with the EEprom completely erased. This is not simple corruption of a few bytes, etc. but all locations blanked (0xFF)

 

Sounds somewhat like they were never programmed at all...is this possible? Did somebody forget?   What do these values do?  Are there default values used if the EEPROM is not set? 

At power-up, is a checksum run on the EEPROM to verify it is OK, and if it fails, does the code use some backup value, alert the operator, or sound a klaxon?
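
A minimal sketch of such a power-up check, assuming a small calibration record protected by a CRC-16 (CCITT-FALSE variant: polynomial 0x1021, initial value 0xFFFF). The record layout, the default values, and the function names are hypothetical; the code works on a plain byte buffer so it can be tried on a PC before being wired to the real EEPROM read routine.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical calibration record; the last field is a CRC over the others. */
typedef struct {
    uint16_t gain;
    int16_t  offset;
    uint16_t crc;
} cal_record_t;

#define CAL_DEFAULT_GAIN   1000u /* hypothetical default */
#define CAL_DEFAULT_OFFSET 0     /* hypothetical default */

/* Bitwise CRC-16/CCITT-FALSE: poly 0x1021, init 0xFFFF, no reflection. */
uint16_t crc16_ccitt(const uint8_t *data, size_t len)
{
    uint16_t crc = 0xFFFF;
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)(data[i] << 8);
        for (int bit = 0; bit < 8; bit++) {
            crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x1021)
                                 : (uint16_t)(crc << 1);
        }
    }
    return crc;
}

/* True when the stored CRC matches the CRC of the payload fields. */
bool cal_record_valid(const cal_record_t *rec)
{
    uint16_t crc = crc16_ccitt((const uint8_t *)rec, offsetof(cal_record_t, crc));
    return crc == rec->crc;
}

/* Copies the record out of an EEPROM image. Returns true if the stored
 * record was valid; otherwise fills in defaults and returns false so the
 * caller can alert the operator. */
bool cal_load(const uint8_t *ee_image, cal_record_t *out)
{
    memcpy(out, ee_image, sizeof *out);
    if (cal_record_valid(out)) {
        return true;
    }
    out->gain   = CAL_DEFAULT_GAIN;
    out->offset = CAL_DEFAULT_OFFSET;
    out->crc    = crc16_ccitt((const uint8_t *)out, offsetof(cal_record_t, crc));
    return false;
}
```

With this particular CRC, a fully blank record (all 0xFF) does not validate, so a unit that was never calibrated would be flagged at its first power-up rather than in the field.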

When in the dark remember-the future looks brighter than ever.   I look forward to being able to predict the future!

rfdes wrote:
I am aware that it is important to enable the BOD fuse (which we do) to prevent possible corruption ...
Is the BOD continuous or sampled?

Reason for asking: the bulk capacitance may be inadequate for the chosen values of BODACT and BODPD (ACT: active, PD: power-down).

rfdes wrote:
I ask because, due to the construction of the product, gaining access to the circuitry in order to interrogate the micro for its fuse settings is not trivial.

33.11.3.2 Read Fuses

The read fuses command is used to read the fuses from software.

1. Load the NVM ADDR register with the address of the fuse byte to read.

2. Load the NVM CMD register with the read fuses command.

3. Set the CMDEX bit in the NVM CTRLA register. This requires the timed CCP sequence during self-programming.

The result will be available in the NVM DATA0 register. The CPU is halted during the complete execution of the command.
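
A minimal sketch of the three steps above, plus a decoder for the BOD-related fields. The bit positions (BODACT in FUSEBYTE5 bits 5:4, BODLEVEL in FUSEBYTE5 bits 2:0, BODPD in FUSEBYTE2 bits 1:0) and the mode encoding are assumptions recalled from the XMEGA AU manual and should be checked against the datasheet. `read_fuse_byte` is a hypothetical helper name; the register and constant names are believed to match the avr-libc device header, and that part is only compiled when targeting an XMEGA.

```c
#include <stdint.h>

/* BOD operation modes as encoded in the BODACT and BODPD fuse fields
 * (assumed encoding; verify against the datasheet). */
typedef enum {
    BOD_MODE_RESERVED   = 0,
    BOD_MODE_SAMPLED    = 1,
    BOD_MODE_CONTINUOUS = 2,
    BOD_MODE_DISABLED   = 3
} bod_mode_t;

/* FUSEBYTE5 bits 5:4: BOD mode in active and idle (assumed layout). */
static inline bod_mode_t bod_active_mode(uint8_t fusebyte5)
{
    return (bod_mode_t)((fusebyte5 >> 4) & 0x03);
}

/* FUSEBYTE2 bits 1:0: BOD mode in power-down (assumed layout). */
static inline bod_mode_t bod_powerdown_mode(uint8_t fusebyte2)
{
    return (bod_mode_t)(fusebyte2 & 0x03);
}

/* FUSEBYTE5 bits 2:0: raw BODLEVEL field (assumed layout). */
static inline uint8_t bod_level_field(uint8_t fusebyte5)
{
    return (uint8_t)(fusebyte5 & 0x07);
}

#if defined(__AVR_XMEGA__)
#include <avr/io.h>
#include <avr/interrupt.h>

/* Hypothetical helper following the Read Fuses procedure quoted above. */
uint8_t read_fuse_byte(uint8_t index)
{
    uint8_t sreg = SREG;
    cli();                                  /* keep the CCP window uninterrupted */
    while (NVM.STATUS & NVM_NVMBUSY_bm) { } /* wait for any pending NVM operation */

    NVM.ADDR0 = index;                      /* 1. address of the fuse byte */
    NVM.ADDR1 = 0;
    NVM.ADDR2 = 0;
    NVM.CMD   = NVM_CMD_READ_FUSES_gc;      /* 2. read fuses command */
    CCP       = CCP_IOREG_gc;               /* 3. timed sequence, then CMDEX;   */
    NVM.CTRLA = NVM_CMDEX_bm;               /*    needs optimization enabled so */
                                            /*    the store lands within 4 cycles */
    uint8_t value = NVM.DATA0;
    NVM.CMD = NVM_CMD_NO_OPERATION_gc;
    SREG = sreg;                            /* restore interrupt state */
    return value;
}
#endif
```

If the unit has any existing communication or diagnostic channel, reporting fuse bytes 2 and 5 over it would allow the BOD configuration to be checked without opening the product. An unprogrammed fuse byte reads as 0xFF, which under the assumed encoding decodes as BOD disabled.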

rfdes wrote:
So, is there any other mechanism for erasing the EEprom other than using the NVM controller?
Yes

A method of cracking an MCU is power glitching; it is typically used to corrupt the program counter.

The XMEGA BOD is of more than adequate quality, though it isn't a reference-grade BOD.

An external BOD is more complete (increased speed, reduced current), more precise, and more correct (greater accuracy).

 


ATxmega128A3U - 8-bit AVR Microcontrollers

About Us – NewAE Technology Inc.

...

The ChipWhisperer® system aims to fix this, by releasing an entirely open-source toolchain for performing attacks such as power analysis and clock or power glitching.

...

 

"Dare to be naïve." - Buckminster Fuller

 back to the factory with the EEprom completely erased.

 

Alright, please bear with me for a moment.

It was a long night shift, I'm tired, and I'm not going to verify / look this up.

 

I thought an erased EEPROM is all FF's.

 

If there were rogue SW writing to the EEPROM, wouldn't the EEPROM NOT contain all FF's? (One can't overwrite an EEPROM location with FF.)

 

Perhaps splitting hairs, but wouldn't the rogue SW have to erase the entire EEPROM, not write it?

 

It might make a difference on where one focuses their attention within the code.

 

The EEPROM is accessed, at the chip level, (IIRC), on a page basis.

That means that if the OP has read back the EEPROM, and it is all FF's throughout every page, then the rogue SW had to issue a page erase for every page that originally had actual data in it.

 

I guess that might not be the case if the NVM controller has an erase entire EEPROM command.

(IIRC that command did not work in the early XMegas, one had to erase the EEPROM page by page, even if it was behind the scenes of an "erase all" command.)
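
The erase-versus-write distinction can be sketched with a small PC-side simulation. It models a write-only operation as only able to clear bits (1 to 0) and an erase as setting a whole page to 0xFF, then counts blank pages in a dumped image. The 32-byte page and 2 KB size are assumptions for this part to be checked against the datasheet, and the function names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define EE_PAGE_SIZE 32u   /* assumed EEPROM page size in bytes */
#define EE_SIZE      2048u /* assumed EEPROM size in bytes */

/* Write without erase: bits can only go from 1 to 0, so the result is the
 * AND of the old and new value. Writing 0xFF therefore changes nothing. */
void ee_sim_write_only(uint8_t *cell, uint8_t value)
{
    *cell &= value;
}

/* Page erase: every byte of the page returns to 0xFF. */
void ee_sim_erase_page(uint8_t *ee, size_t page)
{
    for (size_t i = 0; i < EE_PAGE_SIZE; i++) {
        ee[page * EE_PAGE_SIZE + i] = 0xFF;
    }
}

/* True when every byte of the given page reads 0xFF. */
bool ee_page_is_blank(const uint8_t *ee, size_t page)
{
    for (size_t i = 0; i < EE_PAGE_SIZE; i++) {
        if (ee[page * EE_PAGE_SIZE + i] != 0xFF) {
            return false;
        }
    }
    return true;
}

/* Counts the blank pages in an image of `size` bytes. */
size_t ee_count_blank_pages(const uint8_t *ee, size_t size)
{
    size_t blank = 0;
    for (size_t page = 0; page < size / EE_PAGE_SIZE; page++) {
        if (ee_page_is_blank(ee, page)) {
            blank++;
        }
    }
    return blank;
}
```

One caveat: if the firmware's normal write routine uses a combined erase-and-write operation, then rogue code feeding 0xFF data through that routine would also leave cells blank, so the write path cannot be ruled out on this argument alone.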

 

I wonder, as avrcandies alluded to, if this is human error based upon not having actually programmed the EEPROM initially?

What is the post-build, pre-ship, last test of the software & board?

Would it check for this error, or would the board appear to pass?

 

Anything unique to the 5 failed boards' environment?

(Different power supplies, different EMI environment, etc.)

 

Anything unique about the 5 failed boards in terms of hours of operation?

(These 5 boards have/had more hours of use than the other working boards?)

 

Can you trace the final build steps back to a single technician who processed those boards?

(And perhaps had a bad day, was tired, had other things on their mind, ...  )

 

Are there ongoing failures, (e.g. a new one every month)?

 

Do you have boards in house, running 24/7 under conditions as close to actual operating conditions as possible?

 

Was it some unique sequence of User inputs, (or sensor inputs), that perhaps triggered the event?

(Why would the User ever do … !!!  It couldn't happen.... Famous last words!)

 

I think there is lots of global analysis to do before one starts dissecting the code.

(Which one might well still need to do, but it is helpful to have a better idea of the chain of events that triggered the failure.)

 

JC

  

avrcandies wrote:

Sounds somewhat like they were never programmed at all...is this possible? Did somebody forget?   What do these values do?  Are there default values used if the EEPROM is not set? 

At power-up, is a checksum run on the EEPROM to verify it is OK, and if it fails, does the code use some backup value, alert the operator, or sound a klaxon?

 

Never getting programmed is certainly possible.  There are default values that get set during initial boot but there are a handful of CAL values that the technician may have forgotten to program.  I've got some detective work to do in an effort to verify this possibility.

Thanks for all the tips to work through as possible reasons for our issue.  If I am successful in tracking down our issue, I will be sure to post my solution.

Take care - Jim

gchapman wrote:

rfdes wrote:

So, is there any other mechanism for erasing the EEprom other than using the NVM controller?

Yes

 

A method of cracking an MCU is power glitching; it is typically used to corrupt the program counter.

The XMEGA BOD is of more than adequate quality, though it isn't a reference-grade BOD.

An external BOD is more complete (increased speed, reduced current), more precise, and more correct (greater accuracy).

 

So a TOTAL EEprom erasure with a power glitch is possible?  Has anyone actually verified this possibility?  I could see a power glitch corrupting a handful of random EEprom values, but not all of them; that may be possible, but it seems statistically improbable.

Thanks for the input.