Best way to log MCU state?

I have a two-board AVR system, where it is possible (though unlikely) that one of the boards may reset unexpectedly: brownout, watchdog, stray neutrino (i.e., SEU), etc.

I need to be able to recover from one of these and keep the overall system functional, because the state of one board is related to the state of the other. Note that the two boards can communicate with each other via a wireless link.

I am familiar w/ how to use the MCUSR register to nominally determine the source of a reset. My question is how best to revert to the previous state upon recognizing source of reset. As I see it, I have two options.
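For readers following along, the reset-source check could be sketched roughly as below. This is only an illustration: the bit positions are the ones used on many megaAVR parts (e.g. ATmega328P) and should be checked against the datasheet, and `classify_reset` and `should_restore_state` are hypothetical names. The register value is passed in as a parameter so the sketch also runs on a host.

```c
#include <stdint.h>

/* Assumed MCUSR bit positions (typical megaAVR); verify for your part. */
#define SIM_PORF  0  /* power-on reset flag  */
#define SIM_EXTRF 1  /* external reset flag  */
#define SIM_BORF  2  /* brown-out reset flag */
#define SIM_WDRF  3  /* watchdog reset flag  */

typedef enum {
    RESET_POWER_ON,
    RESET_BROWN_OUT,
    RESET_WATCHDOG,
    RESET_EXTERNAL,
    RESET_UNKNOWN
} reset_cause_t;

/* Power-on is tested first because other flags can be set alongside it. */
reset_cause_t classify_reset(uint8_t mcusr)
{
    if (mcusr & (1u << SIM_PORF))  return RESET_POWER_ON;
    if (mcusr & (1u << SIM_BORF))  return RESET_BROWN_OUT;
    if (mcusr & (1u << SIM_WDRF))  return RESET_WATCHDOG;
    if (mcusr & (1u << SIM_EXTRF)) return RESET_EXTERNAL;
    return RESET_UNKNOWN;
}

/* Policy from the post: only a power-on reset starts from scratch. */
int should_restore_state(reset_cause_t cause)
{
    return cause != RESET_POWER_ON;
}
```

On the target, the value passed in would be a copy of MCUSR taken early in startup, with the register cleared afterwards so that stale flags do not carry over to the next reset.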

Option 1: Log the current state to a fixed address in EEPROM. At power up, if not detected as a power on reset, check logged state, and revert to that state.

Pros: fairly foolproof, self contained
Cons: Requires relatively frequent updates to a particular address in EEPROM.

Altered approach: choose the address of the last known state at run time, either pseudo-randomly or deterministically (e.g., incrementally), and log that state address in a special section of EEPROM reserved for personality data. This pointer will be written at every power up, but only once per power up, whereas the state itself changes multiple times. The selected address holding the state data will be written probably five to ten times per power up, but given that I have 4kB of EEPROM, it will take close to 4000 power cycles before I revisit that address. If each address is written 10 times per power cycle and has an endurance of 100,000 writes, the state area would technically survive 40,000,000 power cycles ((100,000 / 10) * 4000). Obviously, the tighter limit is the personality-space address that holds the location of the state data: 100,000 power cycles.
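The arithmetic above can be checked with a few lines of C. This is just a worked calculation using the figures from the post; the function names are made up for illustration.

```c
#include <stdint.h>

/* Power cycles the state-data area survives: each cell takes
 * writes_per_cycle writes, and is revisited once every `cells` cycles. */
uint32_t data_area_power_cycles(uint32_t endurance,
                                uint32_t writes_per_cycle,
                                uint32_t cells)
{
    return (endurance / writes_per_cycle) * cells;
}

/* The pointer cell is written once per power cycle, so its life in
 * power cycles equals the raw endurance figure. */
uint32_t pointer_cell_power_cycles(uint32_t endurance)
{
    return endurance;
}

/* The scheme as a whole lasts only as long as its weakest cell. */
uint32_t system_power_cycles(uint32_t endurance,
                             uint32_t writes_per_cycle,
                             uint32_t cells)
{
    uint32_t data = data_area_power_cycles(endurance, writes_per_cycle, cells);
    uint32_t ptr  = pointer_cell_power_cycles(endurance);
    return (data < ptr) ? data : ptr;
}
```

With an endurance of 100,000, 10 writes per cycle and 4000 cells, the data area comes out at 40,000,000 power cycles and the pointer cell at 100,000, so the pointer cell is the bottleneck.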

Option 2: Have each board track the other board's state. After a [non-power up] reset, query the other board to determine what state the reset board should revert to.

Pros: no non-volatile memory limitations
Cons: Won't work if both boards reset simultaneously. Requires both boards to be awake. In my particular application, one of the boards is almost always asleep to conserve power. It would introduce an additional power drain to wake up periodically (e.g., 100ms every 1s) to listen for potential state queries.

I am looking for feedback regarding these two options, or for another option that I have not considered. Which seems better?

Science is not consensus. Science is numbers.

In some conditions (watchdog but not brownout) you can rely on the RAM holding its previous contents so you could consider putting the current state there. However if it has a brownout anything could have happened so I guess you would need a different strategy for that.

How often do these state changes occur (microseconds, milliseconds, centiseconds, deciseconds, seconds, minutes, hours, days, months, years, millennia)? For all but the first few I would have thought you could use "wear levelling" over a number of EEPROM locations, not worry about the 100,000 cycle limit, and still provide an acceptable lifetime.

The state machine variable changes fairly frequently, but the "critical" state of the system changes infrequently, probably on the order of hours or days on average. "Wear leveling" is what I was trying to describe in my alternate scenario for option 1 above. However, it still necessitates writing to the record of the state data location every time. I suppose I could pre-select 10 (or 20, or any number) addresses. When they are in use, they contain data. When they are not, they contain some dummy value (e.g., 0xFF). At reset, the chip would cycle through the pre-determined addresses until it gets a non-0xFF value. It then reads its former state, and logs it to the next address in the list. Each power up will cycle through the list.
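A rough sketch of that slot list, for illustration only. The EEPROM is replaced by a plain array so the code runs on a host; on the AVR the two accessor functions would wrap the avr-libc EEPROM routines instead. All names here (`slots_boot`, `slots_update`, etc.) are hypothetical, and the sketch assumes a one-byte state that is never 0xFF.

```c
#include <stdint.h>

#define SLOT_COUNT 10u
#define SLOT_EMPTY 0xFFu            /* erased EEPROM reads back as 0xFF */

uint8_t sim_eeprom[SLOT_COUNT];     /* host stand-in for the EEPROM slots */
static uint8_t current_slot;        /* slot in use for this power-up */

static uint8_t ee_read(uint8_t slot)             { return sim_eeprom[slot]; }
static void    ee_write(uint8_t slot, uint8_t v) { sim_eeprom[slot] = v; }

void slots_erase_all(void)
{
    for (uint8_t i = 0; i < SLOT_COUNT; i++) ee_write(i, SLOT_EMPTY);
}

/* Index of the first slot holding data, or -1 if all are empty. */
int slots_find_active(void)
{
    for (uint8_t i = 0; i < SLOT_COUNT; i++)
        if (ee_read(i) != SLOT_EMPTY) return (int)i;
    return -1;
}

/* Call once after reset. Moves the saved state to the next slot and
 * blanks the old one. Returns 1 if a previous state was recovered. */
int slots_boot(uint8_t *state_out)
{
    int active = slots_find_active();
    if (active < 0) {               /* nothing logged yet */
        current_slot = 0;
        return 0;
    }
    *state_out = ee_read((uint8_t)active);
    current_slot = (uint8_t)((active + 1) % SLOT_COUNT);
    ee_write(current_slot, *state_out);     /* write the new copy first */
    ee_write((uint8_t)active, SLOT_EMPTY);  /* then retire the old slot */
    return 1;
}

/* Log a state change during the current power-up. */
void slots_update(uint8_t state)
{
    ee_write(current_slot, state);
}
```

One known gap in this sketch: if a reset lands between the two writes in `slots_boot`, two slots hold data and the scan may pick the stale one. A sequence counter stored alongside the state would be one way to resolve that.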

Is there a better, or more accepted way of wear leveling?

Using the erased state of EEPROM (0xFF) as a "not used yet" marker is fairly common practice for simple wear levelling. However, also Google the term, as a LOT of research work has gone into good strategies because many NAND/NOR filing systems rely on it.

hobbss wrote:
...probably an average on the order of hours or days.

IIRC the EEPROM is rated to 100k write cycles. That's a write every hour of every day for 11.4 years. Why do you need to worry about wear levelling?

#1 This forum helps those that help themselves

#2 All grounds are not created equal

#3 How have you proved that your chip is running at xxMHz?

#4 "If you think you need floating point to solve the problem then you don't understand the problem. If you really do need floating point then you have a problem you do not understand." - Heater's ex-boss

I have learned never to make assumptions about "normal" usage with these particular end-users.

Also, during development, there will be much more frequent power ups, resets, and state changes. I would rather not thrash the EEPROM on my development boards.

How about an SPI FRAM? 1 Trillion write cycles. That should do you.

That would work great. I have used Everspin products in the past for something like that.

Unfortunately, this is a "finished" design - I am stuck with only the NVM on the AVR.

In this situation I maintain data in noinit RAM and then check the cause of the restart.

The brownout case noted above would make this approach unsuitable.

Question: what sort of resets are you getting? E.g., is it always a watchdog, or can it be a range of causes?

regards
Greg

You could just have two identical processors with the same code, synchronized to each other. Both controllers receive inputs and both have outputs which go to a hardware COMPARE unit (implemented with an FPGA), which normally allows the EXECUTIVE MPU to pass its output to the outside world.
The COMPARE unit looks for differences in outputs between the MAIN and STANDBY MPU, along with other indicators about whether MAIN or STANDBY is out of kilter.
A decision is then taken on which MPU should be EXECUTIVE.

It helps if all code runs as a Finite State Machine, so that when an ERROR is detected a full state & data dump can be done to determine what caused the problem.

No need to do anything with EEPROM, which potentially will cause its own issues in the end.

This really bullet proofs hardware. If you want to bullet proof software, the MPUs should run code written by two separate programming teams (perhaps one written in C & the other in ASM..hi hi!).
If you want a bullet proof system, you have to pay for it.

Charles Darwin, Lord Kelvin & Murphy are always lurking about!
Lee -.-
Riddle me this...How did the serpent move around before the fall?

Greg -- I'm not getting any resets now. However, I am testing in the benign environment of my lab, and am only running tests for at most a few days (for now). I want to make the system robust enough to operate unattended for weeks.

Lee -- a dual channel system would be a Cadillac solution to this problem. I used to be on a team that designed turbine controllers, and that was the exact method we used. However, this application is not safety critical, and size/power constraints are actually a higher priority than absolutely fail-proof hardware/software. While a system failure would be unfortunate, it is still more important that it fits in the defined physical envelope and lasts the specified time on battery power. I am looking for a solution that increases system robustness without any significant hardware changes. At this time, it looks like EEPROM logging with wear leveling is my best option.

Quote:
In some conditions (watchdog but not brownout) you can rely on the RAM holding its previous contents

AFAIK none of the brownout levels changes SRAM. You need (at least) a POR for that*.

I also suggest that you could use a .noinit for storing your ~persistent data.

*I do not remember which AN applies..

No RSTDISBL, no fun!

Quote:
Greg -- I'm not getting any resets now. However, I am testing in the benign environment of my lab, and am only running tests for at most a few days (for now). I want to make the system robust enough to operate unattended for weeks.

Have a look at the noinit part of http://www.nongnu.org/avr-libc/user-manual/mem_sections.html I suspect that this will give you a solution.

In case of concerns about data integrity, you could apply a checksum to the saved state.
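As an illustration of the noinit-plus-checksum idea, something along these lines might do. It is a sketch under assumptions, not the poster's actual code: the struct layout, the names and the very simple inverted-sum checksum are all invented here (a CRC would be stronger). The `.noinit` section attribute is only applied when building for AVR, so the same file also compiles on a host.

```c
#include <stdint.h>

#ifdef __AVR__
#define NOINIT __attribute__((section(".noinit")))  /* skipped by startup code */
#else
#define NOINIT                                      /* host build: ordinary RAM */
#endif

typedef struct {
    uint8_t state;     /* application state to preserve across resets */
    uint8_t seq;       /* bumped on every save */
    uint8_t checksum;  /* inverted 8-bit sum of the two fields above */
} saved_state_t;

saved_state_t saved NOINIT;

/* Inverting the sum means all-0x00 and all-0xFF RAM both fail the check. */
static uint8_t saved_checksum(const saved_state_t *p)
{
    return (uint8_t)~(uint8_t)(p->state + p->seq);
}

void save_state(uint8_t state)
{
    saved.state = state;
    saved.seq++;
    saved.checksum = saved_checksum(&saved);
}

/* Returns 1 and fills *state_out only if the stored checksum matches. */
int load_state(uint8_t *state_out)
{
    if (saved.checksum != saved_checksum(&saved)) return 0;
    *state_out = saved.state;
    return 1;
}
```

After a non-power-on reset the firmware would call `load_state` and fall back to a default state when it returns 0.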

I save a state and then let the watchdog time out for unrecoverable errors. Reset causes (type of reset + some application data) are saved in a circular log in noinit so that I can recover the last n causes of reset. If the reset is a power-on reset then I clear the log.
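A circular reset log of that kind could look roughly like the following. This is a guess at the general shape rather than the code actually described: the depth, the magic byte used to detect uninitialised RAM, and every name are assumptions made for the example.

```c
#include <stdint.h>

#ifdef __AVR__
#define NOINIT __attribute__((section(".noinit")))
#else
#define NOINIT
#endif

#define LOG_DEPTH 8u
#define LOG_MAGIC 0xA5u   /* marks the log as initialised */

typedef struct {
    uint8_t cause;        /* type of reset */
    uint8_t app_data;     /* some application data */
} reset_entry_t;

typedef struct {
    uint8_t magic;
    uint8_t head;         /* next index to write */
    uint8_t count;        /* valid entries, at most LOG_DEPTH */
    reset_entry_t entries[LOG_DEPTH];
} reset_log_t;

reset_log_t reset_log NOINIT;

void reset_log_clear(void)
{
    reset_log.magic = LOG_MAGIC;
    reset_log.head = 0;
    reset_log.count = 0;
}

/* Call at startup: wipe the log on power-on or if RAM looks invalid. */
void reset_log_boot(int is_power_on)
{
    if (is_power_on || reset_log.magic != LOG_MAGIC ||
        reset_log.head >= LOG_DEPTH || reset_log.count > LOG_DEPTH)
        reset_log_clear();
}

/* Append an entry, overwriting the oldest once the log is full. */
void reset_log_add(uint8_t cause, uint8_t app_data)
{
    reset_log.entries[reset_log.head].cause = cause;
    reset_log.entries[reset_log.head].app_data = app_data;
    reset_log.head = (uint8_t)((reset_log.head + 1u) % LOG_DEPTH);
    if (reset_log.count < LOG_DEPTH) reset_log.count++;
}

/* age 0 is the newest entry. Returns 0 if no entry that old exists. */
int reset_log_get(uint8_t age, reset_entry_t *out)
{
    if (age >= reset_log.count) return 0;
    uint8_t idx = (uint8_t)((reset_log.head + LOG_DEPTH - 1u - age) % LOG_DEPTH);
    *out = reset_log.entries[idx];
    return 1;
}
```

At startup the firmware would call `reset_log_boot` with the power-on flag, then `reset_log_add` with the decoded reset cause, so the last `LOG_DEPTH` causes can be read back later.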

regards
Greg