EEPROM variable keeps getting corrupted

Go To Last Post
24 posts / 0 new
Author
Message
#1
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Hi
I have written this code for a friends hot water system which incorporates a solar water heater and water heated radiators.
The system itself is fairly straight forward, there are 3 temperature sensors (DS18B20), 1 in the solar hot water heater and 2 in the hot water cylinder.
My code scans the 1 wire bus and maps the temperature sensors according to whatever the user has previously configuring using the LCD and 2 buttons.
Initially the system works fine, but after a while the Serial number for the TOP sensor (ee_TOP_sensor_Serial_addr) gets set to zero. My friend has indicated that the solar system can go above 125 degrees celsius (which is what the DS18b20 is rated up to), but seems to think that the corruption doesn't necessarily happen when this temperature exceeds the limit of the DS18B20.

Here is my main code for you to peruse at your leisure and give me any pointers as to where there might be a problem.

Main.c file attached. Thanks in advance.
Francois

Attachment(s): 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Without looking at your code, the most likely cause for corruption like this is not having the brown-out detector enabled. Please verify that you have enabled it your fuse settings.

Writing code is like having sex.... make one little mistake, and you're supporting it for life.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Yes, I have enabled my brown out detector (BODEN) in the fuses.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Silly follow up question - is BOD set at the right level for Vcc?

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Not a silly question at all :-)
I have BODLEVEL=1 so that triggers at 2.7V, Vcc is 5V so do you think I should set BODLEVEL=0?

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

sei() + eeprom_write_block()?

I suppose it is much better for all of us to use "corrupted" word with the same meaning as Atmels ANs and datasheets use it. This case has nothing to do with it.

When you set your BOD and do not run the chip over 85*C, then we are looking for a bug, not a corruption.

No RSTDISBL, no fun!

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

The end result is still corruption. I started with the most obvious culprit [lack of BOD], as history here indicates.

@fherbert: indeed try the correct setting for a 5V system. In the meantime I will now look deeper by examining the code you posted to see if anything is obviously wrong.

Writing code is like having sex.... make one little mistake, and you're supporting it for life.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I suspect the issue may be in my code somewhere, but I've re-read my code many times and can't seem to pinpoint where it would be clearing/overwriting/corrupting (insert your correct word here) the ee_TOP_sensor_Serial_addr eeprom location. Hence my uploading for fresh eyes to check :-)
I will change the fuse setting today to 4V BODLEVEL to see if this makes any difference.
Thanks for all the suggestions so far.

Also for checking, I have added an additional 'dummy' eeprom variable, and even with this dummy variable, the ee_TOP_Sensor_Serial_addr stills gets cleared, so I suspect it may be something in my code at fault.

// Create the sensor serial numbers..
uint8_t EEMEM ee_DUMMY_Sensor_Serial_addr[OW_ROMCODE_SIZE];
uint8_t EEMEM ee_TOP_Sensor_Serial_addr[OW_ROMCODE_SIZE];
uint8_t	EEMEM ee_BOTTOM_Sensor_Serial_addr[OW_ROMCODE_SIZE];
uint8_t	EEMEM ee_SOLAR_Sensor_Serial_addr[OW_ROMCODE_SIZE];

Brutte: I have sei() at the start of my main loop and don't disable interrupts, are you suggesting I need to disable interrupts to write to EEPROM?
The system saves the settings to EEPROM just fine, it's just that after a period of time the EEPROM location for ee_TOP_sensor_Serial_addr gets screwed/deleted/corrupted/reset. And this is without going through the eeprom_write_block routines, this is while the program is in the CHECKING_TEMPERATURE state.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I do not know which AVR you use, they are all similar, but some IO registers have different names. This EEPROM write procedure is from ATMega16, rev.P, p.20:

    1.Wait until EEWE becomes zero. 2.Wait until SPMEN in SPMCR becomes zero.
    3.Write new EEPROM address to EEAR (optional).
    4.Write new EEPROM data to EEDR (optional).
    5.Write a logical one to the EEMWE bit while writing a zero to EEWE in EECR
    6.Within four clock cycles after setting EEMWE, write a logical one to EEWE.

I suppose that from time to time you have ISR call between 5. and 6.

No RSTDISBL, no fun!

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I'm using the ATMega8. I assumed the eeprom.h (avr-libc) was taking care of the interrupts, but having a read, it does't appear to. I will disable interrupts while writing eeprom and see how this goes (as well changing the BODLEVEL fuse) today.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0
uint8_t current_state, lcd_menu_timeout_secs, lcd_menu_state;

These variables are accessed by both ISR and main(). They are global and uint8_t, but from this you cannot infer they are volatile and atomic. At least it is not that obvious for me.

No RSTDISBL, no fun!

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Brutte, you are correct. They should be volatile as they are changed in both main() and the Interrupt routines. I will update this as well.
Looking at the rest of the variables, I suspect that Sensor_TOP_mapping, Sensor_BOTTOM_Mapping and Sensor_SOLAR_mapping should be defined as volatile as well, as they are accessed/set by setSensorOrder and main().

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I will be very surprised if your problem's root is not in

	  delay_ms(500); // debounce

together with some noise coupling into the button's lines (aren't they connected to the board through some wires?)

Isn't there a sequence of button presses resulting in the described symptoms?

JW

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

No, you do not get it. I guess this is because you program in "C" and not in "asm". If you are not sure about something, look at *.lss

You wrote the code for main(), and three ISRs. From these I can see only main() code can be interrupted. Your ISRs are not interested in mentioned variables, so variables do not need/should not be volatiles. And it really does not matter if the function is inlined in main() or called from there - all of them can (and will) be interrupted, unless you prevent that (when it is necessary).

Think it that way:
Look through the sequence of main() code execution, line by line, stepping into the functions, and imagine some wicked ISR is interrupting your code constantly. It starts as soon as sei() apperars in main. ISR just interrupts, makes some calculations and gets back.

This ISR is so tricky it will trigger just between 5. and 6. point of EEPROM write. But, no, not now, when you are debugging. It will wait until your design is installed and just after Saturday midnight in the middle of freezing winter it triggers preventing eeprom byte write, changing preset value from 20 to -0x4E.

No RSTDISBL, no fun!

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I cringe at your use of delay functions in your ISR's. [as well as the calling of lcd_clrscr()] However this is not likely the cause of your problems.

Try bracketing your EEPROM accesses with cli/sei's in order to make them uninterruptable. As interrupting a read/write sequence could cause the problems you see. [moreso the write]

Writing code is like having sex.... make one little mistake, and you're supporting it for life.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

glitch wrote:
I cringe at your use of delay functions in your ISR's. [as well as the calling of lcd_clrscr()] However this is not likely the cause of your problems.

the problematic piece of code, as I wrote:

     delay_ms(500); // debounce 


The problem is not in the delay function, but in the comment.

glitch wrote:
Try bracketing your EEPROM accesses with cli/sei's in order to make them uninterruptable. As interrupting a read/write sequence could cause the problems you see. [moreso the write]
This has no justification. There is nothing in the interrupt which would interfere with EEPROM.

JW

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Quote:
This has no justification. There is nothing in the interrupt which would interfere with EEPROM.

Quote:
    1.Wait until EEWE becomes zero. 2.Wait until SPMEN in SPMCR becomes zero.
    3.Write new EEPROM address to EEAR (optional).
    4.Write new EEPROM data to EEDR (optional).
    5.Write a logical one to the EEMWE bit while writing a zero to EEWE in EECR
    6.Within four clock cycles after setting EEMWE, write a logical one to EEWE.

No RSTDISBL, no fun!

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Hi Wek

I'm not sure why a can't use a 500ms delay for software switch debouncing. It's a method I have used before, and in fact, many years ago it was part of our microcontroller course (using software instead of hardware debouncing).
Do you mean it's where I've put the particular delay that may be causing the problem?

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

No. I mean, a delay is not debouncing. You are not making sure that the switch has firmly made a contact, which is the primary purpose of the debouncing; you just avoid repeated action upon the bouncing, which should be only a consequence of the former.

Please re-read my previous post and answer the questions in there.

JW

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I will have to check with my friend, but from my understanding, he is able to set the serial numbers fine using the menu and buttons, but after a period of time (with no button presses) the top temperature setting displays '-' on the LCD. It is at this stage when he enters the menu and sees that the top_sensor_mapping is set to 0.

I guess the point you're trying make is that with the way I have implemented my 'switch debouncing' is not able make sure the switch is actually pressed, rather than being triggered by noise.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

wek wrote:

glitch wrote:
Try bracketing your EEPROM accesses with cli/sei's in order to make them uninterruptable. As interrupting a read/write sequence could cause the problems you see. [moreso the write]
This has no justification. There is nothing in the interrupt which would interfere with EEPROM.

JW

See Brutte's response above for the details. The interrupt itself, with or without the delay can be a problem. The delay [or other function calls] do not exasperate the problem. [but they are generally bad for other reasons]

Writing code is like having sex.... make one little mistake, and you're supporting it for life.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Ok, I have done a bit more debugging. The bug occurs when the Solar temperature reaches 100. I have downloaded the EEPROM and this is fine, so it appears that the EEPROM itself does not get corrupted/reset/cleared. However, when entering the menu and checking the serial number for the TOP sensor, it displays 0.
This made me check further into the code and I use a temporary variable to write the solar temperature to :

itoa(current_SOLAR_temperature, temp, 10);

Obviously I assumed that a char[3] would be big enough for this, but it appears the itoa function writes a string which means the temp variable would be written with a null terminator, so in the case of the solar temperature being above 99, the length on the temp variable needs to be at least 4 chars, not 3.

It really helps debugging yourself instead of relying on your friend telling you what is wrong :-)

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

fherbert wrote:
It really helps debugging yourself instead of relying on your friend telling you what is wrong :-)

How true. The very first line of the would be "The debugger's guide to human and machine psychology"... :-P

fherbert wrote:
Obviously I assumed that a char[3] would be big enough for this

Oh, I've been completely wrong, and am therefore "very surprised"... ;-) However, I still maintain that not using proper debouncing for the buttons you are asking for troubles in the future.

Glitch, Brutte, the originally described symptoms were "writes to EEPROM inadvertently" while the mechanism you are describing results in "sometimes fails to write to EEPROM". And, the avr-libc eeprom library is written interrupt-safe, please look it up. It would be surprising if it would not be so, wouldn't it.

JW

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

@wek:

Quote:
And, the avr-libc eeprom library is written interrupt-safe, please look it up. It would be surprising if it would not be so, wouldn't it.

You are right, eeprom_write_byte() disables SREG_I so that 5. and 6. steps are atomic. Then this must not have been the cause of the trouble.

@fherbert:
JTAG +

No RSTDISBL, no fun!