We have an application running on an ATmega328PB. It has been in development for several months and we have just started testing it in-house on multiple pieces of hardware. In doing so, we have discovered a bizarre issue that I just can't figure out.
Every X power-on-resets (where X varies a lot: so far as low as 20 and as high as 100), the device will enter a state where the I/O ports do not respond to writes or reads. Some examples:
- ADC5, with a 3.3V signal applied, reads as 0 (this result is inferred from the application behaviour, not observed directly).
- There are two status LEDs that are connected to PORTC from the 5V rail (with appropriate resistors). These are controlled from the application and should be off or flashing, but they are constantly on (the two I/O pins are constantly low).
- There is a debug LED that should be flashing, but the pin is constantly low.
- A temperature sensor is attached to ADC6. The application is returning ADC values suggesting that there is ~35V on that pin (there isn't!)
All these features have been tested and work correctly in normal operation (i.e. almost all power-on-resets result in a successful start and sensible behaviour of the application and the hardware).
When this condition occurs, the application code is still running. I can send/receive commands/data to the application, but if that command/data ends up touching an I/O pin, it won't change the state of that pin, or read the pin correctly.
The sections of code that interface with I/O pins are certainly running - or at least, lines of code before/after those lines are executed (for instance, logging commands placed immediately before/after will result in output).
Clearly, the I/O ports are still working to some extent, because both USARTs are sending/receiving data.
I have only seen this behaviour at POR, not when reset from external reset or the watchdog.
- ATMEGA328PB, 16MHz crystal
- Fuses (Low, High, Extended): 0xCE, 0xDE, 0xFF
- Power supply: input is 48VDC, 5V through linear regulator (<-- if you're wondering, yes this gets hot! It's a D2PAK package. Other design constraints meant we couldn't use a switching supply. The MCU is far enough from the regulator that it doesn't get very warm, and the error can occur when the board has not had time to warm up.)
- I am running a bootloader, which is optiboot modified to use RS485
- The application uses 61.6% of program memory, 58.1% of data.
- The watchdog is enabled with a 500ms timeout.
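To make the fuse bytes listed above easier to check against the datasheet, here is a minimal sketch that splits them into their fields. It assumes the bit layout of the ATmega328P-family fuse tables (low: CKDIV8, CKOUT, SUT1:0, CKSEL3:0; high: WDTON at bit 4, BOOTSZ1:0, BOOTRST; extended: BODLEVEL2:0). `fuse_fields_t` and `decode_fuses` are hypothetical names for illustration, not part of avr-libc, and the code is meant to run on a PC rather than on the target.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper type for illustration; not part of avr-libc.
 * A fuse bit is "programmed" when it reads 0. */
typedef struct {
    bool    ckdiv8;    /* low fuse bit 7 programmed?  */
    bool    ckout;     /* low fuse bit 6 programmed?  */
    uint8_t sut;       /* low fuse bits 5:4 (raw)     */
    uint8_t cksel;     /* low fuse bits 3:0 (raw)     */
    bool    wdton;     /* high fuse bit 4 programmed? */
    uint8_t bootsz;    /* high fuse bits 2:1 (raw)    */
    bool    bootrst;   /* high fuse bit 0 programmed? */
    uint8_t bodlevel;  /* extended fuse bits 2:0 (raw)*/
} fuse_fields_t;

/* Split the three fuse bytes into fields. The bit positions are an
 * assumption taken from the ATmega328P-family fuse tables. */
fuse_fields_t decode_fuses(uint8_t low, uint8_t high, uint8_t ext)
{
    fuse_fields_t f;
    f.ckdiv8   = (low & 0x80u) == 0;
    f.ckout    = (low & 0x40u) == 0;
    f.sut      = (uint8_t)((low >> 4) & 0x03u);
    f.cksel    = (uint8_t)(low & 0x0Fu);
    f.wdton    = (high & 0x10u) == 0;
    f.bootsz   = (uint8_t)((high >> 1) & 0x03u);
    f.bootrst  = (high & 0x01u) == 0;
    f.bodlevel = (uint8_t)(ext & 0x07u);
    return f;
}
```

Calling `decode_fuses(0xCE, 0xDE, 0xFF)` yields SUT = 0b00, CKSEL = 0b1110, BOOTSZ = 0b11, BOOTRST programmed and BODLEVEL = 0b111. The meaning of each code (start-up time, clock source, brown-out level) still has to be looked up in the datasheet tables; as far as I can tell, BODLEVEL = 0b111 corresponds to brown-out detection being disabled.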
Some additional background which I don't think is particularly relevant, but included for completeness:
- the application takes MODBUS commands over an RS485 bus and controls some LEDs using PWM from the 16-bit timers.
- The second USART is used as a logging output
- I am using the output compare on TimerB, with the workarounds described here: https://www.avrfreaks.net/comment/1717946#comment-1717946
I'm truly baffled by this. Are there conditions in which the I/O hardware might enter a state where it becomes unresponsive, unreliable or otherwise deviates from normal operation?
Since it can take a long time to produce the issue, I'm wary of just testing around it blindly, so I thought I would post early and see if anyone has any advice on the best approach to this.
Right now I'm focusing on producing software builds with additional diagnostics in the hope of getting more information about the MCU state.
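One diagnostic I have in mind is logging the reset cause on every start, since the fault has only shown up after a power-on reset. Below is a minimal sketch of the decoding step only, written so it can be tested on a PC. It assumes the MCUSR bit positions used on the ATmega328P family (PORF = bit 0, EXTRF = bit 1, BORF = bit 2, WDRF = bit 3); `describe_reset` is a hypothetical helper name. On the target, the value would have to be captured before anything clears MCUSR, and I still need to check what my modified optiboot does with that register before it jumps to the application.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Write the names of the reset flags set in `mcusr` into `out`,
 * separated by '|'. Returns the number of characters written,
 * not counting the terminator. Bit positions are an assumption
 * based on the ATmega328P-family MCUSR register. */
size_t describe_reset(uint8_t mcusr, char *out, size_t out_len)
{
    static const struct { uint8_t mask; const char *name; } flags[] = {
        { 1u << 0, "PORF"  },  /* power-on reset  */
        { 1u << 1, "EXTRF" },  /* external reset  */
        { 1u << 2, "BORF"  },  /* brown-out reset */
        { 1u << 3, "WDRF"  },  /* watchdog reset  */
    };
    size_t used = 0;

    if (out_len == 0) {
        return 0;
    }
    out[0] = '\0';

    for (size_t i = 0; i < sizeof flags / sizeof flags[0]; i++) {
        if ((mcusr & flags[i].mask) == 0) {
            continue;
        }
        size_t name_len = strlen(flags[i].name);
        size_t need = name_len + (used ? 1 : 0);
        if (used + need >= out_len) {
            break;  /* keep room for the terminator */
        }
        if (used) {
            out[used++] = '|';
        }
        memcpy(out + used, flags[i].name, name_len);
        used += name_len;
        out[used] = '\0';
    }
    return used;
}
```

The resulting string could go out on the logging USART at start-up, together with a dump of the DDRx/PORTx/PINx registers, so that a bad start can be compared against a good one.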
Any help, information or advice is (as always) very much appreciated. I hope I've been clear enough about what's happening. Apologies for any omissions, errors or spelling/grammar mistakes. This is causing a bit of stress.
Thanks in advance,
Edit: added clarifying sentence about USARTs working
Edit 2: added extra sentence about reset behaviour