Sorry for the "click-bait-ish" title. I'm having a series of random failures on a custom board carrying the ATxmega256A3U and I'm having a devil of a time tracking them down. I'd like some general advice on where to go next in troubleshooting. Here's the story (as concisely as possible):
- Custom PCB with ATxmega256A3U, XBee 900 Pro HP, I2C-to-UART bridge, RS232 transceiver chip and 3 switching regulators providing 3V3, 5V0 and 9V0 voltage levels.
- Primarily a data ingestion / fusion board. It reads serial data from several different off-board systems/sensors, fuses them into a single data packet and then transmits it via the XBee.
- XBee is set up in transparent mode and communicates over a 115200 baud UART. Data packets are small (a few hundred bytes) and have a low repetition rate (1 Hz).
- I have tested 6 boards against 3 different systems. The failures are not consistent from one occurrence to the next and don't track a specific system of sensors (i.e. I don't believe it's a malfunctioning sensor killing boards).
At first, everything works exactly as intended. After some amount of time (we haven't been able to nail down exactly when - sometimes it works continuously for hours, including multiple power cycles, sometimes failures occur after 1 or 2 power cycles!) various problems with the micro crop up. A few examples:
The first failure was on the pin driving the reset line for the XBee. I had the XBee reset wired to a tactile button for manual reset and also to PA4. I had PA4 configured as "WiredAND" to protect it in case someone pressed the reset button. No one did; in fact, it was installed in such a way that the button can't be pressed even by accident. After a few power cycles, the XBee got stuck in reset. I disconnected the reset button and determined that PA4 was latched low. When I re-flashed the micro, PA4 went high and the XBee worked... for about 10 minutes. Then it seemed like the pullup was disconnected, because I could watch the voltage on the pin slowly bleed down to an indeterminate level and then to 0 V. Here's the really strange part: I changed the PA4 configuration to a totem-pole drive and commanded it high. Everything worked and continued working for months before the board suffered a UART problem (see below).
The next failure was on a different board. This time, the problem was with the PORTD UART. The RX pin was idling low instead of high. Maybe "idling" isn't the right term - it was actually being pulled low by the micro. There is a 1K resistor in series with the UART pins, and the other side (the I2C-to-UART bridge) was driving the signal exactly as it should, but the full 3.3 V was being dropped across the series resistor. I replaced the board.
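For scale, the RX observation above works out to a few milliamps going into the pin. Here is a minimal sketch of the arithmetic in C, assuming the pin side of the resistor sits at roughly 0 V (I only measured that the full 3.3 V drops across the 1K resistor). The function name is made up for illustration.

```c
/* Current through a series resistor, in milliamps.
 *   v_drive: voltage on the driven side (the bridge output), in volts
 *   v_pin:   voltage measured at the micro's pin, in volts (assumed ~0 here)
 *   r_ohms:  series resistance, in ohms
 * Hypothetical helper for illustration, not from any library. */
double series_current_ma(double v_drive, double v_pin, double r_ohms)
{
    return (v_drive - v_pin) / r_ohms * 1000.0;
}
```

Calling `series_current_ma(3.3, 0.0, 1000.0)` gives about 3.3 mA, whereas a healthy CMOS input should draw essentially nothing.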
The board with the XBee problem eventually had a similar UART problem. This time it was the PORTC UART. The RX pin seems just fine and idles at 3V3. The TX pin also idles at 3V3 but doesn't transmit consistently. This one is really weird. After I removed the board from service, I took it to the bench to diagnose. I verified that it wasn't responding to queries over the wireless link and flashed a debug program that just transmits ASCII 'U' to the XBee. I saw the 'U' come through the wireless link sporadically. I put a scope on the TX pin and verified that sometimes it was transmitting and sometimes it would just idle high. I also noticed that a green LED connected to the power rail would occasionally "stumble" while the micro was transmitting 'U'. It was so infrequent that I wasn't able to capture anything on the power rail with my scope. This intermittent behavior continued for about 15 or 20 minutes as I debugged. Eventually, it stopped altogether. I am still able to program the micro, set fuses, etc. but it doesn't seem to be executing code. I have debug LEDs on PORTR and PORTC. Neither of those is responsive (i.e. I flash a simple program to light them and nothing happens, and yes the pins are properly configured for output, etc - the test program works on a new board just fine).
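For anyone wondering why 'U': it is 0x55, which comes out as a clean square wave on the wire and is easy to pick out on a scope. Here is a rough sketch in C of what the frame looks like, assuming 8N1 framing (an assumption for this sketch). The function names are invented for illustration.

```c
#include <stdint.h>

/* Write the line levels of one 8N1 UART frame into out as '0'/'1' chars:
 * start bit (0), eight data bits LSB first, stop bit (1).
 * out must have room for 10 characters plus the terminating NUL. */
void uart_frame_8n1(uint8_t byte, char out[11])
{
    out[0] = '0';                                   /* start bit */
    for (int i = 0; i < 8; i++)
        out[1 + i] = ((byte >> i) & 1u) ? '1' : '0'; /* data, LSB first */
    out[9] = '1';                                   /* stop bit */
    out[10] = '\0';
}

/* Duration of a single bit in microseconds for a given baud rate. */
double bit_time_us(double baud)
{
    return 1e6 / baud;
}
```

For 'U' the frame is `0101010101`, and at 115200 baud each bit lasts about 8.68 us, so back-to-back 'U' characters should look like a square wave of roughly 57.6 kHz on the TX pin. Any gap in that pattern on the scope is a character the micro failed to send.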
I ran the system with a scope on the 3V3 rail for about an hour to see if there were any spikes that could be damaging the chip, but saw nothing.
There have been other failures similar to the above, typically with a UART that just stops responding, followed by the micro crapping out and none of the pins responding.
A little background: I've designed and now produced 500+ boards using various ATxmega chips. This is the first time I've seen these problems. I try to follow best practices in PCB layout and keep my UART/SPI/I2C/PDI traces matched in impedance, even when running at fairly low baud rates. The only difference between this board and my other ones is that the decoupling caps for the micro are on the bottom of the board while the micro is on the top. I really don't think this has anything to do with it; I know some people get fussy about this, but I'm using a fairly large via and it's a 4-layer board with power and ground planes. The decoupling caps are right on top of the pins they feed, just on the reverse side.
Attached is a (somewhat messy) schematic of the micro connections. Not sure how useful it is...