SAMD21J18A fails after weeks

#1

Hi,
I am working on a project with ATSAMD21J18A which is in a beta testing state.
Everything worked fine for about four weeks, but then there was a very strange failure resulting in a “broken” chip. Since more units have to be built and shipped soon, I’m getting a bit nervous and I’m hoping that someone can help me with this.

I found that the DFLL fails to initialize properly on the “broken” MCU in SYSTEM_CLOCK_DFLL_LOOP_MODE_CLOSED.

In production code, DFLL is configured to 48MHz with an external XOSC32K crystal. For fault finding I also configured it with the internal OSC32K and OSC8M without any improvement.

I have found two errata that might be related to my issue:
For erratum 10669 I have found two different workarounds, and I do not know how to implement the newer one from 2019.

 

Data sheet from 2016:
1 – The DFLL clock must be requested before being configured
otherwise a write access to a DFLL register can freeze the device.
Errata reference: 9905
Fix/Workaround:
Write a zero to the DFLL ONDEMAND bit in the DFLLCTRL register before
configuring the DFLL module.

2 – If the DFLL48M reaches the maximum or minimum COARSE or FINE
calibration values during the locking sequence, an out of bounds
interrupt will be generated. These interrupts will be generated even if
the final calibration values at DFLL48M lock are not at maximum or
minimum, and might therefore be false out of bounds interrupts.
Errata reference: 10669
Fix/Workaround:
Check that the lockbits: DFLLLCKC and DFLLLCKF in the SYSCTRL
Interrupt Flag Status and Clear register (INTFLAG) are both set before
enabling the DFLLOOB interrupt

 

Microchip errata document from 2019:
1.2.2 False Out of Bound Interrupt
If the DFLL48M reaches the maximum or minimum COARSE or FINE calibration values during the locking sequence, an out of bounds interrupt will be generated. These interrupts will be generated even if the final calibration values at DFLL48M lock are not at maximum or minimum, and therefore, may be false out of bounds interrupts.
Workaround
Enable the DFLL Out Of Bounds (DFLLOOB) interrupt when configuring the DFLL in closed loop mode. In the DFLLOOB ISR verify the COARSE and FINE calibration bits and process as needed.
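One possible reading of "verify the COARSE and FINE calibration bits" is to check, once the interrupt fires, whether either field is actually sitting at its minimum or maximum. The following is only a minimal sketch under stated assumptions: the helper name `dfll_oob_is_genuine` is hypothetical, and the field layout (FINE in bits 9:0, COARSE in bits 15:10 of DFLLVAL) is taken from memory of the SAM D21 register description and should be checked against the datasheet.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed DFLLVAL layout (verify against the datasheet):
 * FINE = bits 9:0, COARSE = bits 15:10. */
#define DFLLVAL_FINE_MASK    0x3FFu
#define DFLLVAL_COARSE_SHIFT 10u
#define DFLLVAL_COARSE_MASK  0x3Fu

/* Hypothetical helper: returns true when COARSE or FINE sits at its
 * minimum or maximum, i.e. the out-of-bounds condition is real and
 * not just a transient from the locking sequence. */
static bool dfll_oob_is_genuine(uint32_t dfllval)
{
    uint32_t fine   = dfllval & DFLLVAL_FINE_MASK;
    uint32_t coarse = (dfllval >> DFLLVAL_COARSE_SHIFT) & DFLLVAL_COARSE_MASK;

    return fine == 0u || fine == DFLLVAL_FINE_MASK ||
           coarse == 0u || coarse == DFLLVAL_COARSE_MASK;
}
```

On the target, the value would presumably come from `SYSCTRL->DFLLVAL.reg` inside `SYSCTRL_Handler()` after the DFLLOOB flag has been seen. In closed-loop mode the datasheet describes a read-request step before DFLLVAL can be read, so that section is worth checking. A `false` result would mean the interrupt can be treated as one of the false ones the erratum describes.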

 

System behaviour:
To track down the failure, I’m using the ASF USB bootloader with a character LCD instead of USART. The LCD requires delay_XX(). On the “broken” MCU, delay_XX() does not work properly with the standard clock setup from the ASF project, so the connected character LCD is either not initialized properly or takes very long to initialize. Toggling a pin with delay_XX() does not work as expected, and the USB communication does not work either.
With a “fresh” MCU everything is working as expected.

 

When I am using Workaround 1 from 2016 by writing

SYSCTRL->DFLLCTRL.bit.ONDEMAND = false;

before system_init(), or when using DFLL in SYSTEM_CLOCK_DFLL_LOOP_MODE_CLOSED in combination with “systick” delay instead of “cycle”, the LCD is initialized properly. The USB communication is still not working. (“systick” delay is initialized with the current clock frequency, while “cycle” does not require initialization.)

 

Does anyone understand this behaviour, and/or can someone explain to me how to implement workaround 1.2.2 (2019: "In the DFLLOOB ISR verify the COARSE and FINE calibration bits and process as needed")?

Is there anything else that can cause the described behaviour?

Do you need more information/code?

 

Thanks
Tobi

Last Edited: Thu. Mar 12, 2020 - 01:24 PM
#2

When I try to use workaround 2 from 2016 by adding

SYSCTRL->INTFLAG.reg |= (SYSCTRL_INTFLAG_DFLLLCKF | SYSCTRL_INTFLAG_DFLLLCKC);

before

system_clock_source_dfll_set_config(&dfll_conf);

in clock.c

 

nothing changes.
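A note on the line above, as far as I understand the register description: the SYSCTRL INTFLAG bits are cleared by writing a one, so `INTFLAG.reg |= ...` clears the lock flags (and any other flag that happens to be set) rather than checking them. The 2016 workaround asks for a check. A minimal sketch of such a check, assuming the bit positions below (DFLLOOB = bit 5, DFLLLCKF = bit 6, DFLLLCKC = bit 7, to be confirmed in the device header) and a hypothetical helper name:

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit positions in SYSCTRL INTFLAG / PCLKSR (check the device header). */
#define FLAG_DFLLOOB  (1u << 5)
#define FLAG_DFLLLCKF (1u << 6)
#define FLAG_DFLLLCKC (1u << 7)

/* Hypothetical helper: true only when both the coarse and the fine
 * lock flag are set in the given flag word. */
static bool dfll_lock_complete(uint32_t flags)
{
    const uint32_t both = FLAG_DFLLLCKF | FLAG_DFLLLCKC;
    return (flags & both) == both;
}
```

On the target this would be polled with the value read from `SYSCTRL->INTFLAG.reg` after the DFLL has been enabled. Only once it returns true would the stale DFLLOOB flag be cleared (by writing just that bit, not with `|=`) and the DFLLOOB interrupt enabled.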

#3

‘Does not work as expected’ - what was your expectation and what was the observed result?

You seem to suggest that the actual clock frequency isn’t what it should be and this occurs after 4 weeks of operation. What happens when the power is cycled? Does it take another 4 weeks to fail?

#4

Hi Kartman,

 

Thank you for your reply.

 

Kartman wrote:
What happens when the power is cycled? Does it take another 4 weeks to fail?

Something in the "broken" MCU has permanently changed. The malfunction appears at every power cycle.

 

Kartman wrote:
‘Does not work as expected’ - what was your expectation and what was the observed result?

 

I spent some more time on this and found that with the "cycle" delay, delay_ms() works for delays of up to 5 ms. With a delay of e.g. 10 ms, the program gets stuck.

When debugging with ICE the delay gets stuck in

/**
 * \brief Default interrupt handler for unused IRQs.
 */
void Dummy_Handler(void)
{
        while (1) {
        }
}

in file startup_samd21.c

 

This only happens on the "broken" MCU, even with this very simple code
(GCLK generator 0 configured to 48 MHz DFLL closed loop, delay as "cycle"):

int main (void)
{
	system_init();
	struct port_config pin_conf;
	port_get_config_defaults(&pin_conf);
	pin_conf.direction  = PORT_PIN_DIR_OUTPUT;
	port_pin_set_config(PIN_PB09, &pin_conf);
	while(1)
	{
		port_pin_toggle_output_level(PIN_PB09);
		delay_ms(10);
	}
}

On a "fresh" MCU everything works fine for both debug and release.

 

The dummy ISR handler is also reached during debug session in some cases (not always) when configuring a TC or TCC (in other code of course). This only happens on the broken MCU.

 

I have verified that the clock frequencies are fine (measured at an output pin).

So it does not actually seem to be a DFLL related problem anymore.
Is it better to open a new thread then?

 

Thanks

Tobi

 

#5

Hi again,

 

Can someone tell me how to find out the reason for the "unhandled IRQ" which is handled by the dummy handler?

/**
 * \brief Default interrupt handler for unused IRQs.
 */
void Dummy_Handler(void)
{
        while (1) {
        }
}

Just to make this clear, for me it is not important to get this broken chip back up running. But it is quite important to me to understand the reason for the failure to check if it can be avoided in the future.

 

Thanks

Tobi

#6

Have you checked for a hardware problem? 

 

As for finding what caused the exception - read the NVIC registers. You can also look at the stack to see where in the code the interrupt/exception occurred. Go over to arm.com and read the processor docs.
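To illustrate the "look at the stack" part: on exception entry a Cortex-M core pushes eight words (r0, r1, r2, r3, r12, lr, pc, xPSR, in that order from the lowest address), and the stacked pc points at the code that was executing when the exception hit. A minimal sketch of reading that frame, with a hypothetical struct and helper name:

```c
#include <stdint.h>

/* The eight words a Cortex-M core stacks on exception entry,
 * in ascending address order. */
typedef struct {
    uint32_t r0, r1, r2, r3, r12, lr, pc, xpsr;
} exception_frame_t;

/* Hypothetical helper: copies the stacked registers out of the frame
 * that starts at sp (the stack pointer value at exception entry). */
static exception_frame_t decode_exception_frame(const uint32_t *sp)
{
    exception_frame_t f = {
        sp[0], sp[1], sp[2], sp[3], /* r0..r3 */
        sp[4],                      /* r12 */
        sp[5],                      /* lr: return address of the interrupted function */
        sp[6],                      /* pc: address where the exception occurred */
        sp[7]                       /* xPSR */
    };
    return f;
}
```

The pointer has to be captured right at handler entry, before the compiler's prologue moves the stack pointer, which is usually done with a small assembly stub that is left out here. The stacked pc can then be looked up in the map or disassembly file.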

 

#7

Hi Kartman,

 

I have checked for a hardware problem as far as I can. I have not moved the MCU to another PCB yet.
I would guess that it is a chip-internal hardware problem, since the identical code, build, and fuse settings no longer work on the broken MCU but do work on other MCUs.
I checked the firmware for frequent NVM writes/commits. (-->OK)
In the current firmware's NVM handling (EEPROM emulator), a page is committed after every write. I will change this to a commit in the BOD33 ISR.

 

I also checked:
-Supply voltage is fine at 3V3 (Checked also during operation with oscilloscope)
-XOSC32, DFLL, OSC8M frequencies and stability are fine

 

Some pins are connected to external sensors via cables without special protection. (I have built around 2500 units with ATXMEGAs with similar HW layout without any weird/EMC/ESD MCU-failures.)

 

Can you suggest any additional HW checks and/or reasons for the behaviour?
I will check if I can gain any information by reading the NVIC registers in Dummy_Handler().

 

Thanks
Tobi

Last Edited: Thu. Mar 12, 2020 - 01:16 PM
#8

Tobi321 wrote:

Hi again,

 

Can someone tell me how to find out the reason for the "unhandled IRQ" which is handled by the dummy handler?

/**
 * \brief Default interrupt handler for unused IRQs.
 */
void Dummy_Handler(void)
{
        while (1) {
        }
}

Just to make this clear, for me it is not important to get this broken chip back up running. But it is quite important to me to understand the reason for the failure to check if it can be avoided in the future.

 

Thanks

Tobi

 

If you look at the generated file startup_samxxx.c, where xxx is the CPU type (mine is startup_saml22.c for my SAML22 CPU), you will see the interrupt vector table.  The table is filled out with the name of each of the interrupts, and each uses "Dummy_Handler" as the default.  The function assignment is "weak", which means that if you write a function with the same name, it will use your function rather than the default.

 

So to find out how the program ended up in Dummy_Handler, you need to replace all the functions in the vector table with your own code.  For example:

 

void  NVMCTRL_Handler(void) { printf("NVMCTRL"); }

 

will override the Dummy_Handler for NVMCTRL with a function that will print the IRQ source in the terminal.
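An alternative to overriding every handler is to let a single catch-all handler work out which exception is active. The low bits of the IPSR register hold the active exception number (3 is HardFault, 16 and up are the external IRQs), and subtracting 16 gives the value used by the CMSIS `IRQn_Type` enum. A minimal sketch of that conversion, with a hypothetical helper name:

```c
#include <stdint.h>

/* Hypothetical helper: converts an IPSR value to a CMSIS-style IRQ number.
 * Exception numbers 1..15 are core exceptions and come out negative
 * (HardFault = 3 gives -13); 16 and up are external IRQs 0, 1, 2, ... */
static int ipsr_to_irq_number(uint32_t ipsr)
{
    /* Only the low six bits hold the exception number on Cortex-M0+. */
    return (int)(ipsr & 0x3Fu) - 16;
}
```

On the target, the input would come from the CMSIS intrinsic `__get_IPSR()` called inside Dummy_Handler, and the result can be compared against the `IRQn_Type` enum in the device header.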

John Malaugh

#9

Hi John,

 

Thank you so much for this detailed and helpful answer!
I'm a bit embarrassed that I did not find out myself. ;)

 

The handler that gets called is HardFault_Handler(). It is reached from delay_us(20);

 

I will now go back to class with my Cortex-M0 book to gain more information and to understand the stack. Obviously I’m not very experienced with structured debugging. ;)

If there are any further explanations or suggestions, they are highly appreciated!

 

Thanks
Tobi

Last Edited: Thu. Mar 12, 2020 - 05:03 PM