Aaargh debugging!


Slowly pulling out my hair here... posted in General Programming because the processor is an ARM STM32L073.

 

I have a widget which talks on two serial ports: on one to a PC via a built-in ACM serial USB line, on the other to a mobile phone via a Bluetooth link. Neither has any flow control, but the link to the PC has a fairly horrible protocol: it's emulating a single bidirectional line.

  • every outgoing character from the widget is immediately reflected by the PC and discarded by the widget
  • ignoring that echo, a byte actually sent from one end to the other is reflected with its inverse
  • data is sent a packet at a time, with a packet length, packet type, packet count, packet payload (max twelve bytes) and an 0x03 EOTX to finish
  • after initial setup, widget sends a request, computer responds. One packet to computer, one packet from computer.
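Judging from the packet dumps further down this post, the bytes on the wire appear to run: length (counting every byte after the length byte itself), packet count, packet type, payload, then the 0x03 terminator. That ordering is inferred from the dumps and not confirmed, and the names below are hypothetical. Under those assumptions, a minimal sanity check on a received frame might look like this:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FRAME_END    0x03u  /* terminator byte, as described in the post */
#define MAX_PAYLOAD  12u    /* "max twelve bytes" */

/* Hypothetical helper: returns true if buf[0..n-1] looks like one complete
 * frame. Assumed layout (inferred from the dumps): length, count, type,
 * payload..., 0x03, where length counts all bytes after the length byte. */
bool frame_is_well_formed(const uint8_t *buf, size_t n)
{
    if (buf == NULL || n < 4u) {
        return false;               /* need length + count + type + end */
    }
    uint8_t len = buf[0];
    if ((size_t)len + 1u != n) {
        return false;               /* length byte disagrees with byte count */
    }
    if (len < 3u || (uint8_t)(len - 3u) > MAX_PAYLOAD) {
        return false;               /* payload size out of range */
    }
    return buf[n - 1u] == FRAME_END;
}
```

A check like this at the end of the receive routine gives one place to count or log rejected frames, which may help separate "frame never arrived" from "frame arrived but was thrown away".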

 

I have debug printfs to the Bluetooth terminal at various handy places which can be turned on and off at build time; I have permanent packet display and decoding on the PC.

 

tick 37542
rx ram request: 06 23 01 07 20 10 03 
tx ram reply: 0a 24 fe c9 c8 00 00 00 00 80 03
rx ram request: 06 24 01 0a 20 00 03 
tx ram reply: 0d 25 fe 5f ec b9 62 80 64 00 00 19 af 03
tick 37544
rx ram request: 06 25 01 07 20 10 03 
tx ram reply: 0a 26 fe c9 c8 00 00 00 00 80 03
tick 37545
rx error request: 03 26 07 03 
tx error reply: 0d 27 fc 01 00 00 00 09 1c 00 00 00 09 03
rx error request: 03 27 07 03 
tx error reply: 0d 28 fc 00 00 00 00 00 00 00 00 00 00 03
rx error request: 03 28 07 03 
tx error reply: 08 29 fc 00 00 00 00 00 03
tick 37546
rx error request: 03 29 07 03 
tx ack: 03 2a 09 03
rx ram request: 06 2a 01 0a 20 00 03 
tx ram reply: 0d 2b fe 97 ec bc 62 80 64 00 00 1c b8 03

Serial data from either of the ports triggers an interrupt which buffers the data; that buffer is polled to check for/receive any data.
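For reference, a minimal sketch of that "interrupt fills, main loop polls" arrangement, assuming exactly one producer (the ISR) and one consumer (the polling code). All names and the buffer size are hypothetical; this is not the code in the widget. The idea is that the head index is written only by the ISR and the tail index only by the poller, so neither side does a read-modify-write on a variable the other side writes.

```c
#include <stdbool.h>
#include <stdint.h>

#define RX_BUF_SIZE 64u   /* hypothetical size; any value >= 2 works */

static volatile uint8_t  rx_buf[RX_BUF_SIZE];
static volatile uint16_t rx_head;  /* written only by the ISR */
static volatile uint16_t rx_tail;  /* written only by the main loop */

/* Call from the receive interrupt with the byte just read from the UART.
 * Returns false (and drops the byte) if the buffer is full. */
bool rx_isr_push(uint8_t byte)
{
    uint16_t next = (uint16_t)((rx_head + 1u) % RX_BUF_SIZE);
    if (next == rx_tail) {
        return false;              /* full: reject rather than overwrite */
    }
    rx_buf[rx_head] = byte;
    rx_head = next;                /* publish only after the data is stored */
    return true;
}

/* Call from the main loop. Returns true and stores one byte in *out
 * if data is available, false if the buffer is empty. */
bool rx_poll(uint8_t *out)
{
    if (rx_tail == rx_head) {
        return false;
    }
    *out = rx_buf[rx_tail];
    rx_tail = (uint16_t)((rx_tail + 1u) % RX_BUF_SIZE);
    return true;
}
```

This relies on aligned 16-bit loads and stores being single accesses, which should hold on a single-core Cortex-M0+. Counting the `false` returns from `rx_isr_push` is a cheap way to find out whether bytes are ever being dropped.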

 

The tick count on the list above is a seconds (ish) count from the laptop. As you can see, it's run over ten hours... that's because the widget end is displaying every character received or sent (including the discarded echoes). If I stop displaying every character it will run for some indeterminate time - minutes to a couple of hours and then lock up. The fail condition looks as if the PC has sent a reply packet - either ram or error - but the widget doesn't see it.

 

I'm obviously suspicious of the interrupts on the serial port. One thought in particular was that perhaps the USB packeting was sending a data squirt of exactly the same size as the buffer, but increasing the buffer size and making it not a power of two made no apparent change in behaviour. One curious thing is that timeouts in the receive character routine don't time out, almost as if that routine simply isn't running.
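One thing worth ruling out when a timeout never fires is the timeout arithmetic itself. The sketch below assumes a free-running millisecond counter bumped by a timer interrupt; the names are hypothetical and nothing here is taken from the actual code. It shows a comparison that stays correct when the counter wraps, and a counter declared so that the compiler cannot cache it in a register inside a wait loop.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical millisecond counter, incremented by a timer interrupt.
 * It must be volatile, or the compiler may read it once and spin forever. */
static volatile uint32_t ms_ticks;

void tick_isr(void) { ms_ticks++; }

/* True once at least timeout_ms have elapsed since start_ms.
 * Unsigned subtraction keeps this correct across counter wrap-around. */
bool timed_out(uint32_t start_ms, uint32_t now_ms, uint32_t timeout_ms)
{
    return (uint32_t)(now_ms - start_ms) >= timeout_ms;
}
```

If the arithmetic is sound and the timeout still never fires, the remaining suspects are that the tick interrupt has stopped running or that the code is stuck somewhere other than the receive routine.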

 

Meh, it's doing my head in. I'm not really expecting a solution - apart from anything else, the code is way too big to post - but it'd be nice to hear if I've missed something obvious :)

 

/rant

 

Neil


minutes to a couple of hours and then lock up.

Who is locking up? The STM? Or the PC or phone? If it is the STM, is it completely locked up or crashed? For instance, if the STM is also blinking an LED, does that stop as well?

 

Are all of your buffer pointers well-behaved in terms of wrap-around handling? What happens during an attempted buffer overfill: rejection? crash? overwrite? delay?

Could you get into a situation where each is waiting for the other (deadlock)? A is waiting for B to respond before proceeding, but B is waiting for A to respond before proceeding.

When in the dark remember-the future looks brighter than ever.   I look forward to being able to predict the future!


It's the STM that's locking up, but I don't know how far. It does it so irregularly it's difficult to check... when it stops talking, the widget has asked for a packet, the PC has supplied it, but the widget doesn't think it's got it.

 

But I just realised, if the PC is logging that it's gone, then it *is* still talking, and the widget is sending the inverted echoes back... so something is perhaps not happy with the logic in the widget after it completes the reception. Will investigate further...

 

(Because of the every-byte handshake, both ends have to be working to transfer a frame.) The frame is displayed as the last thing before the return on the txframe and rxframe routines.

 

I do have a happy light, but at the moment it's set to toggle every time there's a serial interrupt.

 

Neil


barnacle wrote:
If I stop displaying every character it will run for some indeterminate time - minutes to a couple of hours and then lock up.
stack overflow, buffer overflow, off-by-1 that initiates an infinite loop, <many>

Arm Cortex can detect stack overflow.

That MCU has an MPU (detect region clobbering)

barnacle wrote:
... the code is way too big to post - but it'd be nice to hear if I've missed something obvious :)
Try several linters on that code; sometimes the defect will be indirectly indicated by a linter message (C has ambiguities that may or may not be an issue, depending on the compiler).

Sprinkle the code with assertions; you'll find the defect.
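As a rough illustration of what such an assertion could look like in firmware, here is a minimal sketch that records where the check failed instead of calling the hosted `assert`. The macro, the hook and the example function are all hypothetical names, and what the hook should do after recording (halt, breakpoint, reset) is left as a choice for the target.

```c
#include <stdint.h>

/* Last failure, kept in RAM so a debugger (or a debug printf) can read it. */
static const char *volatile fw_assert_file;
static volatile uint32_t     fw_assert_line;
static volatile uint32_t     fw_assert_count;

/* Hypothetical failure hook. On the target you would typically also stop
 * here (breakpoint or endless loop); this sketch only records the location. */
static void fw_assert_failed(const char *file, uint32_t line)
{
    fw_assert_file = file;
    fw_assert_line = line;
    fw_assert_count++;
}

#define FW_ASSERT(cond) \
    do { if (!(cond)) { fw_assert_failed(__FILE__, (uint32_t)__LINE__); } } while (0)

/* Example use: guard a buffer index before it is used. */
uint8_t checked_read(const uint8_t *buf, uint32_t size, uint32_t index)
{
    FW_ASSERT(index < size);
    return (index < size) ? buf[index] : 0u;
}
```

Good places for checks like this are buffer indices, packet lengths and state-machine states, since a violation there points straight at the code that broke the invariant.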

 


STM32L073RZ - Ultra-low-power Arm Cortex-M0+ MCU with 192 Kbytes of Flash memory, 32 MHz CPU, USB, LCD - STMicroelectronics

New PIC24F MCUs Feature Low-power Animated Display Driver for Battery-powered Devices | Microchip Technology | Microchip Technology

Are We Shooting Ourselves in the Foot with Stack Overflow? « State Space

[1/4 page]

Designing an Exception Handler for Stack Overflow

 

Adding Automatic Debugging to Firmware for Embedded Systems

 

edit :

NASA JPL recommends assertions:

codeql/UseOfAssertionsDensity.ql at main · github/codeql · GitHub

 

"Dare to be naïve." - Buckminster Fuller

Last Edited: Wed. Apr 7, 2021 - 08:24 PM

barnacle wrote:
It's the STM that's locking up

So, once it has locked up, what do you see in the debugger?

 

Is it in a Hard Fault, or something?

 

Top Tips:

  1. How to properly post source code - see: https://www.avrfreaks.net/comment... - also how to properly include images/pictures
  2. "Garbage" characters on a serial terminal are (almost?) invariably due to wrong baud rate - see: https://learn.sparkfun.com/tutorials/serial-communication
  3. Wrong baud rate is usually due to not running at the speed you thought; check by blinking a LED to see if you get the speed you expected
  4. Difference between a crystal, and a crystal oscillator: https://www.avrfreaks.net/comment...
  5. When your question is resolved, mark the solution: https://www.avrfreaks.net/comment...
  6. Beginner's "Getting Started" tips: https://www.avrfreaks.net/comment...

I'm sure that the people at ST would be able to sort this out pretty quickly.

John Samperi

Ampertronics Pty. Ltd.

https://www.ampertronics.com.au

* Electronic Design * Custom Products * Contract Assembly


Could it be that you used the same buffer for both receive sides?

So you receive a word from the PC and directly after that one from the widget, and the widget's data overwrites the PC's data in the buffer. The PC has sent the data and technically the ARM has received it, but before it could be processed it was overwritten, so there is no valid data left in the buffer for the PC-side decoding...


No, they're separate buffers: they're only used for incoming data. Outgoing data is blocking, except for interrupts. I do think though that it's a timing thing... a race condition perhaps.

 

But this morning I managed to catch a glitch: the response message from the PC was not correctly treated somewhere in the chain. So if the PC flagged it as an error, the widget probably did too. Though it should have reset in that case... Investigations continue!

 

Neil


"Dare to be naïve." - Buckminster Fuller


Indeed... but the question is: is it still multithreaded if the only thread change is via an external interrupt?

 

Neil


Yes

CON43-C. Do not allow data races in multithreaded code - SEI CERT C Coding Standard - Confluence

[mid-page]

Noncompliant Code Example (Volatile)

The data race can be disabled by declaring the data to be volatile, because the volatile keyword forces the compiler to not produce two reads of the data. ...

volatile is commonly recommended

[TUT] Newbie's Guide to AVR Interrupts | AVR Freaks

Part 5: Things you Need to Know

 

"Dare to be naïve." - Buckminster Fuller


Well I've finally managed to catch a trace:

 

 

That last transfer should be [0a] (f5) [f5]... and in similar fashion to the line above, though not as many entries. But certainly not just one response...
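The expected values follow from the "reflected with its inverse" rule in the first post: the bitwise inverse of 0x0a is 0xf5. A minimal sketch of that check, with hypothetical function names:

```c
#include <stdbool.h>
#include <stdint.h>

/* Bitwise inverse of a byte, e.g. 0x0a -> 0xf5. */
static inline uint8_t inverted(uint8_t b)
{
    return (uint8_t)~b;
}

/* Hypothetical check: does the reflected byte acknowledge the one sent? */
bool reflection_ok(uint8_t sent, uint8_t reflected)
{
    return reflected == inverted(sent);
}
```

The cast matters in C: `~b` is computed as an `int`, so comparing it with a `uint8_t` without casting back would never match.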

 

The widget has returned the wrong value, and that's upset the PC emulator, and that's stopped sending data after one byte, which stops the widget mid-receive. Now to find out why :) Emulator behaviour is as expected.

 

Neil (and yes, volatiles are possible.)


barnacle wrote:
is it still multithreaded if the only thread change is via an external interrupt?

Whether or not you label it "multithreaded", the issues that can arise are very similar (if not the same?)


I would look at any vars that are modified in an ISR and make sure they are interrupt-protected if accessed by other code. For example, a buffer may have some vars to keep track of its state, like head/tail/count, where head and tail are each only accessed by a single piece of code and count is accessed by more than one (at least one of them being an ISR). In that case, count would have to be interrupt-protected when non-ISR code modifies the value. With nested interrupts, you would also have to consider whether another ISR of a higher priority also accesses the same var. A 32-bit MCU means your atomicity problems are a little smaller than on an 8-bit MCU, but the same problems still exist, and you get to make up for the difference with the addition of nested interrupts/priorities.
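A minimal sketch of protecting such a shared count, under the assumption that the main loop decrements it and an ISR increments it. The `irq_save_disable` and `irq_restore` helpers are hypothetical stand-ins so the example builds on a PC; on a Cortex-M target with CMSIS they would typically be thin wrappers around `__get_PRIMASK()`, `__disable_irq()` and `__set_PRIMASK()`.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-ins so the sketch builds on a PC. On a Cortex-M target with CMSIS,
 * these would typically wrap __get_PRIMASK(), __disable_irq() and
 * __set_PRIMASK(); that mapping is an assumption about your toolchain. */
static volatile uint32_t fake_primask;          /* 1 = interrupts masked */

static uint32_t irq_save_disable(void)
{
    uint32_t previous = fake_primask;
    fake_primask = 1u;
    return previous;
}

static void irq_restore(uint32_t previous)
{
    fake_primask = previous;
}

static volatile uint16_t rx_count;   /* shared: ISR increments, main decrements */

/* ISR side: runs with the main loop already interrupted, so a plain
 * increment is fine unless a higher-priority ISR also touches rx_count. */
void rx_count_isr_increment(void)
{
    rx_count++;
}

/* Main-loop side: the read-modify-write must not be interrupted. */
bool rx_count_take_one(void)
{
    bool taken = false;
    uint32_t saved = irq_save_disable();
    if (rx_count > 0u) {
        rx_count--;
        taken = true;
    }
    irq_restore(saved);              /* restore, do not blindly re-enable */
    return taken;
}
```

Saving and restoring the previous mask, instead of unconditionally re-enabling interrupts, keeps the helper safe to call from code that already has interrupts disabled.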

 

Maybe not a cause of this problem, and it doesn't really explain the 'lockup', but atomicity problems reveal themselves infrequently simply because of the probabilities they are up against. Change the code in some way and your probabilities may also change: if for the better, the problem may become so rare that you do not know it exists; if for the worse, you get something elusive but you know it exists.

 

Anyway, I would at least get settled in my mind that atomic things are squared away so you can check off that box.