Aaargh debugging!

Go To Last Post
28 posts / 0 new
Author
Message
#1
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Slowly pulling out my hair here... in general programming because the processor is an ARM STM32L073.

 

I have a widget which talks on two serial ports; on one to a PC via a built in ACM serial USB line; on the other to a mobile phone via a bluetooth link. Neither has any flow control, but the link to the PC has a fairly horrible protocol: it's emulating a single bidirectional line.

  • every outgoing character from the widget is immediately reflected by the PC and discarded by the widget
  • ignoring that echo, a byte actually sent from one end to the other is reflected with its inverse
  • data is sent a packet at a time, with a packet length, packet type, packet count, packet payload (max twelve bytes) and an 0x03 EOTX to finish
  • after initial setup, widget sends a request, computer responds. One packet to computer, one packet from computer.

 

I have debug printfs to the bluetooth terminal at various handy places which can be turned on and off at build time; I have permanent packet display and decoding on the PC

 

tick 37542
rx ram request: 06 23 01 07 20 10 03 
tx ram reply: 0a 24 fe c9 c8 00 00 00 00 80 03
rx ram request: 06 24 01 0a 20 00 03 
tx ram reply: 0d 25 fe 5f ec b9 62 80 64 00 00 19 af 03
tick 37544
rx ram request: 06 25 01 07 20 10 03 
tx ram reply: 0a 26 fe c9 c8 00 00 00 00 80 03
tick 37545
rx error request: 03 26 07 03 
tx error reply: 0d 27 fc 01 00 00 00 09 1c 00 00 00 09 03
rx error request: 03 27 07 03 
tx error reply: 0d 28 fc 00 00 00 00 00 00 00 00 00 00 03
rx error request: 03 28 07 03 
tx error reply: 08 29 fc 00 00 00 00 00 03
tick 37546
rx error request: 03 29 07 03 
tx ack: 03 2a 09 03
rx ram request: 06 2a 01 0a 20 00 03 
tx ram reply: 0d 2b fe 97 ec bc 62 80 64 00 00 1c b8 03

Serial data from either of the ports triggers an interrupt which buffers the data; that buffer is polled to check for/receive any data.

 

The tick count on the list above is a seconds (ish) count from the laptop. As you can see, it's run over ten hours... that's because the widget end is displaying every character received or sent (including the discarded echoes). If I stop displaying every character it will run for some indeterminate time - minutes to a couple of hours and then lock up. The fail condition looks as if the PC has sent a reply packet - either ram or error - but the widget doesn't see it.

 

I'm obviously suspicious of the interrupts on the serial port - one thought in particular was that perhaps the USB packeting was sending exactly the same size data squirt as the buffer, but increasing the buffer size and making it not be a power of two had no apparent change in behaviour. One curious thing is that timeouts in the receive character routine don't timeout - almost as if that routine simply isn't running.

 

Meh, it's doing my head in. I'm not really expecting a solution - apart from anything else, the code is way too big to post - but it'd be nice to hear if I've missed something obvious :)

 

/rant

 

Neil

This topic has a solution.
Last Edited: Wed. May 5, 2021 - 02:41 PM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

minutes to a couple of hours and then lock up.

Who is locking up?  The STM? Or the PC or PHONE?  If it is the STM, is it completely locked/crashed? ---so if an STM led is also blinking does that stop as well? 

 

Are all of your buffer pointers well-behaved in terms of wrap around handling?  what happens during an attempted buffer overfill---rejection?  crash?  overwrite? delay? 

Could you get in a situation where each is waiting for the other (deadlock)?  A is waiting for B to respond before proceeding, but B is waiting for A to respond before proceeding

When in the dark remember-the future looks brighter than ever.   I look forward to being able to predict the future!

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

It's the STM that's locking up, but I don't know how far. It does it so irregularly it's difficult to check... when it stops talking, the widget has asked for a packet, the PC has supplied it, but the widget doesn't think it's got it.

 

But I just realised, if the PC is logging that it's gone, then it *is* still talking, and the widget is sending the inverted echoes back... so something is perhaps not happy with the logic in the widget after it completes the reception. Will investigate further...

 

(Because of the every-byte handshake, both ends have to be working to transfer a frame.) The frame is displayed as the last thing before the return on the txframe and rxframe routines.

 

I do have a happy light, but at the moment it's set to toggle every time there's a serial interrupt.

 

Neil

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

barnacle wrote:
If I stop displaying every character it will run for some indeterminate time - minutes to a couple of hours and then lock up.
stack overflow, buffer overflow, off-by-1 that initiates an infinite loop, <many>

Arm Cortex can detect stack overflow.

That MCU has an MPU (detect region clobbering)

barnacle wrote:
... the code is way too big to post - but it'd be nice to hear if I've missed something obvious :)
Try a few or several linters on that code; sometimes the defect will be indirectly indicated by a linter message (C has ambiguities that may or may not be an issue dependent on the compiler)

Sprinkle the code with assertions; you'll find the defect.

 


STM32L073RZ - Ultra-low-power Arm Cortex-M0+ MCU with 192 Kbytes of Flash memory, 32 MHz CPU, USB, LCD - STMicroelectronics

New PIC24F MCUs Feature Low-power Animated Display Driver for Battery-powered Devices | Microchip Technology | Microchip Technology

Are We Shooting Ourselves in the Foot with Stack Overflow? « State Space

[1/4 page]

Designing an Exception Handler for Stack Overflow

 

Adding Automatic Debugging to Firmware for Embedded Systems

 

edit :

The ones at USA NASA JPL recommend assertions :

codeql/UseOfAssertionsDensity.ql at main · github/codeql · GitHub

 

"Dare to be naïve." - Buckminster Fuller

Last Edited: Wed. Apr 7, 2021 - 08:24 PM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

barnacle wrote:
It's the STM that's locking up

So, once it has locked up, what do you see in the debugger?

 

Is it in a Hard Fault, or something?

 

Top Tips:

  1. How to properly post source code - see: https://www.avrfreaks.net/comment... - also how to properly include images/pictures
  2. "Garbage" characters on a serial terminal are (almost?) invariably due to wrong baud rate - see: https://learn.sparkfun.com/tutorials/serial-communication
  3. Wrong baud rate is usually due to not running at the speed you thought; check by blinking a LED to see if you get the speed you expected
  4. Difference between a crystal, and a crystal oscillatorhttps://www.avrfreaks.net/comment...
  5. When your question is resolved, mark the solution: https://www.avrfreaks.net/comment...
  6. Beginner's "Getting Started" tips: https://www.avrfreaks.net/comment...
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 1

I'm sure that the people at ST would be able to sort this out pretty quickly.

John Samperi

Ampertronics Pty. Ltd.

https://www.ampertronics.com.au

* Electronic Design * Custom Products * Contract Assembly

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

can it be you used the same buffer for two sides of reception?

so you receive word from the PC and diectly after that from the widget and that thus the widget overwrites the buffer from the PC hence the PC has send the data, technically the ARM has received the data, but before it could get processed it has been overwritten by the widget so there is no valid data in the buffer for the PC side decoding......

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

No, they're separate buffers: they're only used for incoming data. Outgoing data is blocking, except for interrupts. I do think though that it's a timing thing... a race condition perhaps.

 

But this morning I managed to catch a glitch: the response message from the PC was not correctly treated somewhere in the chain. So if the PC flagged it as an error, the widget probably did too. Though it should have reset in that case... Investigations continue!

 

Neil

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

"Dare to be naïve." - Buckminster Fuller

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Indeed... but the question is: is it still multithreaded if the only thread change is via an external interrupt?

 

Neil

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Yes

CON43-C. Do not allow data races in multithreaded code - SEI CERT C Coding Standard - Confluence

[mid-page]

Noncompliant Code Example (Volatile)

The data race can be disabled by declaring the data to be volatile, because the volatile keyword forces the compiler to not produce two reads of the data. ...

volatile is commonly recommended

[TUT] Newbie's Guide to AVR Interrupts | AVR Freaks

Part 5: Things you Need to Know

 

"Dare to be naïve." - Buckminster Fuller

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Well I've finally managed to catch a trace:

 

 

That last transfer should be [0a] (f5) [f5]... and in similar fashion to the line above, though not as many entries. But certainly not just one response...

 

The widget has returned the wrong value, and that's upset the PC emulator, and that's stopped sending data after one byte, which stops the widget mid-receive. Now to find out why :) Emulator behaviour is as expected.

 

Neil (and yes, volatiles are possible.)

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

barnacle wrote:
is it still multithreaded if the only thread change is via an external interrupt?

Whether or not you label it "multithreaded", the issues that can arise are very similar (if not the same?)

Top Tips:

  1. How to properly post source code - see: https://www.avrfreaks.net/comment... - also how to properly include images/pictures
  2. "Garbage" characters on a serial terminal are (almost?) invariably due to wrong baud rate - see: https://learn.sparkfun.com/tutorials/serial-communication
  3. Wrong baud rate is usually due to not running at the speed you thought; check by blinking a LED to see if you get the speed you expected
  4. Difference between a crystal, and a crystal oscillatorhttps://www.avrfreaks.net/comment...
  5. When your question is resolved, mark the solution: https://www.avrfreaks.net/comment...
  6. Beginner's "Getting Started" tips: https://www.avrfreaks.net/comment...
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

I would look at any vars that are modified in an isr, and make sure they are interrupt protected if accessed by other code. For example, a buffer may have some vars to keep track of its state like head/tail/count, where head and tail are only accessed by a single piece of code and count is accessed by more than one (at least one of them being an isr). In that case, count would have to be interrupt protected when non-isr code modifies the value. With nested interrupts, you would also have to consider whether another isr of a higher priority also accesses the same var. The 32bit mcu means your atomic problems are a little less than the 8bit mcu, but the same atomic problems still exists and you get to make up for the difference with the addition of nested interrupts/priorities.

 

Maybe not a cause of this problem and doesn't really explain the 'lockup', but atomic problems reveal themselves infrequently simply because of the probabilities they are up against. Change code in some way, and your probabilities may also change- if for the better it may become so rare that you do not know it exists, if for the worse then you get something elusive but you know it exists.

 

Anyway, I would at least get settled in my mind that atomic things are squared away so you can check off that box.

 

This reply has been marked as the solution. 
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 1

Finally sorted it...

 

This was a bit obvious and a bit subtle at the same time.

 

  1. Rx character, interrupt puts character into buffer, updates write pointer. Lots of volatiles on everything there.
  2. Main routine calls a check-for-input routine which compares the read and write pointer and returns 0xffff if they're the same, or the byte at the read point if they're different (and updates read pointer).
  3. Talk-to-ECU routine
    1. sends a requested byte
    2. receives the same byte immediately back (single wire system so rx and tx are connected)
    3. receives the response from the ECU which is usually (but not always) a bit-complement of the initial sent byte

And that's where it was broken...

3.2 was handled by a loop waiting for the check-for-input to return something other than 0xffff to confirm that the echoed byte had been received.

3.3 was handled by another loop doing the same thing, though in a slightly different way to handle three cases for the expected response byte (inverted, as sent, and no response). Which is fine, but, some idiot decided to use a uint8_t to catch the return value from check-for-input. Which is fine if there's something there, but means that the loop only ever runs once - since it can't *ever* equal 0xffff...

 

The returned byte from the emulator arrived so quickly that in almost every case the response byte arrived before it was tested for. Check-for-input cheerfully returned the received character and all was well. But on the odd occasion when the emulator was a little slower, check-for-input returned 0xffff, which was silently truncated to 0xff; that was apparently a response which did not match the expected character and therefore stopped things.

 

The fix: use uint16_t to catch the response... and now it's run without error for twenty hours. Amazing... As to why it took so long to find? Mainly because sometimes it would run several hours without glitching, and sometimes only half an hour, but either way, very few options to catch the bug, try some monitoring code additions, and see what happened. This also explains why it never failed when I had the full debugging on; the time taken to output the stream data was much longer than the time for a new byte to arrive, so it was always there when check-for-input got around to asking for it.

 

A question: is there a GCC compiler flag which raises a warning (or error) in the case of trying to stuff a value in something that's too small for it? Specifically, using a uint8_t as a receiver for a uint16_t prototyped return value?

 

Neil

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

There is -Wnarrowing but I think it is only supported for C++ (-Wnarrowing doesn't have its own documentation on https://gcc.gnu.org/onlinedocs/g...).

github.com/apcountryman/build-avr-gcc: a script for building avr-gcc

github.com/apcountryman/toolchain-avr-gcc: a CMake toolchain for cross compiling for the Atmel AVR family of microcontrollers

github.com/apcountryman/avr-libcpp: a partial implementation of the C++ standard library (C++17 only) for use with avr-gcc/avr-libc

github.com/apcountryman/picolibrary: a C++ microcontroller driver/utility library targeted for use with resource constrained microcontrollers

github.com/apcountryman/picolibrary-microchip-megaavr: picolibrary Hardware Interface Layer (HIL) for Microchip megaAVR microcontrollers

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Well done!

barnacle wrote:
... check-for-input returned 0xffff, which was silently truncated to 0xff;
Some linters will warn on defects of loss of precision and sign.

C linters filter on

  • ambiguities (implementation defined, listed in the C standard)
  • known defects (pattern matching; CWE, CVE, etc)

Static analyzers should identify the defect though at a cost with value analysis (cost is price plus, analyzer is sound or unsound); assertions can be in-lieu of static analysis (assertions associated with check-for-input, assertion is the actual value is in the set of expected values)

A source code reviewer may identify the defect though there are a lot of rules and recommendations for the reviewer to remember; best is several reviewers instead of a few or one with all in a range of experience from mentee to journeyman to mentor (one can be educated and trained in code review)

 

P.S.

barnacle wrote:
Lots of volatiles on everything there.
'volatile' has side-effects; C11 has atomic types.

 


ISO/IEC 9899:201x

Programming languages — C

[page 570]

Annex I (informative) Common warnings

[1/4 page]

— An implicit narrowing conversion is encountered, such as the assignment of a long int or a double to an int, or a pointer to void to a pointer to any type other than a character type (6.3).

 

[page 572]

Annex J (informative) Portability issues

INT31-C. Ensure that integer conversions do not result in lost or misinterpreted data - SEI CERT C Coding Standard - Confluence

C Code Smell: Implicit casts should not lower precision (SonarSource)

C Code Smell: Signed and unsigned types should not be mixed in expressions (SonarSource)

 


If there's a source code snippet with the defect and the snippet is sanitized then can run that snippet through PC-lint Plus or Cppcheck; otherwise, Visual Studio has a linter (file scope)

PC-lint Plus Online Demo - Gimpel Software - The Leader in Static Analysis for C and C++ with PC-lint Plus

Online Demo - Cppcheck

/analyze (Code analysis) | Microsoft Docs

 

LINT by Jack Ganssle

 


CON02-C. Do not use volatile as a synchronization primitive - SEI CERT C Coding Standard - Confluence

 

"Dare to be naïve." - Buckminster Fuller

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

barnacle wrote:
A question: is there a GCC compiler flag which raises a warning (or error) in the case of trying to stuff a value in something that's too small for it?

There is (but I cannot remember it now) The warning was effectively useless because it (pedanticly and possibly correctly) threw a warning on innoculous code modifying bytes such as bar here:

uint8_t foo (uint8_t bar)
{
    uint8_t baz = bar + 1;
    return baz;
}

 

It became so intensely annoying that I threw the baby out with the bathwater.

 

Last Edited: Wed. May 5, 2021 - 09:34 PM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Enabling MISRA identified the potential defect.

#include <stdint.h>

uint8_t foo (uint8_t bar)
{
    uint8_t baz = bar + (uint8_t)1;
    return baz;
}

 


PC-lint Plus Online Demo - Gimpel Software - The Leader in Static Analysis for C and C++ with PC-lint Plus

 

edit : PC-lint Plus' front-end is based on Clang; don't know if Clang has a narrowing warning.

 

"Dare to be naïve." - Buckminster Fuller

Last Edited: Wed. May 5, 2021 - 10:37 PM
  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

"Dare to be naïve." - Buckminster Fuller

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Curious. I would have expected 'you're trying to put too large a variable into a space too small for it' to be a standard error. I recall warnings about loss of precision, though I don't recall the exact error, and none of them popped up with this case.

 

Even if you have a language with automatic hidden promotion (a different rant) you'd surely expect at least a warning if truncation is happening.

 

Neil

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

N.Winterbottom wrote:
There is (but I cannot remember it now)

OK - I remembered it: -Wconversion throws the warnings.

 

Here's the example: https://godbolt.org/z/Y3Kd8EKo5

 

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

One of C's ambiguities.

ISO/IEC 9899:201x

Programming languages — C

[page 575]

J.2 Undefined behavior

[page 576, 2/3 page, 6.3.1.4]

Narrowing may not be a defect by another C compiler; glad narrowing is detected by some C linters.

barnacle wrote:
... and none of them popped up with this case.
Possibly a defect in the compiler; might try the defective source code snippet through a validated compiler.

CompCert - Main page

 

"Dare to be naïve." - Buckminster Fuller

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Thanks!

Good that GCC can detect such a portability issue and with the word 'may'.

 

"Dare to be naïve." - Buckminster Fuller

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Appears the problem function is revealed in another thread, although the description of it was enough. Avoid mixing data and return values. The compiler is good when told correctly what to watch for (and its easy enough to forget to tell it), but not letting it get to the stage where the compiler has to put out a warning is better.

 

uint16_t serial_in_char (SERIALS channel)
{
    // returns a byte from the selected channel, or 0xffff if there is nothing waiting
    uint16_t     retval = 0xffff;

    if (rx_buffer[channel].read_pointer != rx_buffer[channel].write_pointer)
    {
        retval = (uint8_t) rx_buffer[channel].buffer[rx_buffer[channel].read_pointer ++];
        if (rx_buffer[channel].read_pointer == BUFFERSIZE)
        {
            rx_buffer[channel].read_pointer = 0;
        }
    }
    return retval;
}

 

 

Pointers and bools are available for hire. The compiler likes pointers and will 'point' out when you use the wrong type. Bools are easy for humans to deal with- true/false, yes/no, on/off, light/dark, up/down, good/bad, left/right, c/asm, and so on.  The bool can also sit in an if statement all alone so no == >= <= != to confuse the mind either. 

 

bool serial_in_char (SERIALS channel, uint8_t* data)
{

    //assume compiler is making sure channel is valid via enum type

 

    typeof(rx_buffer[0])      *buf = &rx_buffer[channel];
    typeof(buf->read_pointer) *rp  = &buf->read_pointer;

    if( *rp == buf->write_pointer ) return false;

 

    *data = buf->buffer[ (*rp) ++ ]; //post inc pointed value, not pointer
    if( *rp >= BUFFERSIZE ) *rp = 0;
    return true;      
}

 

uint8_t x = 0x23;

send( CHAN0, x );

 

uint8_t v; //if wrong type, compiler will let you know

while( ! serial_in_char(CHAN0, &v) ){} //wait for 'echo' on rx

if( v != x ){ /*bad echo*/ }

 

I have used the combined data/return value, but it leads to trouble or is just awkward to deal with. If needing only pass/fail from a function, return a bool. If need more, then use enums. In all cases data and return values stay in their own boxes.

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

Funny thing is, I cleaned it up by combining two functions; one checked whether there was data and the other grabbed it. This looked safer - and works fine when the place for the return data is the right size.

 

Isn't that what crashed that Ariane 5 some years ago?

 

Neil

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

barnacle wrote:
Isn't that what crashed that Ariane 5 some years ago?
Narrowing was a contributing factor.

Crash and Burn by Jack Ganssle

[2/3 page]

Ariane 5

[next to last paragraph]

To summarize: poorly tested code that should not have been running caused a floating point conversion error because the spec didn't call for an understanding of real flight dynamics. In an effort to keep processor loading low the variables involved weren't monitored, though others were. Two redundant SRIs running the same code performed identically and shut down. The main computer ignored the SRI "bad data" bit and try to fly using corrupt information.

...

 

"Dare to be naïve." - Buckminster Fuller

  • 1
  • 2
  • 3
  • 4
  • 5
Total votes: 0

-Wnarrowing is implied for C++.

#include <stdint.h>

int main() {
    uint8_t  n = 0;
    uint16_t v = 0xffff;
    n = v;
    n = {v};
}

or all-in-one at https://coliru.stacked-crooked.com/a/021e987af0fb1fd5

due to Assignment operators - cppreference.com (mid-page, Example)

 

"Dare to be naïve." - Buckminster Fuller