Functional Safety?


In the context of product design and,  in particular, program design, what is generally meant by "Functional Safety"? 

 

I see a bunch of standards (IEC, and others?). A quick but fairly thorough look does not reveal what the underlying idea is.

 

Am I correct that this is intended to address life safety concerns (medical, vehicle, and such)? Or, is it more broad than that?

 

Thanks

Jim

Jim Wagner Oregon Research Electronics, Consulting Div. Tangent, OR, USA http://www.orelectronics.net


MC is saying use at your own risk...

 

 

SL is saying do not use...

 

 

Could it be that SL knows of a way the chip can fail that has nothing to do with the user screwing up, while MC has so much confidence in their process and design that they believe only the user can make it fail, so the user is at fault?

 

update: wording

my projects: https://github.com/epccs

Debugging is harder than programming - don’t write code you can’t debug! https://www.avrfreaks.net/forum/help-it-doesnt-work

Last Edited: Sat. Apr 18, 2020 - 02:05 AM

     From Wikipedia (Link).

 

IEC 61508 is an international standard published by the International Electrotechnical Commission consisting of methods on how to apply, design, deploy and maintain automatic protection systems called safety-related systems. It is titled Functional Safety of Electrical/Electronic/Programmable Electronic Safety-related Systems (E/E/PE, or E/E/PES).

IEC 61508 is a basic functional safety standard applicable to all kinds of industry. It defines functional safety as: “part of the overall safety relating to the EUC (Equipment Under Control) and the EUC control system which depends on the correct functioning of the E/E/PE safety-related systems, other technology safety-related systems and external risk reduction facilities.” The fundamental concept is that any safety-related system must work correctly or fail in a predictable (safe) way.

 


Chuck99 wrote:

The fundamental concept is that any safety-related system must work correctly or fail in a predictable (safe) way.

OK, so the user's blood shall not adversely affect the correct operation of the equipment.

 

Ross McKenzie ValuSoft Melbourne Australia


valusoft wrote:

OK, so the user's blood shall not adversely affect the correct operation of the equipment.

 

By Jove, I think he's got it!


It's been a while since I've read the standards. Nevertheless, the general gist is to improve reliability - i.e. that the device in question performs its intended function correctly, and attempts to detect various failures so that it can fail gracefully.

 

 

IEC 61508 is the umbrella standard for this.

 

The EU applied this to white goods, as there was a trend for washing machines etc. to have microprocessor-based controls, and failure of these controls could result in unintended actions such as the water being turned on continuously and flooding the premises. The concept of functional safety extends up towards more critical systems, such as automotive and medical, where failure could cause injury or death. For these, there is the SIL (safety integrity level) rating, which determines the level of functional safety.

 

How this usually plays out in a design is:

1. CRC check of the flash code

2. CPU register check

3. maybe checksum, ECC, CRC etc. of the RAM at runtime.

4. external oscillator checks - has the crystal failed? is our clock running at the expected speed?

5. methods of proving the code correctness

6. watchdogs

7. validating data input

8. hardware that is resilient to external interference

9. user interface design - not confusing to the user.

10. detecting errors and reporting them - eg flashing lights on the control panel.

 

As we move up the SIL ladder, we might

 

11. add redundant systems - extra hardware etc to confirm correct operation.

12. more stringent hardware/software validation

13. use devices specifically designed for critical systems

14. more stringent testing
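
As an illustration of item 1 in the list above (CRC check of the flash code), here is a minimal sketch. It assumes the build process stores a CRC-16/CCITT-FALSE value alongside the image; the function names and the choice of CRC are illustrative assumptions, not something the standards mandate. Many micros have a hardware CRC unit that would replace the bitwise loop.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdbool.h>

/* CRC-16/CCITT-FALSE: polynomial 0x1021, initial value 0xFFFF,
 * no bit reflection, no final XOR. Bitwise, so slow but table-free. */
uint16_t crc16_ccitt(const uint8_t *data, size_t len)
{
    uint16_t crc = 0xFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)((uint16_t)data[i] << 8);
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 0x8000u) {
                crc = (uint16_t)((crc << 1) ^ 0x1021u);
            } else {
                crc = (uint16_t)(crc << 1);
            }
        }
    }
    return crc;
}

/* Hypothetical helper: true when the image matches the CRC stored at build time. */
bool image_is_intact(const uint8_t *image, size_t len, uint16_t stored_crc)
{
    return crc16_ccitt(image, len) == stored_crc;
}
```

At startup the firmware would call image_is_intact() over the application area and refuse to run (or stay in a safe state) if it returns false.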

 

The basis for a reliable system is:

1. tolerate one failure

2. that the failure can be detected

An FMEA is the usual method for documenting this.

 

For those here that are pilots, you are trained to do some basic flight checks before starting the engine. Similarly with a microprocessor-based system - we do some basic 'flight checks' just to ensure everything is operational. The conundrum is how do you get a processor to check itself? If the execution platform is faulty, how can the check be conclusive? The reality is that it will probably never fail the test, but if you don't test, you can never be sure!

 

Testing the crystal frequency is somewhat easier - you can easily make a crystal fail. Many micros have crystal fail detection and internal oscillators. A simple test can determine whether the crystal is roughly on frequency or not. The intention is not to ensure the crystal is bang on spec, but rather to determine whether it is oscillating and within the expected frequency range.

 

The basic premise here is: anticipate the 'usual' failure mode and test for it. With a crystal the 'usual' failure mode would be that it doesn't oscillate, or it might oscillate on the wrong harmonic. These are gross errors that are fairly easy to test for. It is unlikely that the crystal is the wrong frequency - this would have been tested for in manufacturing.
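
A gross clock check of this kind can be sketched as follows. The assumption (not from the post) is that the firmware counts main-clock ticks during a fixed window timed by an independent source, such as an internal RC or watchdog oscillator, and then compares the count against the expected value with a wide tolerance. The function name and parameters are hypothetical.

```c
#include <stdint.h>
#include <stdbool.h>

/* counted:     main-clock ticks seen during the reference window
 * expected:    ticks we would see if the crystal ran at nominal frequency
 * tol_percent: wide tolerance; we only want to catch gross errors */
bool clock_is_plausible(uint32_t counted, uint32_t expected, uint32_t tol_percent)
{
    if (counted == 0u) {
        return false;               /* crystal is not oscillating at all */
    }
    uint64_t margin = ((uint64_t)expected * tol_percent) / 100u;
    uint64_t lo = (margin > expected) ? 0u : (uint64_t)expected - margin;
    uint64_t hi = (uint64_t)expected + margin;
    return counted >= lo && counted <= hi;
}
```

With a 10% tolerance this passes a crystal that is slightly off but rejects a stopped oscillator or one running on a harmonic (for example three times the expected count).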

 

Like our pilot/plane test - the pre flight tests don't ensure the plane won't fail in flight or that the systems won't perform as expected in flight, but rather a simple test to see that it is basically operational. Obviously if a flap is binding, it is not going to improve in flight, so the plane doesn't fly. 

 

The topic of code validation is a challenging one - The likes of Cliff and I can be heard rabbiting on about MISRA etc. https://www.misra.org.uk/Publica...

The reference to this is usually the publication MISRA C. It is basically a set of rules that they have determined should avoid some of the more common pitfalls programmers fall into when using C. For Cliff, compliance with this is most likely mandatory.

 

Using industry standards is an easier way of demonstrating compliance. IEC 61508 is not prescriptive, it is descriptive - i.e. "we're not going to tell you exactly what you have to do, but these are the methods you should follow". MISRA C is prescriptive - i.e. "you must do it this way".

 

The general flow for software is to document:

what it is expected to do

the methods used to write the code

how the code was tested

 

The upshot is to apply known processes and tests - this doesn't guarantee the code is correct, but demonstrates some effort was applied to making it so.

 

I'm working on a system that has a reasonable degree of functional safety in it. Here's how I've addressed some of the requirements:

1. the processor has ECC on RAM and flash (this is a US$1 chip in quantities of 100)

2. the code 'scrubs' the RAM on a regular basis - it reads all of the RAM. If the hardware detects an error, we reboot.

3. a flash ECC error forces a reboot

4. no external crystal used. 

5. two methods used to measure the system voltage

6. a switched load is used to verify the current sense circuit is working

7. hardware watchdog for relays - the micro has to toggle pins to make things work.

8. redundant relay

9. relay operation can be tested

10. the code to implement the logic is a finite state machine - if it locks up the hardware watchdog doesn't get toggled and the relays turn off.

11. the code self tests the hardware
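
Items 7 and 10 above can be sketched roughly like this. It is only an illustration under assumptions: the state names, the controller_t type and the wd_pin field (standing in for the real GPIO that feeds the external relay watchdog) are all hypothetical. The point is that the pin only toggles when the state machine runs and is not in its fault state, so a lock-up or a detected fault stops the toggling and the hardware drops the relays.

```c
#include <stdint.h>
#include <stdbool.h>

typedef enum { ST_SELF_TEST, ST_IDLE, ST_RUN, ST_FAULT } state_t;

typedef struct {
    state_t state;
    bool    wd_pin;   /* level driven onto the (hypothetical) watchdog pin */
} controller_t;

void controller_init(controller_t *c)
{
    c->state  = ST_SELF_TEST;
    c->wd_pin = false;
}

/* Called once per control cycle. */
void controller_step(controller_t *c, bool self_test_ok,
                     bool run_request, bool fault_detected)
{
    if (fault_detected) {
        c->state = ST_FAULT;
    }

    switch (c->state) {
    case ST_SELF_TEST:
        c->state = self_test_ok ? ST_IDLE : ST_FAULT;
        break;
    case ST_IDLE:
        if (run_request) c->state = ST_RUN;
        break;
    case ST_RUN:
        if (!run_request) c->state = ST_IDLE;
        break;
    case ST_FAULT:
    default:
        c->state = ST_FAULT;   /* latched: only a reset leaves this state */
        break;
    }

    /* Toggle the watchdog pin only while healthy. If this function stops
     * being called, or we are in ST_FAULT, the pin stops changing and the
     * external circuit releases the relays. */
    if (c->state != ST_FAULT) {
        c->wd_pin = !c->wd_pin;
    }
}
```

On real hardware wd_pin would be written to the output register after each step; the external circuit needs to respond to edges, not levels, so that a pin stuck high or low also turns the relays off.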



Or, maybe (also) that the equipment will not adversely affect the correct operation of the user's blood?

 

Jim

 

 

Jim Wagner Oregon Research Electronics, Consulting Div. Tangent, OR, USA http://www.orelectronics.net


Jim,

 

I take it you already found and read...

 

https://en.m.wikipedia.org/wiki/...

 

which bit of that wasn’t clear?

 

(in the world where I work we are guided by ASIL https://en.m.wikipedia.org/wiki/... )


There's more to Functional Safety than simply following the rules (MISRA etc.) or satisfying ISO or EN standards. It needs careful thought about how your product may be used (or abused), who may be using it, for what purpose, and what could conceivably go wrong.

 

The Boeing MCAS saga comes to mind where the 737MAX isn't inherently unsafe, but poor airmanship combined with high workload can become unsafe.

 

I'm sure we've all read many articles about MCAS, but this one is particularly well written and I found it fascinating.

http://www.eaa124.org/Newsletter/Oct2019.pdf

 

The original article was published here but is behind a login-wall (if that word exists)

https://www.nytimes.com/2019/09/18/magazine/boeing-737-max-crashes.html

 


One of the things I worry about, reliability-wise, is that my products (broadly, data-logging environmental sensors) may malfunction in a way that causes them to keep logging, but with faulty data. I can easily see scenarios in which that faulty data is used to make decisions that end up adversely impacting people's lives or health, or negatively impacting the environment when they should be doing just the opposite. Of course, no single sensor should be used for such decisions, but we know that they are.

 

I really struggle, trying to figure out ways to ensure that the data is valid. But, when good data can vary from "DC to daylight", a good scheme still eludes me. For example, we know that -250F for air temperature is bad, because there is no place on earth that is that cold in the natural environment. But, is -10F bad or not? Sometimes, it could be.

 

As a result, I am left with a series of hardware tests - is the clock reasonable? - Do I get the expected "Who-Am-I" result from an I2C sensor? - Does that SPI peripheral behave as it should? Then it is always a puzzle what to do if one of these isn't valid. Stopping recording seems to be the only safe action!

 

So, I keep looking, trying to see what other engineers or programmers do. 

 

Jim

Jim Wagner Oregon Research Electronics, Consulting Div. Tangent, OR, USA http://www.orelectronics.net


N.Winterbottom wrote:
The Boeing MCAS saga comes to mind where the 737MAX isn't inherently unsafe, but poor airmanship combined with high workload can become unsafe.
The flight crew immediately before Lion Air 610 had the blessing of a check pilot in the jump seat when the same issue occurred.

As the flight crew was absolutely immersed in "aviate, navigate, communicate" (ie overloaded) the check pilot correctly diagnosed the loss of pitch control issue and correctly generated a recommendation (IIRC, turn off automated elevator trim, two switches forward and right of the throttles, then the pilot-in-control can manually trim the aircraft)

Boeing 737 are popular aircraft for relatively inexperienced first officers.

Bites in the ass are via Boeing, Airbus, etc as flight software defects are a bit common.

Likewise with weapon software defects, though the defect rate is greater (less risk averse?)

The Patriot Missile Failure via ECE 4760 (Cornell University, School of Electrical and Computer Engineering, Reading, General)

 

"Dare to be naïve." - Buckminster Fuller


Actually, I did not find the Wikipedia reference that Cliff linked. There is a very similar, but seemingly somewhat briefer, article that I did find at

 

https://en.wikipedia.org/wiki/Fu...

 

The problem that I have with such articles is what they mean and how you can apply them in various applications. What does "no harm" mean when that harm might be subtle or indirect (e.g. decisions by others based on faulty results)? Think airspeed indicator. If that is "bad", a pilot could take the wrong aircraft control actions. A "commercial" plane would probably have more than one way of sensing air speed, but a light aviation aircraft might not. This is an example of a device that does NOT cause harm on its own, but which could cause someone else to take harmful action.

 

Jim

Jim Wagner Oregon Research Electronics, Consulting Div. Tangent, OR, USA http://www.orelectronics.net


That is a challenging question.

 

Perhaps you can mount the accelerometer in a gimbal and once an hour have a servo spin the device to make sure each axis records + 1 g... <smirk>

 

It seems it becomes a question of system size, cost, mission criticality, etc.

 

You could add a second accelerometer, but then:

Do you put it on the same I2C bus?

Do you put it on a separate I2C bus?

Do you put it on its own micro, and report data to a supervisor micro?

Does the Supervisor micro use a common power supply?

Does each sensor & micro, and the supervisor micro have their own power supplies?

Who watches the Supervisor...?

etc.

 

Part of this boils down to what failure / error modes you might encounter, their probability, and their impact on the system's data.

If you lose a light sensor's output, but maintain the tree sway data, then maybe that partial failure can be easily recognized and some of the data is still useful, so total system shutdown isn't the best approach.

 

In a car the loss of an anti-lock braking system sensor on one wheel turns off the entire ABS system, and it reverts to non-ABS mode braking.

Yet if braking for one wheel fails, the system maintains braking for the other wheels, as that is a better failure mitigation strategy.

It doesn't shut down the entire braking system to all wheels.

 

Memory allowing, you can attempt to perform an integrity check on the data as it is obtained.

A submarine can't go from 200 meters below surface depth to 20 meters BSD between readings, it takes time to ascend.

Likewise the tree you are tracking can't go from bending 10 degrees in the South direction to bending 10 degrees in the North direction between readings, that also takes time.

 

A sudden catastrophic sensor failure, (lightning, voltage spike, broken connection on a corroded solder joint, etc.), is easily detected by trending the data and having a valid data window.
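
A minimal sketch of such a valid data window: each new reading is accepted only if the change since the previous one is physically possible in the elapsed time. The function name, the integer units and the maximum rate are assumptions chosen for illustration.

```c
#include <stdint.h>
#include <stdbool.h>

/* previous, current: consecutive readings in the same units (e.g. metres)
 * elapsed_s:         seconds between the two readings
 * max_rate_per_s:    largest physically believable change per second */
bool change_is_plausible(int32_t previous, int32_t current,
                         uint32_t elapsed_s, uint32_t max_rate_per_s)
{
    int64_t delta = (int64_t)current - (int64_t)previous;
    if (delta < 0) {
        delta = -delta;
    }
    int64_t limit = (int64_t)max_rate_per_s * (int64_t)elapsed_s;
    return delta <= limit;
}
```

Using the submarine example with an assumed maximum ascent rate of 2 m/s, change_is_plausible(200, 20, 10, 2) is false because 180 m in 10 s is not possible, whereas change_is_plausible(200, 190, 10, 2) is true.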

 

Having duplication in the sensors is more costly, but allows for "independent" measurements of the given parameter.

They ought to be within X % of each other.

If not, there is an error to be dealt with.

So easier, (perhaps), error detection for one signal processing pathway and a single fault.
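
The "within X % of each other" comparison might look like the sketch below. The percentage is taken relative to the larger of the two magnitudes, and an absolute floor is added so that readings near zero are not rejected over a count or two of noise; both choices, and the function name, are assumptions.

```c
#include <stdint.h>
#include <stdbool.h>

/* Absolute value widened to 64 bits so INT32_MIN cannot overflow. */
static int64_t abs64(int64_t v) { return (v < 0) ? -v : v; }

/* True when readings a and b differ by no more than tol_percent of the
 * larger magnitude, or by no more than abs_floor, whichever is greater. */
bool sensors_agree(int32_t a, int32_t b, uint32_t tol_percent, uint32_t abs_floor)
{
    int64_t diff = abs64((int64_t)a - (int64_t)b);
    int64_t ref  = (abs64(a) > abs64(b)) ? abs64(a) : abs64(b);
    int64_t allowed = (ref * (int64_t)tol_percent) / 100;
    if (allowed < (int64_t)abs_floor) {
        allowed = (int64_t)abs_floor;
    }
    return diff <= allowed;
}
```

Note that with only two sensors a disagreement tells you there is an error, not which sensor is wrong; deciding that needs a third source or some other plausibility check.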

 

Interestingly, on my last project, I included a second sensor from a different manufacturer, connected via a non-I2C interface, as a sanity check on the primary sensor's integrity on power up.

(Thereafter, however, just the primary sensor's data was utilized.)

 

JC

Edit: Typo

Last Edited: Sat. Apr 18, 2020 - 09:48 PM

Jay,

Sorry for complete off topic but I keep meaning to ask how you are involved in Covid support and how you are coping with things? Above all hope you are keeping safe!

Cliff


ka7ehk wrote:
malfunction in a way that causes them to keep logging
I could not help noticing the (albeit accidental) alignment of those words with the current deforestation issue in Brazil.

Ross McKenzie ValuSoft Melbourne Australia


Hi Cliff,

Perhaps Jim won't mind a brief side track.

 

My main job is working in the ER, and I have a 50% increase in scheduled shifts for May...

 

Lately, interestingly, the ER volume, (# patients / day), has actually been down.  Those that I see tend to be pretty sick, but many of the "minor complaint" patients aren't coming to the ER these days.  Perhaps they are afraid of catching Covid-19.  The vast majority of the Covid or presumed Covid patients that I see get tuned up in the ER and sent home, (thankfully!).  My first two patients last shift were a 42 year old male having a massive heart attack, followed by an elderly gentleman having a stroke.  So the non-Covid-19 illness and injury certainly continue unabated!

 

My next job is as the EMS Med Dir for a number of Fire/EMS and Private ambulance services.  That has kept me very busy.  I've been on waaaay too many teleconferences to count.  Dealing with other EMS Docs, ER Docs, local, county, and State Health Departments, Emergency Management Agencies, Fire Chief meetings, CDC webcasts, etc.  I'm truly tired of Covid-19!  One has to sort through tons of rhetoric and poorly founded, theoretically based beliefs to determine what actually constitutes a reasonable course of action based upon what is currently known.

 

Lots of planning in case things do go poorly, (i.e. a big surge in ER volume, or EMS call volume far exceeds capability, or too many ER or EMS staff are out ill with the infection to maintain sufficient staffing, etc.).  And the number of minute details in terms of testing, quarantining, decontaminating, protective wear (PPE), and its potential recycling, goes on and on.

 

So, Life Has Been Interesting!

 

I, too, have been doing more work at home, and have had numerous training and QA meetings with my EMS agencies canceled.  So when I most would like to have face-to-face time with them to discuss Covid issues I end up doing so via eMail, which isn't as effective in spreading one's message, or in squelching fear and mis-truths.

 

I also have to deal with a lot of paranoia.  As an engineer at heart I approach many of the Covid issues very differently than some of my peers and others.  I like data, (when it's available), and I like to plot out step-by-step potential consequences of numerous "recommendations" being made, be it at the local level or higher.  Often the voice of reason is drowned out by the less well informed majority...

 

Two closing thoughts...  Just turn off CNN and the TV.  It's all Doom & Gloom and stir the pot.  In the two hospital / three ER system in which I work I think the last numbers I saw from a couple days ago were that we had 14 Covid-19 (+) patients on ventilators, all of whom eventually went home! (Of all of the Covid-19 (+) and presumptive Covid patients we have seen I think we have had <= two deaths, and those were individuals with significant underlying health issues.  Don't quote me on that number; I am too tired to look up the data at the moment, and that is from memory.)

 

Remember, 80% of the people who get Covid-19 will have either no symptoms or only minor symptoms.

80% !  That is the vast majority of us!

15% will have moderate symptoms, so you'll definitely feel sick, but you're going to do fine.   

5% will have significant symptoms and end up in the hospital and/or the intensive care unit.  As noted above, many of those individuals will also have a good outcome.  But not all, (sadly).

 

I personally believe that we are striving for "Herd Immunity".  The Herd could be the local nursing home, one's town, New York City, a State, a county, or the entire earthly human population.  Take your pick.  In any event, those of us who live in reasonably populated environments, (City or Town), and not in the middle of Montana, (for example), WILL get the infection, sooner or later.

 

The reason to "flatten the curve" is not to prevent one from getting the infection, it is to prevent the country, (the entire herd), from getting the infection over 6 weeks, whereupon 5% would exceed the country's health care system's ability to care for the critically ill patients, and to instead spread that time frame for herd (country) immunity out over 6 months or longer, whereupon the health care system can cope with caring for the critically ill.  (Choose your own example timeframes; those cited are for the sake of argument only.)

 

BTW, don't ask me how much faith I have in the various computer simulation models that some very big league academic centers are touting...  There are far too many variables and too little data and too little prior experience with this type of infection to accurately model its course.  (Although the virus may be related to SARS and MERS, the outbreak / pandemic is not.)

 

Life goes on!  I did go on "Spring Break" a few weeks ago with my family, and spent the mornings on teleconferences, and the afternoons tinkering on a microcontroller project, as I like to do when I'm on vacation.  It was good to have a little "down time".

 

Oh well, I could spend hours and pages chatting about this, but I'd rather stick primarily to embedded design projects on the Forum.  I don't wish to get into medical debates. 

That said, thank you for asking!  I appreciate your concern and interest, and that of a few other Freaks who I've been in touch with.  It is certainly nice to have an extended family that looks out for each other!

 

This too will pass!

 

Take care,

 

Jay    

 

  


Concerning Ross' comment, a short anecdote might help.

 

I recently exhibited at a local Tech Expo associated with Oregon State University. Forestry is a big deal here. Lots of  people own forest land, study forestry, or work in the forest products industry, from logging to lumber and plywood mills to hauling finished wood  products. The sign at the back of my booth said "Logging Accelerometers". Duhh, I never even thought about that until the 3rd or 4th person asked me what it had to do with the wood products industry. I didn't "get it" the first time, but finally caught on. 

 

Yet, actually, the whole genesis of the product was brought on exactly by logging in the  Amazon. The logging is so  extensive that the climate is actually changing. Rainfall patterns are different than just a few dozen years ago. The original ones were built to try  to help estimate what the real rainfall is. It is a difficult thing to measure because a lot of the rain is intercepted by the tree canopy and re-evaporates back into the atmosphere before reaching the ground.

 

Jim

Jim Wagner Oregon Research Electronics, Consulting Div. Tangent, OR, USA http://www.orelectronics.net

Last Edited: Sun. Apr 19, 2020 - 01:56 AM

ka7ehk wrote:
The problem that I have with such articles is what they mean and how you can apply them in various applications.

 

With many of the EU standards being descriptive, the onus of interpretation is put on the reader. With the machinery directive, there are probably hundreds of standards, and you need to determine which ones are applicable and which are not. Generally I rely on the test house to guide me - this can be challenging.

 

Pretty well the Doc has summed it up. 

 

Reading the device ID via I2C is a valid functional test. Clearly, if this fails, the device or I2C is not working. If there is a fault that causes the bus to work intermittently, then hopefully that will show up with the test failing occasionally.

As to whether the sensor is giving you valid data? If you read it ten times and it gives you the same or similar value (assuming the value isn't expected to change much), then we can assume the value to be correct. The only way to verify is to have a secondary reference. What to do if you detect a failure? Flag the data as bad. You could set the value to fullscale or a magic number to indicate failure, but that can hide evidence that might be handy in determining the fault later on.
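
One way to sketch the "read it several times, then flag rather than overwrite" idea is shown below. The sample_t record, the flag values and the spread limit are all hypothetical. The reading itself is always stored untouched; only the flags say whether it should be trusted.

```c
#include <stddef.h>
#include <stdint.h>

#define SAMPLE_OK        0x00u
#define SAMPLE_UNSTABLE  0x01u   /* repeated reads disagreed too much */
#define SAMPLE_NO_DATA   0x02u   /* nothing was read */

typedef struct {
    int16_t value;   /* last raw reading, kept as evidence even when flagged */
    uint8_t flags;
} sample_t;

/* readings:   consecutive raw reads taken back to back
 * max_spread: largest (max - min) still considered "the same value" */
sample_t make_sample(const int16_t *readings, size_t count, int16_t max_spread)
{
    sample_t s = { 0, SAMPLE_NO_DATA };
    if (count == 0) {
        return s;
    }
    int16_t lo = readings[0];
    int16_t hi = readings[0];
    for (size_t i = 1; i < count; i++) {
        if (readings[i] < lo) lo = readings[i];
        if (readings[i] > hi) hi = readings[i];
    }
    s.value = readings[count - 1];
    s.flags = (((int32_t)hi - (int32_t)lo) > max_spread) ? SAMPLE_UNSTABLE
                                                         : SAMPLE_OK;
    return s;
}
```

Logging the flags next to the value lets post-processing discard suspect samples while still leaving the raw numbers available for fault finding.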

 

Complex systems like GNSS or image recognition have a 'goodness' value - this reflects how correct it thinks the result is. This is usually done by matching the data with a model. This may be compute intensive for a simple data logger, so that might have to be done in post processing.

 

If we take an I2C temperature sensor -

we can read the ID

we can test the result to see if it is within a valid range

we can test the rate of change of the data vs what is physically possible or acceptable.

 

without any other input, this is probably the limit of functional testing one could do
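
The three tests listed above could be combined into one fault bitmask, roughly as follows. The ID value, the limits and all names are placeholders; a real sensor's datasheet would supply the actual ID register contents and operating range. Temperatures are in tenths of a degree C to stay in integers.

```c
#include <stdint.h>

#define CHK_BAD_ID  0x01u   /* ID register did not read back as expected */
#define CHK_RANGE   0x02u   /* reading outside the valid range */
#define CHK_RATE    0x04u   /* reading changed faster than is believable */

typedef struct {
    uint8_t expected_id;    /* placeholder for the sensor's ID value */
    int16_t min_c10;        /* lowest valid reading, tenths of a degree C */
    int16_t max_c10;        /* highest valid reading, tenths of a degree C */
    int16_t max_step_c10;   /* largest believable change between samples */
} temp_limits_t;

/* Returns 0 when all checks pass, otherwise an OR of CHK_* bits. */
uint8_t temp_check(const temp_limits_t *lim, uint8_t id_read,
                   int16_t prev_c10, int16_t now_c10)
{
    uint8_t faults = 0;

    if (id_read != lim->expected_id) {
        faults |= CHK_BAD_ID;
    }
    if (now_c10 < lim->min_c10 || now_c10 > lim->max_c10) {
        faults |= CHK_RANGE;
    }
    int32_t step = (int32_t)now_c10 - (int32_t)prev_c10;
    if (step < 0) {
        step = -step;
    }
    if (step > lim->max_step_c10) {
        faults |= CHK_RATE;
    }
    return faults;
}
```

Returning a bitmask rather than a single pass/fail keeps the individual causes visible in the log.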

 

If you wanted more tests, you could locate a resistor close to the sensor and pass a known current to give a known temperature rise.

 

If you have other sensors you might be able to infer certain things. Eg for a car ECU the throttle position sensor is fairly critical. Using the engine rpm we can infer what we think the throttle position should be as this regulates the amount of air into the engine and thus the speed of the engine. If the engine is doing 4000RPM and our throttle reading is showing 0 (closed), then there is something wrong. Another example of inference is the engine temperature sensor - if we know how long the engine has been running, we know the temperature must have risen and is within a certain range. If the temperature sensor doesn't reflect that, then there is an error.
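
The throttle/RPM inference can be sketched as a cross-check with a persistence counter, so that a single odd sample does not raise a fault. The thresholds, the strike limit and the names are invented for illustration, and this deliberately ignores cases a real ECU would have to handle, such as engine braking on overrun.

```c
#include <stdint.h>
#include <stdbool.h>

#define RPM_HIGH             3500u  /* assumed "clearly under load" speed */
#define THROTTLE_CLOSED_PCT  2u     /* assumed "reads as closed" level */
#define STRIKE_LIMIT         3u     /* consecutive disagreements before fault */

typedef struct {
    uint8_t strikes;   /* consecutive implausible samples seen so far */
} plaus_t;

/* Returns true once the two sensors have disagreed STRIKE_LIMIT times in a row. */
bool throttle_fault(plaus_t *p, uint16_t rpm, uint8_t throttle_pct)
{
    bool disagree = (rpm >= RPM_HIGH) && (throttle_pct <= THROTTLE_CLOSED_PCT);

    if (disagree) {
        if (p->strikes < STRIKE_LIMIT) {
            p->strikes++;
        }
    } else {
        p->strikes = 0;   /* any plausible sample clears the count */
    }
    return p->strikes >= STRIKE_LIMIT;
}
```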

 

It might be that there is no easy way of validating your data apart from having more than one logger in an area and post processing the data to see if there is correlation.

 



Apologies Jim. Could not let this opportunity go unsupported.

 

DocJC wrote:
Two closing thoughts...  Just turn off CNN and the TV.

Ross McKenzie ValuSoft Melbourne Australia


One interesting point is that some functional safety is not necessarily compatible with minimum power scenarios: as an example, a project on which I worked some time ago completely forbade using GPIO inputs to detect any of the many switch inputs. Most of us here would think nothing of wiring a switch straight into a GPIO port, using an internal pull up/down to define the default state.

 

On this project, all such inputs were analogue inputs. Externally, each required resistors to both rails, smaller by a factor of ten or so than the nominal internal pullup values, so that an open circuit on the switch circuit gave a half-rail value which can immediately be recognised when sampled. Switch levels were usually detected as above 90% or below 10%; anything between those values was an error. Equally, any digital output was measured on a different pin for expected values.
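
The 10% / 90% rule described here could be coded along these lines. The function and enum names are made up, and the ADC full-scale value is passed in rather than assumed. Which of the two valid levels means "switch closed" depends on the wiring, so the sketch only reports the level.

```c
#include <stdint.h>

typedef enum { LEVEL_LOW, LEVEL_HIGH, LEVEL_FAULT } switch_level_t;

/* adc:        raw conversion result for the switch input
 * full_scale: largest possible ADC result (e.g. 1023 for 10 bits) */
switch_level_t classify_switch(uint16_t adc, uint16_t full_scale)
{
    uint32_t low_limit  = (uint32_t)full_scale / 10u;          /* 10% of rail */
    uint32_t high_limit = ((uint32_t)full_scale * 9u) / 10u;   /* 90% of rail */

    if (adc < low_limit) {
        return LEVEL_LOW;
    }
    if (adc > high_limit) {
        return LEVEL_HIGH;
    }
    /* Anything in between, including the half-rail value produced by an
     * open circuit, is treated as a wiring fault. */
    return LEVEL_FAULT;
}
```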

 

Neil


This is similar to what I'm doing on my current project. The old alarm loop concept - a couple of resistors so you can determine multiple states - short, open, active, inactive. With the micro I'm using, the 12-bit ADC conversion completes in around 1.5us, so there's little overhead in using analog inputs.
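
A sketch of the four-state alarm loop decode on a 12-bit ADC follows. The band edges are invented for illustration; real values depend on the resistor values chosen and their tolerances, and which middle band means "active" depends on the wiring.

```c
#include <stdint.h>

typedef enum { LOOP_SHORT, LOOP_ACTIVE, LOOP_INACTIVE, LOOP_OPEN } loop_state_t;

/* Assumed band edges for a 12-bit result (0..4095). */
#define LOOP_SHORT_MAX     400u    /* below this: wiring shorted */
#define LOOP_ACTIVE_MAX    1800u   /* below this: contact active */
#define LOOP_INACTIVE_MAX  3600u   /* below this: contact inactive */
                                   /* at or above: wiring open */

loop_state_t classify_loop(uint16_t adc12)
{
    if (adc12 < LOOP_SHORT_MAX) {
        return LOOP_SHORT;
    }
    if (adc12 < LOOP_ACTIVE_MAX) {
        return LOOP_ACTIVE;
    }
    if (adc12 < LOOP_INACTIVE_MAX) {
        return LOOP_INACTIVE;
    }
    return LOOP_OPEN;
}
```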


"Dare to be naïve." - Buckminster Fuller