Validation by analysis of complex safety systems

The validation of a device by analysis, especially for complex safety systems, can be carried out in five steps.

First step : functional analysis

The purpose of the functional analysis is to identify the functions to be fulfilled by the system. It is also intended to explain the system's operation by establishing a link between the hardware and software functions. This stage is the assessment's input point. It needs to be sufficiently accurate to identify failures with an impact on the system's safety.

Several functional analysis procedures may be used to explain the operation of automatic systems :

  • functional block diagram procedure,
  • SADT procedure,
  • SA_RT procedure,
  • etc.

Second step : failure rate prediction

The purpose of the failure rate prediction is not to assess the system's reliability. Calculations are only conducted for the components with a risk in relation to safety, in order to quantify the dangerous failure rate. To that end, a calculation is used to assess an equivalent failure rate of the system. This calculation takes into account : component failure rates, component stress, climatic environment, component quality, etc.

The failure rate prediction allows us to quantify the FMECA (Failure Modes Effects and Criticality Analysis - See 3rd stage) and to identify the contribution of the various failure modes to the system's unsafe situation.

Failure rate calculations are based on databases that supply a basic failure rate for each type of component. This basic failure rate is then adjusted by corrective factors that depend on the environment and on the component.
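
As an illustration of the calculation described above, the following minimal sketch multiplies a basic failure rate by corrective factors and sums the result over the safety-relevant components. The factor names (pi_environment, pi_quality, pi_stress), the component list and all numerical values are hypothetical placeholders and are not taken from any database. The simple sum assumes that any single component failure fails the function. A real prediction would use the model and figures of the chosen reliability database.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    base_rate: float             # basic failure rate from a database, in failures per hour
    pi_environment: float = 1.0  # corrective factor for the climatic environment (assumed name)
    pi_quality: float = 1.0      # corrective factor for component quality (assumed name)
    pi_stress: float = 1.0       # corrective factor for component stress (assumed name)

    def predicted_rate(self) -> float:
        """Basic failure rate adjusted by the corrective factors."""
        return self.base_rate * self.pi_environment * self.pi_quality * self.pi_stress


def equivalent_failure_rate(components):
    """Sum of the predicted rates (series model: any single failure counts)."""
    return sum(c.predicted_rate() for c in components)


# Hypothetical values, for illustration only.
parts = [
    Component("relay", base_rate=2.0e-7, pi_environment=2.0, pi_quality=1.5),
    Component("optocoupler", base_rate=5.0e-8, pi_stress=1.2),
]
total = equivalent_failure_rate(parts)
print(f"equivalent failure rate: {total:.3e} per hour")
```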

Third step  : failure modes effects and criticality analysis (FMECA)

Once the functional analysis has identified the components (hardware and software) that fulfil the functions, the failure modes of these components and their effects on the system's operation must be analysed. The purpose of this stage is to analyse the failures in order to identify “ dangerous ” failure modes, and to quantify the probability of failure occurrence.

The Failure Modes Effects and Criticality Analysis (FMECA) is conducted at electronic component detail level for the safety device. The purpose of this analysis is :

  •  to identify the “ dangerous ” failure modes to assess the “ dangerous ” failure rates leading to the hazardous event, while assessing a coverage rate for the various tests;
  • to identify the possible preventive maintenance provisions to be integrated to guarantee a safety integrity level in compliance with the defined goals.

Failures are classified in 4 classes  :

  • dangerous detected failures whose effects are on safety and availability (λDD),
  • dangerous un-detected failures whose effects are only on safety (λDU),
  • non-dangerous detected failures whose effects are only on availability (λSD),
  • non-dangerous and undetected failures whose effects are only on availability (λSU).

(λDU = λ Dangerous, Undetected ; λS = λ Safe).

λS = Safe failure : i.e. a failure that results in system fallback (safe situation for safety).

λDU = Unsafe failure : failure whose consequence leads to a dangerous state from the standpoint of safety.
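
The classification above can be sketched as a simple aggregation of FMECA rows into the four classes. The rows, the rates and the helper names below are hypothetical. The coverage rate is taken here as the share of dangerous failures that are detected, λDD / (λDD + λDU), which is an assumption about how the coverage of the tests is expressed.

```python
from collections import defaultdict

# Each FMECA row: (component, failure mode, rate per hour, dangerous?, detected?)
# All rows are hypothetical illustration values.
fmeca_rows = [
    ("relay", "contact welded", 3.0e-7, True, False),
    ("relay", "coil open", 3.0e-7, False, True),
    ("input card", "stuck at 1", 1.0e-7, True, True),
    ("input card", "drift", 1.0e-7, False, False),
]


def classify(dangerous: bool, detected: bool) -> str:
    """Map a failure mode to one of the four classes: DD, DU, SD or SU."""
    return ("D" if dangerous else "S") + ("D" if detected else "U")


def sum_by_class(rows):
    """Total failure rate per class, keyed as lambda_DD, lambda_DU, lambda_SD, lambda_SU."""
    totals = defaultdict(float)
    for _component, _mode, rate, dangerous, detected in rows:
        totals["lambda_" + classify(dangerous, detected)] += rate
    return dict(totals)


totals = sum_by_class(fmeca_rows)
lambda_d = totals["lambda_DD"] + totals["lambda_DU"]
# Coverage rate of the tests: share of dangerous failures that are detected.
coverage = totals["lambda_DD"] / lambda_d
print(totals, f"coverage = {coverage:.2f}")
```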

The following diagram (Figure 1) gives further details of how failures are distributed according to their effect. The objective of this stage is to define the unsafe failure modes. References (28) and (29) are examples of data sources for the failure mode distribution of various components.

 

Figure 1  : Failure distribution according to their effect

Fourth step : modelling of the system's various states

Three types of system are encountered :

[1]  Failsafe systems

[2]  Non-redundant systems

[3]  Redundant systems

The system's dangerous failure probability calculation is different according to the various types of system.

Failsafe systems

Failsafe systems are systems in which the failure modes of all components lead to a “ safe state ” in relation to safety. For these systems, there is no point in calculating the dangerous failure probability, as there is no dangerous failure rate λDU.

Non-redundant systems

Non-redundant systems are “ simple ” systems in which the safety function can be lost in the event of failure. Two states are possible : safe state or dangerous state. For these systems, the calculation of the dangerous failure probability comes down to a reliability calculation based on the dangerous failure rate (λDU, identified in the FMECA) over a duration equal to the interval between preventive maintenance operations.
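
A minimal sketch of this reliability calculation, assuming a constant failure rate (exponential law) and hypothetical values for λDU and for the maintenance interval, could look as follows. The function and variable names are illustrative only.

```python
import math


def dangerous_failure_probability(lambda_du: float, t_hours: float) -> float:
    """Probability that an undetected dangerous failure has occurred by time t,
    for a constant failure rate (exponential law)."""
    return 1.0 - math.exp(-lambda_du * t_hours)


lambda_du = 1.0e-6             # hypothetical dangerous undetected failure rate, per hour
maintenance_interval = 4380.0  # hypothetical interval between maintenance operations (about 6 months)

p = dangerous_failure_probability(lambda_du, maintenance_interval)
print(f"P(dangerous state at end of interval) = {p:.3e}")
# For small lambda_du * t the result is close to the linear approximation lambda_du * t.
print(f"linear approximation                  = {lambda_du * maintenance_interval:.3e}")
```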

Redundant systems

In the case of redundant systems, the safety function can be lost due to combinations of failures, depending on the logic implemented within the safety system. There are several procedures for the quantitative assessment of the safety integrity level of such systems. The main drawback of the more traditional procedures, such as fault tree analysis or reliability block diagram analysis, is that they do not always take into account the time aspect, test periodicity, coverage levels, or the repair rate.

The various failure and operating states can be modelled with MARKOV graphs, by integrating the time aspect of the preventive maintenance tests, the autotests as well as the coverage rate, as the electronic systems are subject to a failure law of exponential form with a constant failure rate.

A1.6.4.1 Influence of testability on safety

For safety purposes, the state of the resources must be known at all times in order to see whether there are hidden (or dormant or latent) failures liable to mask the safety function. These dormant failures are only detected during periodic tests deliberately conducted by the user.

A test policy is useless for failsafe systems as each failure leads to a “ safe ” position in relation to safety.

Conversely, for systems that are neither failsafe nor autotestable and on which dangerous failures exist, a test policy to detect the “ dangerous failures ” (with a risk for safety) is required.

These tests must be conducted according to a periodicity grounded on the characteristics of the various elements constituting the system. Dangerous failures can be detected in two ways :

  • Either by the test and autotest system of the safety system, for detectable failures (λDD),
  • Or during verification operations, for non-detectable failures (λDU).

Testability does not increase the PLC's reliability level. It only makes it possible to check that resources are still available : that the inputs can be read and the outputs controlled, on the one hand, and that the processing modules are still functional, on the other hand. Only dangerous failure detection comes into play. Thanks to these tests, a failure can be detected and the system switched to the safe position, which better guarantees safety. The following diagram shows the impact of testability on safety, and in particular the impact of a state changeover test policy conducted every 24 hours or every 6 months.

Figure 2  : Testability impact on safety
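
The effect of the test period can also be illustrated numerically. The sketch below assumes a constant failure rate and assumes that every periodic test finds and repairs all dangerous failures, so that the probability returns to zero after each test. Both are simplifying assumptions, and the rate used is hypothetical.

```python
import math


def pfd_with_periodic_test(lambda_du: float, t_hours: float, test_interval: float) -> float:
    """Instantaneous probability of being in the dangerous state at time t,
    assuming every periodic test finds and repairs all dangerous failures."""
    time_since_last_test = t_hours % test_interval
    return 1.0 - math.exp(-lambda_du * time_since_last_test)


def peak_pfd(lambda_du: float, test_interval: float) -> float:
    """Worst case, reached just before the next test."""
    return 1.0 - math.exp(-lambda_du * test_interval)


lambda_du = 1.0e-6  # hypothetical rate, per hour
for label, interval in (("24 hours", 24.0), ("6 months", 4380.0)):
    print(f"test every {label}: peak probability = {peak_pfd(lambda_du, interval):.2e}")
```

With these assumptions, the peak probability grows roughly in proportion to the test interval as long as the product of the rate and the interval remains small.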

Graph establishment

IEC 61508 (18) and reference (30) stipulate the procedure and various stages of system modelling. State graphs are represented below for each safety function. Modelling is achieved with “ states ” that the system is liable to enter. There are 3 states in most cases :

State 2, represented as follows : (2)

This state corresponds to the modelling of redundancy. In this state, all implemented resources are present and operate in a nominal manner.

State 1, represented as follows : (1)

This state corresponds to the modelling of redundancy downgraded by the dangerous failure of a hardware element on one of the two channels. In this state, not all implemented resources are present. It is an undetected dangerous failure state. Safety is still guaranteed.

State 0, represented as follows : (0)

This state corresponds to the modelling of the loss of redundancy due to the dangerous failure of several hardware elements from the channels. In this state, safety is no longer guaranteed and in the event that the safety function is called upon, the system will not go to safe position.

The “ P ” probability of being in the “ 0 ” state is designated by PFD(t) in the IEC 61508 standard. PFD(t) thus has the meaning given in the previous paragraph : the probability that, at time t, the system is in a state where safety is no longer guaranteed.

Assumptions

The MARKOV graph modelling of the systems studied by INERIS was based on the following assumptions :

[1]  Failure rates (λ) and repair rates (μ) are assumed constant to make it possible to model and calculate the safety level with MARKOV graphs.

[2]  The mission time (TI) corresponds to the interval between the OFF LINE periodic tests. All test rates concerning the aptitude to detect state changeovers (μPTi) are stated for each arc of each graph.

[3]  Inputs and outputs do not go to the safe state if the power supply is cut off.

[4]  The common failure modes and the systematic errors are assumed equal to those defined in reference (28). λD common mode failures or faults have the specificity of affecting all lines at the same time. The selected values are those defined in the same document.

System modelling example

A system with two channels in active redundancy is modelled as follows :

Figure 3  : Redundant system state modelling

 

This graph is equivalent to the following graph :

Figure 4  : Redundant system state reduced modelling

The “ P ” probability of being in the “ 0 ” state is therefore the product of a failure rate Λ(t), which itself depends on time, and of the time T : P = Λ(t) × T.

This example shows that the longer the time T, the higher the probability of being in the “ 0 ” state.
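
The behaviour described above can be reproduced with a small numerical sketch of a three-state graph (2, 1, 0). It is deliberately simplified and is not the graph used in the study : it assumes two identical channels with the same hypothetical rate λDU, and it ignores repairs, tests, detected failures and common mode failures. Under these assumptions the probability of state 0 has the closed form (1 - exp(-λDU × T))², which for small λDU × T is close to (λDU² × T) × T, that is, a time-dependent rate multiplied by T.

```python
import math


def markov_step(p, rates, dt):
    """One explicit Euler step of dP/dt for a continuous-time Markov graph.
    rates[(i, j)] is the transition rate from state i to state j."""
    new_p = dict(p)
    for (src, dst), rate in rates.items():
        flow = p[src] * rate * dt
        new_p[src] -= flow
        new_p[dst] += flow
    return new_p


def probability_of_state_0(lambda_du, horizon_hours, steps=10_000):
    # States follow the text: 2 = nominal, 1 = one channel failed, 0 = safety lost.
    rates = {(2, 1): 2.0 * lambda_du,  # either of the two channels fails
             (1, 0): lambda_du}        # the remaining channel fails
    p = {2: 1.0, 1: 0.0, 0: 0.0}
    dt = horizon_hours / steps
    for _ in range(steps):
        p = markov_step(p, rates, dt)
    return p[0]


lambda_du = 1.0e-5  # hypothetical rate, per hour
for horizon in (24.0, 4380.0):
    numeric = probability_of_state_0(lambda_du, horizon)
    closed_form = (1.0 - math.exp(-lambda_du * horizon)) ** 2
    print(f"T = {horizon:6.0f} h: P(state 0) = {numeric:.3e} (closed form {closed_form:.3e})")
```

The Euler integration is used only to keep the sketch free of dependencies. A matrix exponential or a dedicated solver would be the more usual choice for larger graphs.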

 
