Systematic Analysis of Faults and Errors

SAFE, or Systematic Analysis of Faults and Errors, is a hazard analysis process for software- and hardware-based systems that is based on (and designed to work with) System Theoretic Process Analysis (STPA). While STPA excels at the analysis of sociotechnical systems, my research looks at the application of SAFE to non-sociotechnical elements of systems. There are two documentation formats for the analysis: a manual version (M-SAFE), which uses spreadsheets, and a tool-supported version (T-SAFE), which uses AADL to integrate the analysis with a semiformal description of a system’s architecture. The full explanation of SAFE is available in Chapter 4 of my dissertation.

Overview

Process

At its core, SAFE is a backwards-chaining analysis consisting of three activities, numbered 0 through 2 (items 1 through 3 in the list below). Activity 0 is performed once per system, while Activities 1 and 2 are performed recursively throughout a system’s specification:

  1. Defining a system’s fundamentals, which consists of:
    1. Specifying notions of loss (accidents), ways those losses can occur (hazards), and constraints that prevent the occurrences (safety constraints).
    2. Specifying the individual system elements (components and connections) that make up the system and how they fit into the system’s control structure.
  2. Examining an individual element’s interactions, by:
    1. Defining the element’s individual notion of loss / undesirability (its successor dangers).
    2. Considering whether various classes of input errors (e.g., messages that are late, early, never arrive, or have a wrong value) would be unsafe (termed manifestations).
    3. Mapping the element’s manifestations to its successor dangers: these are the element’s unsafe interactions.
  3. Examining an individual element’s internal faults, by:
    1. Considering the removal of entire classes of internally-caused problems (faults) based on various properties of the element (e.g., if it’s implemented in hardware or software, if other controllers interact with it, etc.).
    2. Considering whether various classes of faults (e.g., development faults, adversaries gaining access to the element, or physical deterioration) would be unsafe.
    3. Mapping the unsafe faults to the element’s successor dangers: these are the element’s internally-caused dangers.
Figure: SAFE Overview
Analysis of an individual element in SAFE, which assumes that Activity 0 has already been performed. Note that after Activity 1’s second step (1.2) is complete, there are three options: Activity 1 step 3, Activity 2, or chaining backwards (→). The solid rectangle represents the component under analysis, while dashed lines represent predecessor (right) and successor (left) components. Black lightning bolts are potential dangers; grey bolts are those the analyst determines to be not dangerous or not possible.
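
The recursive, backwards-chaining shape of Activity 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not part of the SAFE specification: the `Element` class, the `analyze` function, the guideword identifiers, and the `is_unsafe` callback (which stands in for the analyst's judgment) are all hypothetical names introduced here.

```python
from dataclasses import dataclass, field

# Input-error classes named in the text; the identifiers are illustrative.
GUIDEWORDS = ("late", "early", "halted", "wrong value")


@dataclass
class Element:
    name: str
    predecessors: list = field(default_factory=list)  # elements feeding this one


def analyze(element, successor_dangers, is_unsafe, results=None):
    """Apply Activity 1 to an element, then chain backwards to its predecessors."""
    if results is None:
        results = {}
    if element.name in results:  # already analyzed (control loops are cyclic)
        return results
    # Step 2: keep the guidewords the analyst judges unsafe (manifestations).
    manifestations = [g for g in GUIDEWORDS if is_unsafe(element, g)]
    results[element.name] = {
        "successor_dangers": list(successor_dangers),
        "manifestations": manifestations,
    }
    # Chaining backwards: this element's manifestations become the
    # successor dangers of each predecessor.
    inherited = [f"{g} input to {element.name}" for g in manifestations]
    for pred in element.predecessors:
        analyze(pred, inherited, is_unsafe, results)
    return results


# A hypothetical three-element chain inside the system boundary.
sensor = Element("sensor")
controller = Element("controller", [sensor])
actuator = Element("actuator", [controller])


def is_unsafe(element, guideword):
    # Placeholder for the analyst's judgment; here only two classes are unsafe.
    return guideword in ("late", "halted")


results = analyze(actuator, ["system-level hazard"], is_unsafe)
for name, entry in results.items():
    print(name, entry)
```

Analysis starts at the element adjacent to the controlled process, so its successor dangers are the system-level hazards; every other element inherits the manifestations of the element it feeds. Activity 2 (internal faults) is omitted from this sketch.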

Example

As an example, consider an electric tea kettle. The first few steps an analyst would go through when analyzing the system would be:

  1. Fundamentals:
    1. Accident: Kettle base is damaged or destroyed.
    2. Hazard: Kettle base overheats and melts.
    3. Safety Constraint: Heating element should be disabled above 100°C.
    4. Control Structure: The core control loop is shown below. Note that the thermometer, controller and heating element are inside the system boundary, while the kettle base is the controlled process and so is considered part of the environment.
  2. Analysis begins at the first element inside the system boundary: the heating element.
    1. Successor Danger: Running when the kettle’s temperature is ≥ 100°C
      1. Note that the first element’s successor dangers are the system-level hazards; other elements will use their successor’s manifestations.
    2. Manifestations: The following unsafe input guidewords would manifest as the successor danger:
      1. Late: The toggle-state command is late when the heating element is running and the temperature is ≥ 100°C
      2. Halted: The toggle-state command never arrives when the heating element is running
      3. Erratic: An inappropriate toggle-state command arrives when the heating element is off and the temperature is ≥ 100°C
      4. Note that these manifestations assume that the controller produces binary “toggle state” commands. If it produced value-based commands (e.g., run for 10 seconds), then we would also need the “High” manifestation.
  3. At this point the analyst has a choice between the following options (any or all of them can be performed, and further analysis can be done in parallel):
    1. Document Unsafe Interactions: Fully document how the manifestations map to the successor danger and specify any possible detection / compensation steps.
      1. For example, the analyst would explain how the late arrival of a toggle-state command could, in certain worst-case situations, lead to overheating the kettle base. The analyst would then also document any possible compensatory actions, like an auto-off timer.
    2. Chain Backwards: Move analysis to the controller, using the manifestations as its successor dangers.
      1. Here, the analyst would repeat Activity 1, but this time consider the controller. As the controller can’t directly overheat the kettle base, it uses the heating element’s manifestations as its successor dangers. That is, the controller must not produce late commands, stop sending commands, or produce erratic commands.
    3. Consider Faults: Analyze how internal failures of the heating element could cause the successor danger even if the controller’s input is correct.
      1. Here the analyst would perform Activity 2 on the heating element, considering what would happen if, for example, the heating element deteriorated over time.
Figure: Simple Control Loop
The control structure of an electric tea kettle.
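
The "Chain Backwards" option from the kettle example can be sketched as a small transformation over worksheet-style records. The sketch below is an assumption-laden illustration: the dictionary layout, the key names, and the `chain_backwards` helper are hypothetical and are not taken from the M-SAFE worksheets. Only the successor danger and the three manifestations come from the example above.

```python
# Worksheet-style record for the heating element, transcribed from the example.
# The dictionary layout and key names are hypothetical.
heating_element = {
    "element": "heating element",
    "successor_danger": "Running when the kettle's temperature is >= 100°C",
    "manifestations": {
        "Late": "toggle-state command is late while running at >= 100°C",
        "Halted": "toggle-state command never arrives while running",
        "Erratic": "inappropriate toggle-state command arrives while off at >= 100°C",
    },
}


def chain_backwards(record, predecessor):
    """Start the predecessor's record: its successor dangers are the
    manifestations of the element that consumes its output."""
    dangers = [
        f"{guideword}: {text}"
        for guideword, text in record["manifestations"].items()
    ]
    # The predecessor's own manifestations are left for the analyst to fill in.
    return {"element": predecessor, "successor_dangers": dangers, "manifestations": {}}


controller = chain_backwards(heating_element, "controller")
for danger in controller["successor_dangers"]:
    print(danger)
```

The controller's record begins with three successor dangers, one per manifestation of the heating element, matching the statement that the controller must not produce late commands, stop sending commands, or produce erratic commands.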

Manual Process (M-SAFE)

Full Process and Worksheets

There are two documents which should be used to perform SAFE: the full process specification and blank worksheets. Both are available in standalone formats or Google Docs, though the latter should be used if possible since the technique continues to be refined.

Full Example

There is also a fully-worked example based on a medical application called the “PCA Interlock Scenario.” Most of my publications contain short descriptions of the scenario, or there’s a full description of the problem and proposed solution here.

Tool-Assisted Process (T-SAFE)

The annotations and properties used for T-SAFE are provided in a more developer-friendly format in the MDCF-Architect’s documentation.