Software-based fault-tolerance for resilient parallel systems

Sara Royuela and Alessio Medaglini

Cyber-physical systems provide a convenient way to represent complex systems such as the use-cases targeted in the AMPERE project. The uncertainties arising from the interaction between the cyber and the physical worlds, together with randomness in the environment, errors in physical devices and possible security attacks, may severely harm the dependability of these safety-critical systems.

AMPERE has developed a new fault-tolerance component that combines proactive monitoring and parallel replication in order to enhance the resilience and reliability of the system as follows:

  • The proactive monitoring mechanism, based on the observer design pattern, focuses on the early detection of symptoms that may cause either silent faults or erroneous results, by monitoring critical variables using specifically defined predicates.
  • The parallel replication mechanism, based on the OpenMP tasking model, focuses on the detection of erroneous results by defining the functionalities to be replicated in parallel, and the threshold and variables to check in a consensus-and-voting process.
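As a rough illustration of the two mechanisms, the following minimal C++ sketch pairs an observer-style monitor that checks predicates over a critical variable with a function that runs replicas as OpenMP tasks and votes on their results. It is not the AMPERE component: the names `Monitor`, `attach`, `notify` and `replicate_and_vote`, the use of a single `double` as the checked variable, and the strict-majority voting rule are all assumptions made for the example.

```cpp
#include <cassert>
#include <cmath>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Proactive monitoring (observer pattern). Hypothetical API, not AMPERE's:
// each observer holds a named predicate over a critical variable.
class Monitor {
public:
    using Predicate = std::function<bool(double)>;

    void attach(std::string name, Predicate holds) {
        observers_.push_back({std::move(name), std::move(holds)});
    }

    // Called whenever the monitored variable changes; returns the names
    // of the predicates that the new value violates.
    std::vector<std::string> notify(double value) const {
        std::vector<std::string> violated;
        for (const auto& o : observers_)
            if (!o.holds(value)) violated.push_back(o.name);
        return violated;
    }

private:
    struct Observer { std::string name; Predicate holds; };
    std::vector<Observer> observers_;
};

// Parallel replication (assumed voting rule: strict majority).
// Runs `replicas` copies of `fn` as OpenMP tasks; `fn` receives the replica
// index. Two results agree when they differ by at most `threshold`.
// Returns true and stores the agreed value if a strict majority agrees.
inline bool replicate_and_vote(const std::function<double(int)>& fn,
                               int replicas, double threshold,
                               double& agreed) {
    assert(replicas > 0);
    std::vector<double> results(replicas);

    #pragma omp parallel
    #pragma omp single
    {
        for (int r = 0; r < replicas; ++r) {
            // `results` and `fn` are shared; each task gets its own copy of r.
            #pragma omp task firstprivate(r)
            results[r] = fn(r);
        }
        #pragma omp taskwait  // wait for every replica before voting
    }

    for (int i = 0; i < replicas; ++i) {
        int votes = 0;
        for (int j = 0; j < replicas; ++j)
            if (std::fabs(results[i] - results[j]) <= threshold) ++votes;
        if (2 * votes > replicas) { agreed = results[i]; return true; }
    }
    return false;  // no consensus: report the fault to the caller
}
```

Compiled with OpenMP enabled (for example `-fopenmp` with GCC or Clang), the replicas can run on different cores; without it the pragmas are ignored and the replicas run one after another, with the same voting outcome. In this sketch a failed vote only signals that no consensus was reached, and what to do next (retry, degrade, stop) is left to the caller.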

AMPERE has tested these mechanisms both in isolation and combined, on top of the Obstacle Detection and Avoidance System (ODAS) provided by Thales and Università di Siena. The results (Fig. 1) show an accuracy of at least 70% when the two mechanisms are used together, reaching almost 100% in some cases (this varies depending on the phase being replicated or observed, as different functionalities require the detection of errors specific to each phase). Furthermore, the results (Fig. 2) reveal that the overhead introduced by the mechanisms is either negligible, in the case of the observer, or considerably reduced, in the case of the replicas, by using the available resources of the parallel architecture.

Fig. 1. Accuracy of the fault-tolerance mechanisms, in isolation and combined, for different phases of the ODAS use-cases.

 

Fig. 2. Minimum overhead of the fault-tolerance mechanisms compared to the case without them.

 

Overall, the fault-tolerance mechanisms developed in AMPERE can be used together or in isolation, depending on which objectives are to be optimized: accuracy, performance, scalability or programmability.