A Lightweight API for an Adaptive Software Fault Tolerance Using POSIX-Thread Replication

Conference: ARCS 2011 - 24th International Conference on Architecture of Computing Systems
02/22/2011 - 02/23/2011 at Como, Italy

Proceedings: ARCS 2011

Pages: 4
Language: English
Type: PDF

Kouadri, Abdellah; Heron, Olivier; Montagne, Romain (CEA, LIST, Embedded System Reliability Lab, Gif-sur-Yvette, 91191, France)

A Single Event Upset (SEU) is a physical phenomenon that causes a bit flip in an SRAM cell or a voltage transient in combinational gates, especially in deep submicron technologies (e.g. 32 nm). An SEU can be caused by a charged particle (e.g. a cosmic particle) that strikes a transistor region and changes the charge in that region. The failure mode of an SEU is classified as a transient fault: it may have no perceptible effect (benign fault) or, if not detected, may lead to a system failure (crash). Since SEUs mainly cause data corruption in storage elements, parity and error-correcting code (ECC) implementations were proposed for large memories. Another approach is the use of spatial or temporal redundancy. Triple modular redundancy (TMR) is a spatial-redundancy protection approach that masks soft errors by diversifying the execution paths. In addition, design diversification can be used to prevent common-mode failures: the resources have different designs but identical functionality. A variant of TMR uses two identical replicated hardware resources that are tightly synchronized cycle by cycle (lockstep). In practice, however, lockstep and TMR techniques are difficult to implement, and the application execution must be deterministic to prevent the states of the copies from diverging. An alternative to the previous approaches is fault-tolerant simultaneous multi-threading (FT-SMT), a technique based on temporal redundancy.
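To make the masking idea concrete, the following is a minimal sketch of majority voting over three replicated POSIX threads. It is not the API proposed in the paper: the names `replica_t`, `replica_main`, `vote3`, `run_tmr`, and `square` are hypothetical and chosen only for illustration. It also assumes that the replicated function is deterministic and side-effect free, and it says nothing about where the threads are scheduled, so whether the redundancy is spatial or temporal depends on the platform.

```c
#include <pthread.h>
#include <stdint.h>

/* Hypothetical replica descriptor (illustrative, not from the paper). */
typedef struct {
    uint32_t (*fn)(uint32_t); /* computation to replicate */
    uint32_t input;           /* same input for every replica */
    uint32_t output;          /* private result slot of this replica */
} replica_t;

/* Thread entry point: run the computation and store the result. */
void *replica_main(void *arg)
{
    replica_t *r = arg;
    r->output = r->fn(r->input);
    return NULL;
}

/* Bitwise 2-of-3 majority: each output bit takes the value held by
 * at least two of the three inputs, so a corruption confined to one
 * replica is masked. */
uint32_t vote3(uint32_t a, uint32_t b, uint32_t c)
{
    return (a & b) | (a & c) | (b & c);
}

/* Run fn(input) in three threads and vote on the results.
 * Returns 0 on success, -1 if a thread could not be created. */
int run_tmr(uint32_t (*fn)(uint32_t), uint32_t input, uint32_t *result)
{
    replica_t r[3];
    pthread_t t[3];
    int created = 0;

    for (int i = 0; i < 3; i++) {
        r[i].fn = fn;
        r[i].input = input;
        r[i].output = 0;
        if (pthread_create(&t[i], NULL, replica_main, &r[i]) != 0)
            break;
        created++;
    }
    /* Join only the threads that were actually started. */
    for (int i = 0; i < created; i++)
        pthread_join(t[i], NULL);
    if (created != 3)
        return -1;

    *result = vote3(r[0].output, r[1].output, r[2].output);
    return 0;
}

/* Example deterministic workload (assumption for illustration). */
uint32_t square(uint32_t x)
{
    return x * x;
}
```

A caller would invoke `run_tmr(square, 7u, &out)` and use `out` only when the return value is 0. The voter masks a fault that corrupts a single replica's result; it does not protect the voter itself or the shared input, which a real scheme would have to address separately.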