Reliability Bottlenecks in Integrated Parallel Fault-Tolerant Systems

Conference: ARCS 2011 - 24th International Conference on Architecture of Computing Systems
02/22/2011 - 02/23/2011 at Como, Italy

Proceedings: ARCS 2011

Pages: 6
Language: English
Type: PDF

Authors:
Fechner, Bernhard (Department of Mathematics and Computer Science, Parallel Computing and VLSI Group, FernUniversität in Hagen, 58084 Hagen, Germany)

Abstract:
The appearance of multithreaded, multicore, and manycore systems has led to another advance in performance. Such systems are called integrated when there are electrical dependencies between functional units, i.e. multiple cores integrated on one die. Integrated systems raise several questions about fault propagation. First, if one component fails, how likely is it that other components behave faultily, that is, how likely is the fault to propagate between components? Second, what is the overall reliability of such a system? These questions should be answered before implementation, because the total cost of the final product should be as small as possible. Numerous fault models exist, especially at the physical level, that deal with the propagation and development of electrical current. Computation time is essential when simulating faults in large systems. Our approach combines different abstraction levels in one fault model, allowing faults to be modeled in a generalized way: any fault can be modeled if its effect can be defined by an analytical function. The first level is the physical level, which covers the physical effects of a fault; the second is a component and routing model; the third is the behavioral modeling of components by finite state machines. The model can cover the whole range of parallel devices. It can help improve the reliability of current and future parallel fault-tolerant systems by identifying the underlying bottlenecks. The model is demonstrated by applying it to an FPGA, where switchboxes are identified as the main reliability bottleneck.
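
The abstract gives no formulas, so the following is only a minimal illustrative sketch of how three such levels might fit together. It is not the author's model. The exponential decay function, the toy netlist, the threshold, and all names (fault_effect, ROUTING, next_state, propagate, bottleneck) are assumptions made for illustration. The switchbox comes out as the bottleneck here only because the toy topology makes it the hub; the sketch does not reproduce the paper's FPGA result.

```python
import math
from collections import deque

# Level 1 (physical): a hypothetical analytical fault-effect function.
# The effect is assumed to decay exponentially with wire length.
def fault_effect(amplitude, distance, decay=1.0):
    return amplitude * math.exp(-distance / decay)

# Level 2 (component and routing): a toy netlist, assumed for illustration.
# Keys are components; values map each neighbor to a wire length.
ROUTING = {
    "lut_a": {"switchbox": 0.5},
    "lut_b": {"switchbox": 0.5},
    "lut_c": {"switchbox": 0.5},
    "switchbox": {"lut_a": 0.5, "lut_b": 0.5, "lut_c": 0.5},
}

# Level 3 (behavioral): each component is a two-state machine (OK / FAULTY).
# A component becomes FAULTY when the incoming effect reaches the threshold
# and then stays FAULTY.
def next_state(state, effect, threshold):
    if state == "FAULTY" or effect >= threshold:
        return "FAULTY"
    return "OK"

def propagate(origin, amplitude=1.0, threshold=0.5):
    """Return the set of components that end up FAULTY when origin fails.

    Simplification: a component is evaluated with the effect of each
    arriving fault in breadth-first order, and effects are not summed.
    """
    state = {name: "OK" for name in ROUTING}
    state[origin] = "FAULTY"
    queue = deque([(origin, amplitude)])
    while queue:
        node, strength = queue.popleft()
        for neighbor, length in ROUTING[node].items():
            effect = fault_effect(strength, length)
            new_state = next_state(state[neighbor], effect, threshold)
            if state[neighbor] == "OK" and new_state == "FAULTY":
                state[neighbor] = "FAULTY"
                queue.append((neighbor, effect))
    return {name for name, s in state.items() if s == "FAULTY"}

def bottleneck():
    """Rank components by how many other components fail when they fail."""
    damage = {origin: len(propagate(origin)) - 1 for origin in ROUTING}
    return max(damage, key=damage.get), damage

if __name__ == "__main__":
    worst, damage = bottleneck()
    print(worst, damage)
```

In this toy setup a fault originating in a lookup table reaches the switchbox but is too weak after the second hop to affect the other lookup tables, whereas a fault originating in the switchbox reaches every lookup table directly. Ranking components by the number of secondary failures they cause is one simple way to express a reliability bottleneck; other metrics are equally plausible.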