Towards Fault Detection Units as an Autonomous Fault Detection Approach for Future Many-Cores

Conference: ARCS 2011 - 24th International Conference on Architecture of Computing Systems
02/22/2011 - 02/23/2011 at Como, Italy

Proceedings: ARCS 2011

Pages: 4
Language: English
Type: PDF

Weis, Sebastian; Garbade, Arne; Schlingmann, Sebastian; Ungerer, Theo (Institute of Computer Science, University of Augsburg, 86135 Augsburg, Germany)

Within the next 10 years, a single chip might host more than 1000 heterogeneous cores. As transistor sizes continue to shrink, the probability of physical flaws on the chip induced by voltage fluctuations, cosmic rays, thermal changes, or variability in the manufacturing process will rise further. The shrinking structure size of electronic components, in conjunction with the increasing number of transistors on the same chip, makes faults in future many-core systems unavoidable. Although fault tolerance has always been an important part of mission-critical systems, it is new for the upcoming general-purpose many-core processors and brings completely new challenges. While mission-critical systems have always required fault tolerance at any cost, the architecture of general-purpose processors is strongly influenced by economic constraints. This calls for fault tolerance techniques that scale with the number of cores and with the increasing failure probability on a chip while requiring only a reasonable architectural effort.