Multi-Objective Diagnosis of Non-Permanent Faults in Many-Core Systems

Conference: ARCS 2014 - 27th International Conference on Architecture of Computing Systems
02/25/2014 - 02/28/2014 in Luebeck, Germany

Proceedings: ARCS 2014

Pages: 8
Language: English
Type: PDF

Waszecki, Peter; Lukasiewycz, Martin (TUM CREATE, Singapore)
Chakraborty, Samarjit (TU Munich, Germany)

Due to advancing technological processes, many-core systems integrate an ever-growing number of cores on a single silicon die. At the same time, shrinking circuit geometries increase the susceptibility to hardware faults. This paper proposes a novel approach to detect defective cores in a many-core system that show an increased occurrence of intermittent faults. In contrast to transient faults, which are caused by environmental phenomena, intermittent faults occur due to stressed resources and are often a precursor of permanent faults. The proposed early fault diagnosis allows precautionary measures to be taken before a permanent fault can durably damage a component in a many-core system. In this paper, we present a multi-objective approach that can implicitly detect an affected core by diagnosing its intermittent faults while taking distributed applications and their dependencies into account. The implicit approach makes it possible to waive explicit tests, which considerably reduces the number of plausibility test functions and thus lowers the resource load. We propose four different implementations of our early fault diagnosis, which are compared and evaluated in terms of runtime and detectability. The experimental results give evidence of the feasibility and good scalability of our approach.