High Performance Dense Linear System Solver with Resilience to Multiple Soft Errors
Peng Du, Piotr Luszczek, Jack Dongarra
In the multi-petaflop era for supercomputers, the number of computing cores is growing exponentially. However, with integrated circuit technology scaling below 65 nm, the critical charge required to flip a gate or a memory cell is dangerously reduced. Combined with higher vulnerability to cosmic radiation, this makes soft errors all but inevitable for modern supercomputer systems. As a result, for long running applications on high-end machines, including linear solvers for dense matrices, soft errors have become a serious concern. The classical checkpoint and restart (C/R) scheme loses effectiveness against this threat because soft errors are difficult to detect: they take the form of transient bit flips that do not interrupt program execution and therefore leave no trace of their occurrence. Current research on soft error resilience for dense linear solvers offers limited capability on large scale computing systems, which suffer from both round-off error in floating point arithmetic and the occurrence and subsequent propagation of multiple soft errors. Error correcting codes based on Galois fields incur a high computational cost for recovery. This work proposes a fault tolerant algorithm for a dense linear system solver that is resilient to multiple spatial and temporal soft errors. The algorithm is designed to work with floating point data and is capable of recovering the solution of Ax = b from multiple soft errors that affect any part of the matrix during computation. Additionally, the computational complexity of error detection and recovery is optimized through novel methods. Experimental results on cluster systems confirm that the proposed fault tolerance functionality can successfully detect and locate soft errors and recover the solution of the linear system. The performance impact is negligible, and the soft error resilient algorithm scales well on large scale systems.
Published 2011-10-02 04:00:00 as ut-cs-11-683 (ID:45)
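
To make concrete what detecting, locating, and recovering from a soft error in floating point data can look like, the sketch below uses the classical checksum idea from algorithm-based fault tolerance: each matrix column carries a plain sum and a row-weighted sum. This is an illustration only, not the algorithm proposed in the report. It assumes at most one corrupted entry per column and does not follow errors as they propagate through a factorization. The function names, the weights 1..n, and the tolerance `tol` are hypothetical choices made for this example.

```python
def column_checksums(a):
    """Return (plain, weighted) checksums for each column of a row-major matrix.

    plain[j]    = sum_i a[i][j]
    weighted[j] = sum_i (i + 1) * a[i][j]   # weights 1..n are an assumption
    """
    n_rows, n_cols = len(a), len(a[0])
    plain = [sum(a[i][j] for i in range(n_rows)) for j in range(n_cols)]
    weighted = [
        sum((i + 1) * a[i][j] for i in range(n_rows)) for j in range(n_cols)
    ]
    return plain, weighted


def detect_and_correct(a, plain, weighted, tol=1e-8):
    """Repair at most one corrupted entry per column, in place.

    Returns the list of (row, col) positions that were repaired.
    Deviations below the relative tolerance are treated as round-off
    and ignored, so very small corruptions go undetected.
    """
    new_plain, new_weighted = column_checksums(a)
    repaired = []
    for j in range(len(a[0])):
        # A single error of size d at row r shifts the plain sum by d
        # and the weighted sum by (r + 1) * d.
        d = new_plain[j] - plain[j]
        scale = max(1.0, abs(plain[j]))
        if abs(d) <= tol * scale:
            continue  # column is consistent up to round-off
        row = round((new_weighted[j] - weighted[j]) / d) - 1
        if not 0 <= row < len(a):
            raise ValueError(f"column {j}: more than one error suspected")
        a[row][j] -= d  # undo the corruption
        repaired.append((row, j))
    return repaired
```

In use, the checksums are computed once from trusted data, and `detect_and_correct` is called later to verify the matrix. Because the data is floating point, the comparison relies on a tolerance rather than exact equality, which is the kind of constraint that makes exact Galois field codes a poor fit for this setting.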