EECS Publication

Algorithm-based Fault Tolerance for Dense Matrix Factorizations

Peng Du, Aurelien Bouteiller, George Bosilca, Thomas Herault, Jack Dongarra

Dense matrix factorizations such as LU, Cholesky, and QR are widely used in scientific applications that require solving systems of linear equations, eigenvalue problems, and linear least squares problems. Such computations are normally carried out on supercomputers, whose ever-growing scale causes a rapid decrease in the Mean Time To Failure (MTTF). This paper proposes a new algorithm-based fault tolerance (ABFT) approach designed to survive fail-stop failures during dense matrix factorizations under extreme conditions, such as the absence of any reliable component and the possibility of losing both data and checksum in a single failure. Both the left and right factorization results are protected by ABFT algorithms, and fault-tolerant algorithms derived from this solution can be applied directly to a wide range of dense matrix factorizations with minor modifications. Theoretical analysis shows that the overhead decreases sharply as the number of computing units and the problem size grow. We implemented an ABFT version of LU based on ScaLAPACK as a demonstration. Experimental results on the Kraken supercomputer validate the theoretical evaluation.
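The checksum idea that ABFT methods in general build on can be illustrated with a small sketch. This is not the algorithm from the report: it ignores pivoting, block-cyclic data distribution, and loss of the checksum itself, and every function name below is hypothetical. It only shows that a row-checksum column carried through Gaussian elimination remains a valid checksum of the upper-triangular result, so that a single lost column can be rebuilt from the survivors.

```python
from fractions import Fraction

def with_row_checksum(a):
    """Append one checksum column holding the sum of each row."""
    return [row + [sum(row)] for row in a]

def eliminate(m):
    """Gaussian elimination without pivoting (assumes nonzero pivots).

    The row updates are applied to every column, including the checksum
    column, so the checksum relation is preserved at each step.
    """
    n = len(m)
    for k in range(n):
        for i in range(k + 1, n):
            factor = m[i][k] / m[k][k]
            for j in range(k, len(m[i])):
                m[i][j] -= factor * m[k][j]
    return m

def recover_column(m, lost):
    """Rebuild column `lost` from the checksum and the surviving columns."""
    n = len(m)  # the checksum sits at column index n
    for row in m:
        survivors = sum(row[j] for j in range(n) if j != lost)
        row[lost] = row[n] - survivors
    return m

# Illustrative 3x3 input; exact rational arithmetic avoids rounding noise.
a = [[Fraction(x) for x in row] for row in [[4, 3, 2], [8, 7, 9], [4, 6, 5]]]
u = eliminate(with_row_checksum(a))
expected = [row[:] for row in u]

# Simulate a fail-stop failure that wipes out column 1, then recover it.
for row in u:
    row[1] = None
recover_column(u, 1)
```

After recovery, `u` matches `expected`, and the last entry of each row still equals the sum of the first three. In floating-point arithmetic the same relation holds only up to rounding error.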

Published 2011-07-28 04:00:00 as ut-cs-11-676 (ID:38)

ut-cs-11-676.pdf

The University of Tennessee, Knoxville

Min H. Kao Department of Electrical Engineering & Computer Science
