EECS Publication
Soft Error Resilient QR Factorization for Hybrid System
Peng Du, Piotr Luszczek, Stanimire Tomov, Jack Dongarra
As the general purpose graphics processing units (GPGPU) are increasingly deployed for scientific computing for its raw performance advantages compared to CPUs, the fault tolerance issue has started to become more of a concern than before when they were exclusively used for graphics applications. The pairing of GPUs with CPUs to form a hybrid computing systems for better flexibility and performance creates a massive amounts of computations that have a higher possibility to be affected by transient error - a soft error that silently modifies data causing errors to pass unnoticed. This is despite the fact that the newest Fermi generation of GPUs from NVIDIA are equipped with error correcting units to protect their memories. This problem is particularly serious for applications that employ numerical linear algebra since large sections of data are often modified between steps, and therefore even a single error could eventually propagate into a large area of result. In order to give protection to dense linear algebra computations on such hybrid systems, we developed an algorithm that is resilient to soft errors. We chose the right-looking Householder QR factorization as a demonstration of our algorithm for a hybrid system that features both GPUs and CPUs. Algorithm based fault tolerance (ABFT) is used to protect from errors in the trailing matrix and the right factor, while a checkpointing method is used to ensure the left factor is error-free. This work is based on a previous study of fault tolerance in matrix factorizations. Our contribution includes(1) a stable multiple-error checkpointing and recovery mechanism for the left-factor, which is also scalable in performance in the hybrid execution environment and does not cause severe performance degradation. (2) optimized Givens rotation utilities on the GPU to efficiently reduce an upper Hessenberg matrix to upper triangular form, and (3) a recovery algorithm based on QR update inside a hybrid system. Experimental results show that, our fault tolerant QR factorization can successfully detect and correct data altered by soft errors in both the left and right factors and we observe a decreasing percentage of overhead as the matrix size grows.
Published 2011-07-01 04:00:00 as ut-cs-11-675 (ID:37)