
EECS Publication

Transient Error Resilient Hessenberg Reduction on GPU-based Hybrid Architectures

Yulu Jia, Piotr Luszczek, and Jack Dongarra

Graphics Processing Units (GPUs) are gaining widespread usage in scientific computing owing to the performance boost they bring to computation-intensive applications. The typical configuration integrates GPUs and CPUs in the same system: the CPUs handle the control flow and part of the computation workload, while the GPUs serve as accelerators that carry out the bulk of the data-parallel compute workload. In this paper we design and implement a soft-error-resilient Hessenberg reduction algorithm on GPU-based hybrid platforms. Our design employs algorithm-based fault tolerance, diskless checkpointing, and reverse computation. We detect and correct soft errors on-line, rather than delaying detection and correction until the end of the factorization. By utilizing the idle time of the CPUs and overlapping the host-side and GPU-side workloads, we minimize the observed overhead. Experimental results validate our design philosophy. Our algorithm introduces less than 2% performance overhead compared to the non-fault-tolerant hybrid Hessenberg reduction algorithm.
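As a rough illustration of the checksum idea behind algorithm-based fault tolerance, the sketch below extends a small matrix with row and column checksums, then uses them to locate and repair a single corrupted entry. It is not the report's algorithm: the report maintains its protection through the Hessenberg reduction on a CPU/GPU system, whereas this example only checks a static matrix. The function names `with_checksums`, `locate_error`, and `correct` are hypothetical and chosen for this example.

```python
def with_checksums(a):
    """Return a copy of matrix a with an extra row-sum column and column-sum row."""
    rows = [list(row) + [sum(row)] for row in a]
    rows.append([sum(col) for col in zip(*rows)])
    return rows


def locate_error(c, tol=1e-9):
    """Return (i, j, delta) for a single corrupted data entry, or None if consistent.

    Assumes at most one data entry (not a checksum entry) has been corrupted.
    """
    n, m = len(c) - 1, len(c[0]) - 1
    bad_rows = [i for i in range(n) if abs(sum(c[i][:m]) - c[i][m]) > tol]
    bad_cols = [j for j in range(m)
                if abs(sum(c[i][j] for i in range(n)) - c[n][j]) > tol]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) != 1 or len(bad_cols) != 1:
        raise ValueError("checksums do not match a single data-entry error")
    i, j = bad_rows[0], bad_cols[0]
    # The mismatch between the recomputed row sum and the stored one is the error.
    delta = sum(c[i][:m]) - c[i][m]
    return i, j, delta


def correct(c, tol=1e-9):
    """Repair a single corrupted entry in place; return what was found."""
    hit = locate_error(c, tol)
    if hit is not None:
        i, j, delta = hit
        c[i][j] -= delta
    return hit


if __name__ == "__main__":
    c = with_checksums([[1.0, 2.0], [3.0, 4.0]])
    c[1][0] += 5.0          # simulate a soft error (bit flip) in one entry
    print(correct(c))       # location and magnitude of the injected error
    print(c[1][0])          # entry after correction
```

The intersection of the one inconsistent row and the one inconsistent column pinpoints the faulty entry, and the size of the checksum mismatch gives the correction. A production scheme would also have to account for floating-point round-off when choosing the tolerance and for errors that strike the checksums themselves.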

Published  2013-06-24 04:00:00  as  ut-cs-13-712 (ID:24)

ut-cs-13-712.pdf
