EECS Publication
Algorithm-Based Checkpoint-Free Fault Tolerance for Parallel Matrix Computations on Volatile Resources
Zizhong Chen and Jack Dongarra
As the desire of scientists to perform ever larger computations drives the size of today's high-performance computers from hundreds to thousands, and even to tens of thousands of processors, node failures in these computers are becoming frequent events.
Published 2005-04-20 04:00:00 as ut-cs-05-561 (ID:163)