
EECS Publication

Fault Tolerance Techniques for High-performance Computing

Jack Dongarra and Thomas Herault and Yves Robert

This report provides an introduction to resilience methods. The emphasis is on checkpointing, the de facto standard technique for resilience in High Performance Computing. We present the two main protocols, namely coordinated checkpointing and hierarchical checkpointing. Then we introduce performance models and use them to assess the performance of these protocols. In particular, we cover the Young/Daly formula for the optimal checkpointing period. Next we explain how the efficiency of checkpointing can be improved via fault prediction or replication. Then we move to application-specific methods, such as algorithm-based fault tolerance (ABFT). We conclude the report by discussing techniques to cope with silent errors (or silent data corruption).
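As a rough illustration of the Young/Daly formula mentioned in the abstract, the sketch below computes the commonly quoted first-order approximation of the optimal checkpointing period, sqrt(2 * C * mu), where C is the time to write one checkpoint and mu is the platform mean time between failures (MTBF). The function name, parameter names, and numeric values are assumptions chosen for illustration and are not taken from the report, which may use different notation and refinements.

```python
import math


def young_daly_period(checkpoint_cost: float, mtbf: float) -> float:
    """First-order Young/Daly approximation of the optimal checkpointing period.

    checkpoint_cost: time to write one checkpoint (C), in seconds.
    mtbf: platform mean time between failures (mu), in seconds.
    Both names are hypothetical; the approximation assumes C is small
    compared to mu.
    """
    if checkpoint_cost <= 0 or mtbf <= 0:
        raise ValueError("checkpoint_cost and mtbf must be positive")
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


# Hypothetical platform: 10-minute checkpoints, one failure per day on average.
period = young_daly_period(checkpoint_cost=600.0, mtbf=86400.0)
print(f"checkpoint roughly every {period / 3600.0:.1f} hours")
```

With these assumed values the period comes out to about 2.8 hours. The approximation is only meaningful when the checkpoint cost is much smaller than the MTBF.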

Published  2015-05-18 04:00:00  as  ut-eecs-15-734 (ID:592)

ut-eecs-14-734.pdf
