EECS Publication
A Proposal for User-Level Failure Mitigation in the MPI-3 Standard
Wesley Bland, George Bosilca, Aurelien Bouteiller, Thomas Herault, and Jack Dongarra
This chapter describes a flexible approach that provides process fault tolerance by allowing the application to react to failures while maintaining a minimal execution path in failure-free executions. The focus is on returning control to the application by avoiding deadlocks due to failures within the MPI library. No implicit, asynchronous error notification is required. Instead, functions are provided that allow processes to invalidate any communication object, preventing any process from waiting indefinitely on calls involving the invalidated objects. We consider the proposed set of functions a minimal basis that allows libraries and applications to increase their fault tolerance capabilities by supporting additional types of failures, and to build other desired strategies and consistency models for tolerating faults.
Published 2012-02-24 05:00:00 as ut-cs-12-693 (ID:5)