Optimal Checkpointing Period: Time vs. Energy
Guillaume Aupy, Anne Benoit, Thomas Herault, Yves Robert, and Jack Dongarra
This short paper deals with parallel scientific applications that use non-blocking, periodic coordinated checkpointing to enforce resilience. We provide a model and detailed formulas for total execution time and consumed energy. We characterize the optimal checkpointing period for both objectives, and we assess the range of time/energy trade-offs by instantiating the model with a set of realistic scenarios for Exascale systems. We place particular emphasis on I/O transfers, because the relative cost of communication is expected to increase dramatically, in terms of both latency and consumed energy, on future Exascale platforms.
Published 2013-10-14 04:00:00 as ut-eecs-13-718 (ID:574)