Skip to content Skip to main navigation Report an accessibility issue

EECS Publication

Real Number Codes with Optimal Numerical Stability

Zizhong Chen

Today s long running high performance computing applications typically tolerate fail-stop failures by checkpointing. However, applications such as dense linear algebra computations often modify a large mount of memory between checkpoints and checkpointing usually introduces considerable overhead when the number of processors used for computation is large. It has been demonstrated in [13] that single fail-stop failure in ScaLAPACK matrix multiplication can be tolerated without checkpointing at a decreasing overhead rate of 1= pp, where p is the number of processors used for computation. Multiple simultaneous processor failures can be tolerated without checkpointing by encoding matrices using a real-number erasure correction code. However, the oating-point representation of a real number in today's high performance computers introduces round offff errors which can be enlarged and cause the loss of precision of possibly all digits during recovery when the number of processors in the system is large. In this paper, we present a class of Reed-Solomon style real-number erasure correcting codes which is numerically optimal during recovery. We analytically construct the numerically best erasure correcting codes for 2 erasures and develop an approximation method to computationally construct numerically good codes for 3 or more erasures. We prove that it is impossible even for the numerically best minimum redundancy erasure correcting codes to correct all erasure patterns when the total number of processors is large. We give the conditions that guarantee to correct all two erasures. Experimental results demonstrate that the proposed codes are numerically much more stable than existing codes.

Published  2009-04-28 04:00:00  as  ut-cs-09-641 (ID:73)

ut-cs-09-641.pdf

« Back to Listing