EECS Publication
A Failure Correction Technique for Parallel Storage Devices with Minimal Device Overhead
James S. Plank, Joel Friedman, and Kai Li
A common technique for providing reliability in parallel storage designs, network file systems, and diskless checkpointing systems is the N + 1-Parity approach. This approach is simple in coding, but requires an excess number of additional 'checksum' storage devices to recover more than one arbitrary device failure. This paper presents a general method to recover from the failure of m arbitrary storage devices with the addition of exactly m checksum devices. The method is an application of Reed-Solomon codes, and can be viewed as a generalization of N + 1-Parity. This paper has two goals concerning this algorithm. First, it provides a complete specification of how to code this problem with this algorithm. To the authors' knowledge, this is the first such specification. Second, we have implemented the coding and recovery algorithm in software and shown that the method is efficient, general, and practical.
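As a point of reference for the N + 1-Parity baseline that the abstract generalizes, the following is a minimal sketch, not the paper's algorithm. It assumes the checksum device stores the bytewise XOR of N equal-length data blocks; the function names `parity` and `recover` and the sample data are hypothetical and chosen for illustration only.

```python
from functools import reduce

def parity(blocks):
    """XOR equal-length byte blocks together to form one checksum block."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def recover(surviving_blocks, checksum):
    """Rebuild the single missing data block from the survivors and the checksum."""
    # XOR-ing the checksum with every surviving block cancels them out,
    # leaving only the contents of the lost block.
    return parity(surviving_blocks + [checksum])

# Hypothetical example: N = 3 data devices plus 1 checksum device.
data = [b"disk", b"fail", b"safe"]
checksum = parity(data)

# Suppose device 1 is lost; the other two devices and the checksum restore it.
rebuilt = recover([data[0], data[2]], checksum)
assert rebuilt == data[1]
```

A single XOR checksum can restore only one lost block. Tolerating m arbitrary failures with exactly m checksum devices is the case the Reed-Solomon method described above addresses.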
Published 1994-08-01 05:00:00 as ut-cs-94-243 (ID:450)