A Scalable High Performant Cholesky Factorization for Multicore with GPU Accelerators
Hatem Ltaief, Stanimire Tomov, Rajib Nath, Peng Du, and Jack Dongarra
We present a Cholesky factorization for multi-core with GPU accelerators. The challenges in developing scalable high performance algorithms for these emerging systems stem from their heterogeneity, massive parallelism, and the huge gap between the GPUs' compute power vs the CPU-GPU communication speed. We show an approach that is largely based on software infrastructures that have already been developed for homogeneous multicores and hybrid GPU-based computing. The algorithm features two levels of nested parallelism. A coarse-grained parallelism is provided by splitting the computation into tiles for concurrent execution between GPUs. A fine-grained parallelism is further provided by splitting the work-load within a tile for high efficiency computing on GPUs but also, in certain cases, to benefit from hybrid computations by using both GPUs and CPUs. Our resulting computational kernels are highly optimized. An efficient task scheduling mechanism ensures a load balanced execution over the entire multi-core with GPU accelerators system. Furthermore, the communication overhead is minimized by trading off the amount of memory allocated on GPUs. This results in a scalable hybrid Cholesky factorization of unprecedented performance. In particular, using NVIDIA's Tesla S1070 (4 C1060 GPUs, each with 30 cores @1.44 GHz) connected to two dual-core AMD Opteron @1.8GHz processors, we reach up to 1.163 TFlop/s in single and up to 275 GFlop/s in double precision arithmetic. Compared with the performance of the embarrassingly parallel xGEMM over four GPUs, where no communication between GPUs are involved, our algorithm still runs at 73% and 84% for single and double precision arithmetic respectively.
Published 2009-11-26 05:00:00 as ut-cs-09-646 (ID:78)