EECS Publication
An Improved Parallel Singular Value Algorithm and Its Implementation for Multicore Hardware
Azzam Haidar and Piotr Luszczek and Jakub Kurzak and Jack Dongarra
Abstract The enormous gap between the high-performance capabilities of today's CPUs and off-chip communication poses extreme challenges to the development of numerical software that is scalable and achieves high performance. In this article, we describe a successful methodology to ad- dress these challenges-starting with our algorithm design, through kernel optimization and tuning, and finishing with our programming model. All these lead to development of a scalable high-performance Singular Value Decomposition (SVD) solver. We developed a set of highly optimized kernels and combined them with advanced optimization techniques that feature fine-grain and cache-contained kernels, a task based approach, and hybrid execution and scheduling runtime, all of which significantly increase the performance of our SVD solver. Our results demonstrate a many-fold performance in- crease compared to currently available software. In particular, our software is two times faster than Intel's Math Kernel Library (MKL), a highly optimized implementation from the hardware vendor, when all the singular vectors are requested; it achieves a 5-fold speed-up when only 20% of the vectors are computed; and it is up to 10 times faster if only the singuar values are required.
Published 2013-10-29 04:00:00 as ut-eecs-13-720 (ID:578)