An Improved Parallel Singular Value Algorithm and Its Implementation for Multicore Hardware
Azzam Haidar and Piotr Luszczek and Jakub Kurzak and Jack Dongarra
Abstract The enormous gap between the high-performance capabilities of today's CPUs and off-chip communication poses extreme challenges to the development of numerical software that is scalable and achieves high performance. In this article, we describe a successful methodology to address these challenges, starting with our algorithm design, continuing through kernel optimization and tuning, and finishing with our programming model. Together, these lead to the development of a scalable high-performance Singular Value Decomposition (SVD) solver. We developed a set of highly optimized kernels and combined them with advanced optimization techniques that feature fine-grain and cache-contained kernels, a task-based approach, and a hybrid execution and scheduling runtime, all of which significantly increase the performance of our SVD solver. Our results demonstrate a many-fold performance increase compared to currently available software. In particular, our software is two times faster than Intel's Math Kernel Library (MKL), a highly optimized implementation from the hardware vendor, when all the singular vectors are requested; it achieves a 5-fold speed-up when only 20% of the vectors are computed; and it is up to 10 times faster if only the singular values are required.
Published 2013-10-29 04:00:00 as ut-eecs-13-720 (ID:578)