Small Tensor Operations on Advanced Architectures for High-Order Applications
A. Abdelfattah and M. Baboulin and V. Dobrev and J. Dongarra and A. Haidar and I. Karlin and Tz. Kolev and I. Masliah and S. Tomov
This technical report describes our findings on performance optimization of the tensor contraction kernels used in BLAST, a high-order finite element (FE) hydrodynamics research code developed at LLNL, on various modern architectures. Our approach considers and demonstrates ways to organize the contractions, their vectorization, data storage formats, read/write patterns, and parametrization as related to batched execution and parallelism in general. An autotuning framework is designed and used to find empirically the best-performing tensor kernels by exploring the large search space that results from the techniques described. We analyze these kernels to show the trade-offs between the various implementations, how different tensor kernel implementations perform on different architectures, and which tuning parameters have a significant impact on performance. In all cases, we organize the tensor contractions to minimize their communication by considering index reorderings that enable their execution as highly efficient batched matrix-matrix multiplications (GEMMs). We derive a performance model and bound for the maximum performance achievable under the maximum data-reuse scenario, and show that our implementations achieve over 90% of these theoretically derived peaks on advanced multi-core x86 CPU, ARM, GPU, and Xeon Phi architectures. These results significantly outperform what is available today in vendor libraries. In particular, we show average performance speedups of 1.3x and 2x compared to Intel MKL on two 10-core Haswell CPUs and a KNL Xeon Phi, respectively, and 3x compared to NVIDIA cuBLAS on the latest NVIDIA P100 GPU.
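The core idea of casting a tensor contraction as a batched GEMM can be sketched as follows. This is a minimal illustration, not the report's actual kernels: the sizes, names, and the use of NumPy (rather than the tuned batched-GEMM kernels the report develops) are assumptions chosen for clarity. A small basis matrix A is contracted against per-element data B for a batch of elements, once as an explicit index contraction and once as a broadcasted batched matrix multiply; the two agree.

```python
import numpy as np

# Hypothetical sizes: a batch of n_e element-local contractions of small
# matrices, as typical in high-order FE codes (values chosen for illustration).
n_e, m, k, n = 1000, 8, 8, 8

rng = np.random.default_rng(0)
A = rng.random((m, k))        # shared small basis matrix
B = rng.random((n_e, k, n))   # per-element data, batch dimension first

# Direct tensor contraction: C[e, i, j] = sum_k A[i, k] * B[e, k, j]
C_ref = np.einsum('ik,ekj->eij', A, B)

# The same contraction expressed as a batched GEMM: np.matmul broadcasts
# A across the leading batch dimension, performing n_e small GEMMs.
C_gemm = A @ B

assert np.allclose(C_ref, C_gemm)
```

In the report's setting the batched form is what maps onto highly optimized batched GEMM kernels; the index reordering step chooses which tensor modes play the roles of the GEMM dimensions and which become the batch dimension.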
Published 2017-04-18 04:00:00 as ut-eecs-17-749 (ID:609)