Performance evaluation of LU factorization through hardware counter measurements
Simplice Donfack, Stanimire Tomov, and Jack Dongarra
The growing demand for scalable and effective scientific and numerical libraries on multi-core architectures pushes hardware manufacturers to design solutions that improve both processor speed and the transfer rates across the memory hierarchy. Several studies show that these two improvement factors are disproportionate and may vary widely from one architecture to another, which strongly affects both the tuning and the performance prediction of numerical libraries. In this paper, we analyze the communication and performance of selected routines from well-known libraries on different architectures, and we establish a model relating hardware parameters to performance. We focus on the LU factorization, which is one of the most popular algorithms in scientific computing and is therefore also used as a benchmark, e.g., in the HPL benchmark used to rank the TOP500 supercomputers. Our experiments, based on hardware counter measurements, allow us to predict the performance behavior of numerical algorithms (LU in particular) on different architectures.
Published 2012-10-01 04:00:00 as ut-cs-12-700 (ID:12)