EECS Publication
Optimizing Matrix Multiply using PHiPAC: a Portable, High-Performance, ANSI C Coding Methodology
J. Bilmes, K. Asanovic, Jim Demmel, D. Lam and C.-W. Chin
BLAS3 operations have great potential for aggressive optimization. Unfortunately, they usually need to be hand-coded for a specific machine and compiler to achieve near-peak performance. We have developed a methodology whereby near-peak performance on a wide range of systems can be achieved automatically for such routines. First, by analyzing current machines and C compilers, we've developed guidelines for writing Portable, High-Performance, ANSI C (PHiPAC, pronounced 'fee-pack'). Second, rather than code by hand, we produce parameterized code generators. Third, we write search scripts that find the best parameters for a given system. We report on a BLAS GEMM-compatible, multi-level cache-blocked matrix multiply generator that produces code achieving performance in excess of 90% of peak on the Sparcstation-20/61, IBM RS/6000-590, and HP 712/80i, and 80% of peak on the SGI Indigo R4k. On the IBM, HP, and SGI, the resulting routine is often faster than the vendor-supplied BLAS GEMM.
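The two ideas named in the abstract, cache blocking controlled by a parameter and a search that picks the best parameter for a given machine, can be illustrated with a minimal C sketch. This is not the code PHiPAC generates: it uses a single level of blocking on square, row-major matrices, performs no register blocking, and folds the search into a C function rather than a separate script. The names `matmul_naive`, `matmul_blocked`, `search_block_size`, and the parameter `bs` are hypothetical and chosen only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Reference triple loop: C += A * B, all n x n, row-major. */
void matmul_naive(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n; i++)
        for (size_t k = 0; k < n; k++) {
            const double aik = a[i * n + k];
            for (size_t j = 0; j < n; j++)
                c[i * n + j] += aik * b[k * n + j];
        }
}

size_t min_sz(size_t x, size_t y) { return x < y ? x : y; }

/* One level of cache blocking: C += A * B computed tile by tile.
   bs is the tunable tile edge (an assumed parameter for this sketch);
   edge tiles are clipped, so n need not be a multiple of bs. */
void matmul_blocked(size_t n, size_t bs,
                    const double *a, const double *b, double *c)
{
    assert(bs > 0);
    for (size_t ii = 0; ii < n; ii += bs)
        for (size_t kk = 0; kk < n; kk += bs)
            for (size_t jj = 0; jj < n; jj += bs) {
                const size_t i_end = min_sz(ii + bs, n);
                const size_t k_end = min_sz(kk + bs, n);
                const size_t j_end = min_sz(jj + bs, n);
                for (size_t i = ii; i < i_end; i++)
                    for (size_t k = kk; k < k_end; k++) {
                        const double aik = a[i * n + k];
                        for (size_t j = jj; j < j_end; j++)
                            c[i * n + j] += aik * b[k * n + j];
                    }
            }
}

/* Toy parameter search: time each candidate tile size once and return
   the fastest. c is scratch space and accumulates across trials, so its
   contents are not meaningful afterwards. */
size_t search_block_size(size_t n, const size_t *cands, size_t ncands,
                         const double *a, const double *b, double *c)
{
    assert(ncands > 0);
    size_t best = cands[0];
    double best_time = -1.0;
    for (size_t t = 0; t < ncands; t++) {
        const clock_t start = clock();
        matmul_blocked(n, cands[t], a, b, c);
        const double elapsed = (double)(clock() - start) / CLOCKS_PER_SEC;
        if (best_time < 0.0 || elapsed < best_time) {
            best_time = elapsed;
            best = cands[t];
        }
    }
    return best;
}
```

In practice a search of this kind would time each candidate several times on matrices large enough to exceed the cache, since a single `clock()` measurement on a small problem is too coarse to distinguish candidates.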
Published 1996-05-01 05:00:00 as ut-cs-96-326 (ID:368)