Modeling of L2 Cache Behavior for Thread-Parallel Scientific Programs on Chip Multi-Processors
Fengguang Song, Shirley Moore, and Jack Dongarra
It is critical to provide high performance for scientific programs running on a Chip MultiProcessor (CMP). A CMP architecture often has a shared L2 cache and lower storage hierarchy. The shared L2 cache can reduce the number of cache misses if the data are commonly shared by several threads, but it can also lead to performance degradation due to resource contention. Sometimes running threads on all cores can cause severe contention and increase the number of cache misses greatly. To investigate how a thread's performance varies when it runs together with other threads on different cores, we develop an analytical model to predict the number of misses on the shared L2 cache, especially for thread-parallel numerical codes. We assume that the parallel threads work on homogeneous tasks and share a fully associative L2 cache. Stack processing technique and circular sequences are used to analyze the L2 trace to predict the number of compulsory misses, capacity misses on shared data, and capacity misses on private data, respectively. It is the first work to predict the number of L2 misses for threads that have the nature of memory sharing. The model has been validated by three typical scientific programs: matrix multiplication, blocked matrix multiplication, and sparse matrix-vector product on a variety of matrix sizes. The average relative error lies between 2% and 12%.
Published 2006-09-01 04:00:00 as ut-cs-06-583 (ID:141)