
EECS Publication

Kernel Assisted Collective Intra-node Communication Among Multicore and Manycore CPUs

Teng Ma, George Bosilca, Aurelien Bouteiller, Brice Goglin, Jeffrey M. Squyres, Jack J. Dongarra

Even with advances in materials science, fundamental limits on heat dissipation and power distribution are preventing higher CPU clock frequencies. Industry solutions for increasing computation speed have instead concentrated on raising the number of computational cores, leading to the widespread adoption of so-called 'fat' nodes. However, keeping all the computation cores busy doing useful work is a challenge: typical high-performance computing (HPC) workloads read and write a steady stream of data, and contention for memory bandwidth becomes a bottleneck. Many commodity platforms have therefore embraced non-uniform memory access (NUMA) architectures that split up and distribute memory so that it is close to the cores. High-performance Message Passing Interface (MPI) implementations must exploit these architectures to provide reliable performance portability. NUMA architectures not only require specialized MPI point-to-point messaging protocols, they also require carefully designed and tuned algorithms for MPI collective operations. Multiple issues must be taken into account: 1) minimizing the number of copies required, 2) minimizing traffic to 'remote' NUMA memory, and 3) carefully avoiding memory bottlenecks in 'rooted' collective operations. In this paper, we present a kernel-assisted intra-node collective module addressing these three issues on manycore systems. A kernel-level inter-process memory copy module, called KNEM, is used by a novel Open MPI collective module to implement improved strategies that decrease the number of intermediate memory copies and improve locality, reducing both the pressure on the memory banks and cache pollution. The collective topology is mapped onto the NUMA topology to minimize cross traffic on inter-socket links. Experiments illustrate that the KNEM-enabled Open MPI collective module achieves up to a threefold speedup on synthetic benchmarks and a 12% improvement for a parallel graph shortest-path discovery application.
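
To make the setting concrete, the sketch below shows the kind of rooted intra-node collective and timing loop that a synthetic benchmark of such a module would exercise. It is a minimal illustration using only standard MPI calls; the message size and iteration count are arbitrary, and nothing in the source is specific to the KNEM-backed implementation, which is selected inside Open MPI rather than in application code.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define MSG_BYTES (4 * 1024 * 1024)  /* illustrative message size */
    #define ITERS     100                /* illustrative iteration count */

    int main(int argc, char **argv)
    {
        int rank, i;
        double t0, t1;
        char *buf;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        buf = malloc(MSG_BYTES);
        if (rank == 0)
            for (i = 0; i < MSG_BYTES; i++)
                buf[i] = (char)i;

        MPI_Barrier(MPI_COMM_WORLD);  /* align all ranks before timing */
        t0 = MPI_Wtime();
        for (i = 0; i < ITERS; i++) {
            /* Rooted collective: with a kernel-assisted module, non-root
             * ranks on the same node can copy directly from the root's
             * buffer in one pass instead of staging the data through an
             * intermediate shared-memory segment. */
            MPI_Bcast(buf, MSG_BYTES, MPI_CHAR, 0, MPI_COMM_WORLD);
        }
        t1 = MPI_Wtime();

        if (rank == 0)
            printf("average broadcast time: %g s\n", (t1 - t0) / ITERS);

        free(buf);
        MPI_Finalize();
        return 0;
    }

Launched with, e.g., mpirun -np 8 on a single node, the same program measures whichever collective component the Open MPI build selects; in a KNEM-capable build that choice is made through MCA runtime parameters, whose exact names depend on the Open MPI version.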

Published 2010-11-19 as ut-cs-10-663

ut-cs-10-663.pdf
