Authors: Neil Butcher (University of Notre Dame)
Abstract: Several recent systems in the Top500 include many-core chips with complex memory systems, including multiple memory channels. Many many-core chips feature an intermediate layer of memory with higher bandwidth and lower capacity than main memory. This intermediate memory is exposed either as a cache or as a separate address space.
This paper uses Intel's Knights Landing (KNL) processor as a testbed; it includes both intermediate memory and multiple architectural knobs to adjust affinity. We present cache-oblivious and chunking algorithms for sort, matrix multiply, and Fast Fourier Transforms (FFT), and compare them to state-of-the-art codes. Experimenting with a wide range of problem types and algorithmic solutions gives insight into how affinity can affect performance. Chunking often achieves low utilization of the memory system, as the cost of adding threads to move data outweighs the benefit of improved bandwidth. The results achieved with straightforward cache-oblivious codes are competitive with state-of-the-art codes.
Best Poster Finalist (BP): no
Poster summary: PDF