Event Details

  • Tuesday, May 23, 2017
  • 11:10 - 11:40

Shared Memory Parallelization of the Flux Kernel of PETSc-FUN3D

The flux kernel of PETSc-FUN3D, an unstructured tetrahedral mesh Euler code previously characterized for distributed-memory SPMD execution on thousands of nodes, is hybridized with shared-memory SIMD parallelism for hundreds of threads per node. We explore thread-level performance optimizations on state-of-the-art multi- and many-core Intel processors, including the second-generation Intel Xeon Phi (Knights Landing). Linear algebraic kernels are bottlenecked by memory bandwidth even for modest numbers of cores sharing a common memory. The flux kernel, by contrast, which arises in the control volume discretization of the conservation law residuals and in the formation of the preconditioner for the Jacobian, is compute-intensive and effectively exploits contemporary multi-core hardware. We study its performance on the Xeon Phi in three thread affinity modes (scatter, compact, and balanced), under different configurations of the memory and cluster modes on Knights Landing, and with various code optimizations that improve alignment and reduce cache coherency penalties. The optimizations employed to reduce data motion and cache coherency protocol penalties are expected to be of value to other unstructured applications as many-core architectures evolve.
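To make the cache coherency concern concrete, here is a minimal C sketch of one common way to thread an edge-based residual loop without write conflicts. It is an illustration under stated assumptions, not the method used in PETSc-FUN3D: the names toy_flux and residual_owner_computes, the scalar state, and the contiguous vertex blocks are all hypothetical. Each block of vertices is owned by one thread, and only the owner writes to a vertex's residual, so edges that cross a block boundary are evaluated twice instead of being synchronized.

```c
#include <stddef.h>

/* Hypothetical stand-in for a flux function: half the jump of a scalar
   state across an edge. A real Euler kernel would evaluate a numerical
   flux on a vector of conserved variables. */
static double toy_flux(double qa, double qb)
{
    return 0.5 * (qb - qa);
}

/* Accumulate edge fluxes into per-vertex residuals with an
   "owner computes" rule (an assumption of this sketch).
   Vertices are split into nblocks contiguous blocks, and each block
   writes only to the residual entries it owns, so no two threads ever
   update the same entry. Edges that straddle two blocks are evaluated
   by both owners, trading redundant arithmetic for conflict-free writes.
   For brevity every block scans all edges; a production code would
   presort the edges by block. */
void residual_owner_computes(size_t nvert, size_t nedge,
                             const size_t *edge_a, const size_t *edge_b,
                             const double *q, double *res, int nblocks)
{
    for (size_t v = 0; v < nvert; ++v)
        res[v] = 0.0;

    /* The pragma is ignored when the code is compiled without OpenMP,
       in which case the blocks simply run one after another. */
    #pragma omp parallel for schedule(static)
    for (int b = 0; b < nblocks; ++b) {
        size_t lo = nvert * (size_t)b / (size_t)nblocks;
        size_t hi = nvert * (size_t)(b + 1) / (size_t)nblocks;

        for (size_t e = 0; e < nedge; ++e) {
            size_t a = edge_a[e];
            size_t c = edge_b[e];
            double f = toy_flux(q[a], q[c]);

            if (a >= lo && a < hi) res[a] += f;  /* this block owns a */
            if (c >= lo && c < hi) res[c] -= f;  /* this block owns c */
        }
    }
}
```

With an OpenMP runtime, the placement of the threads that execute these blocks can typically be steered from the environment, which is where affinity modes such as scatter, compact, and balanced come in. Vertices at a block boundary may still share a cache line between two threads, which is one reason alignment and padding of the residual array matter in practice.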