Recent Improvements to AWP-ODC-GPU supported by Blue Waters PAID Program

Yifeng Cui, Daniel Roten, Peng Wang, Dawei Mu, Carl Pearson, Kyle B. Withers, William H. Savran, Kim B. Olsen, Steven M. Day, & Wen-Mei Hwu

Submitted August 15, 2016, SCEC Contribution #6711, 2016 SCEC Annual Meeting Poster #344

Supported by the NSF’s Petascale Application Improvement Discovery (PAID) program, the SCEC computational team has recently worked with the Improvement Method Enablers (IME) team of UIUC, led by Wen-Mei Hwu, to optimize the nonlinear AWP-ODC-GPU code for scalability and efficiency. During the course of this project, plasticity and frequency-dependent attenuation features have been introduced. This code integration and optimization made it possible to run a nonlinear M 7.7 earthquake simulation on the southern San Andreas fault with a spatial resolution of 25 m for frequencies up to 4 Hz on 4,200 Blue Waters GPUs at the National Center for Supercomputing Applications.

Various strategies have been employed to optimize performance and exploit the accelerator hardware, including yield-factor interpolation, memory tuning to increase occupancy, communication/computation overlap using multi-streaming, and parallel I/O to support concurrent source inputs. To ensure sufficient occupancy to hide memory latency and to reduce redundant accesses, we use shared memory to cache velocity variables, and the texture cache combined with register queues to cache read-only earth-model parameters. After these optimizations, the performance of the plasticity kernel improved by 45% and that of the stress kernel by 20%. The plasticity kernel reached 75% of the theoretical maximum DRAM throughput and an access fraction of nearly 1.
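The three caching strategies named above can be sketched in a toy CUDA kernel. This is a minimal illustration, not code from AWP-ODC-GPU: all names (`stress_sketch`, `vel`, `lambda_mu`, `NX`/`NY`/`NZ`, tile sizes) are hypothetical, the stencil is a stand-in for the real staggered-grid stress update, and halo loads into shared memory are omitted for brevity. It shows shared memory caching a tile of the velocity field, `__ldg` routing a read-only earth-model parameter through the texture (read-only data) cache, and a register queue rotating three z-planes through registers as the thread marches along z.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define NX 64          // illustrative grid dimensions
#define NY 64
#define NZ 64
#define TILE_X 16      // illustrative thread-block tile
#define TILE_Y 16

__global__ void stress_sketch(const float* __restrict__ vel,       // velocity field
                              const float* __restrict__ lambda_mu, // read-only model params
                              float* stress_out)
{
    // Shared memory caches a 2-D tile of velocity (with room for a 1-cell halo,
    // whose loads are omitted here), so x/y-neighbor accesses come from
    // on-chip storage instead of redundant DRAM loads.
    __shared__ float s_vel[TILE_Y + 2][TILE_X + 2];

    int ix = blockIdx.x * TILE_X + threadIdx.x;
    int iy = blockIdx.y * TILE_Y + threadIdx.y;

    // Register queue: keep the below/center/above z-planes in registers.
    float v_below  = 0.0f;
    float v_center = vel[(0 * NY + iy) * NX + ix];
    float v_above  = vel[(1 * NY + iy) * NX + ix];

    for (int iz = 0; iz < NZ; ++iz) {
        s_vel[threadIdx.y + 1][threadIdx.x + 1] = v_center;
        __syncthreads();

        // __ldg routes the read-only earth-model parameter through the
        // texture (read-only data) cache.
        float mu = __ldg(&lambda_mu[(iz * NY + iy) * NX + ix]);

        // Toy second-difference in z plus an x-neighbor term (interior
        // threads only, since halo cells were not loaded).
        float d2z = v_above - 2.0f * v_center + v_below;
        float dx  = (threadIdx.x + 1 < TILE_X)
                        ? s_vel[threadIdx.y + 1][threadIdx.x + 2] - v_center
                        : 0.0f;
        stress_out[(iz * NY + iy) * NX + ix] = mu * (d2z + dx);
        __syncthreads();

        // Rotate the register queue for the next z-plane: one new DRAM load
        // per plane instead of three.
        v_below  = v_center;
        v_center = v_above;
        v_above  = (iz + 2 < NZ) ? vel[((iz + 2) * NY + iy) * NX + ix] : 0.0f;
    }
}

int main() {
    size_t n = (size_t)NX * NY * NZ;
    float *vel, *lm, *out;
    cudaMallocManaged(&vel, n * sizeof(float));
    cudaMallocManaged(&lm,  n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (size_t i = 0; i < n; ++i) { vel[i] = 1.0f; lm[i] = 2.0f; }

    dim3 block(TILE_X, TILE_Y), grid(NX / TILE_X, NY / TILE_Y);
    stress_sketch<<<grid, block>>>(vel, lm, out);
    cudaDeviceSynchronize();
    printf("stress_out[0] = %f\n", out[0]);
    return 0;
}
```

The register queue is what lets each z-plane of velocity be fetched from DRAM exactly once, which is the kind of redundant-access elimination that drove the 45% and 20% kernel improvements reported above.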

By utilizing the multi-streaming technique, the linear implementation of AWP-ODC-GPU overlaps communication with computation, achieving a parallel efficiency of 100% on 16,384 XK7 GPUs. Because of the larger communication volume required by the nonlinear implementation, its parallel efficiency initially dropped to 94.3% on 3,680 Blue Waters XK7 nodes. We therefore employed Cray’s Topaware tool to aid node selection and MPI rank placement by mapping the virtual Cartesian-grid MPI topology onto the system's torus interconnection network. The nonlinear implementation thereby obtained a speedup of 5.6%, restoring parallel efficiency to 99.5% at that scale. Overall, the computational overhead of the nonlinear code relative to the highly efficient linear code was reduced from an initial factor of 4 to less than 50%, with 2.9 Pflop/s recorded in benchmarks on OLCF Titan.
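The multi-streaming overlap pattern can be sketched as follows. This is a hedged, minimal illustration under assumptions, not the AWP-ODC-GPU implementation: the kernel, buffer names, and sizes (`update_interior`, `d_halo`, `N`, `HALO`) are invented for the example, and the MPI exchange that would consume the halo buffer is elided to a comment. The idea it shows is the one described above: halo transfers on one CUDA stream proceed concurrently with the interior update on another, hiding communication time behind computation.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N (1 << 20)   // interior cells per subdomain (illustrative)
#define HALO 4096     // halo cells per face (illustrative)

__global__ void update_interior(float* u, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) u[i] += 1.0f;   // stand-in for the stress/velocity update
}

int main() {
    float *d_interior, *d_halo, *h_halo;
    cudaMalloc(&d_interior, N * sizeof(float));
    cudaMalloc(&d_halo, HALO * sizeof(float));
    cudaMallocHost(&h_halo, HALO * sizeof(float));  // pinned memory, required
                                                    // for truly async copies

    cudaStream_t compute, comm;
    cudaStreamCreate(&compute);
    cudaStreamCreate(&comm);

    // The halo download on the comm stream runs concurrently with the
    // interior update on the compute stream.
    cudaMemcpyAsync(h_halo, d_halo, HALO * sizeof(float),
                    cudaMemcpyDeviceToHost, comm);
    update_interior<<<(N + 255) / 256, 256, 0, compute>>>(d_interior, N);

    cudaStreamSynchronize(comm);     // halos on host -> post MPI_Isend/Irecv here
    cudaStreamSynchronize(compute);  // interior done -> boundary update follows

    printf("overlap step complete\n");
    cudaFree(d_interior); cudaFree(d_halo); cudaFreeHost(h_halo);
    cudaStreamDestroy(compute); cudaStreamDestroy(comm);
    return 0;
}
```

When the halo copy plus MPI exchange finishes before the interior kernel does, communication cost is fully hidden; the nonlinear code's larger halos make that window harder to hide, which is why topology-aware rank placement was needed to recover near-ideal efficiency.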

Key Words
AWP-ODC, Blue Waters, GPU, scalability, efficiency

Cui, Y., Roten, D., Wang, P., Mu, D., Pearson, C., Withers, K. B., Savran, W. H., Olsen, K. B., Day, S. M., & Hwu, W. (2016, 08). Recent Improvements to AWP-ODC-GPU supported by Blue Waters PAID Program. Poster Presentation at 2016 SCEC Annual Meeting.

Related Projects & Working Groups
Computational Science (CS)