SCEC Award Number 19176
Proposal Category Individual Proposal (Integration and Theory)
Proposal Title Unified and Continuous Software Development for AWP-ODC-OS: Phase II
Investigator(s)
Name Organization
Yifeng Cui University of California, San Diego
Alexander Breuer University of California, San Diego
Other Participants 1 graduate student TBD
SCEC Priorities 4c, 4a, 2b SCEC Groups CS, GM, EFP
Report Due Date 04/30/2020 Date Report Submitted 09/22/2022
Project Abstract
In this project, we transfer our initial continuous software development infrastructure for AWP-ODC-OS to a software-as-a-service stack. While the previous SCEC project focused on improving the code quality of AWP-ODC-OS itself, this work shifts gears: the targets are the extensibility and scalability of the continuous integration and delivery processes. We exercise the new service on SDSC’s Expanse, which features direct scheduler integration with AWS and leverages high-speed networks to ease data movement to and from the cloud. We also demonstrate first-hand cloud experience with EDGE in AWS, sustaining 1.09 Pflop/s in weak scaling on 768 instances.
Intellectual Merit The continuous integration and delivery (CI/CD) structure built for AWP-ODC-OS in this project demonstrates the layout of the respective repositories and provides insight into the technical aspects of the key software and data implementation. The implemented CI/CD pipelines automatically analyze AWP-ODC-OS, progressing from simple sanity checks, e.g., memory debugging, toward continuous verification on supercomputing infrastructure. This transformation toward modern and comprehensive software engineering practices starts with AWP-ODC-OS and is tailored to SCEC’s long-term community open-source needs, with the potential to extend the CI/CD server setups into public clouds, e.g., Amazon's AWS EC2.
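The staged progression described above, from cheap sanity checks toward expensive verification, can be illustrated with a minimal sketch. It is not the project's actual pipeline: the `Stage` type, the `run_pipeline` function, and the stage names are hypothetical, and the checks are stubs standing in for real jobs such as a memory-debugging run or a verification run on a supercomputer.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Stage:
    """One pipeline stage (hypothetical type, for illustration only)."""
    name: str
    check: Callable[[], bool]  # returns True if the stage passed


def run_pipeline(stages: List[Stage]) -> Tuple[bool, List[str]]:
    """Run stages in order and stop at the first failure.

    Returns (all_passed, names_of_stages_that_passed), so that costly
    later stages are never started when an earlier, cheaper one fails.
    """
    passed: List[str] = []
    for stage in stages:
        if not stage.check():
            return False, passed
        passed.append(stage.name)
    return True, passed


if __name__ == "__main__":
    # Stub checks; stage names are assumptions, not taken from the report.
    stages = [
        Stage("build", lambda: True),
        Stage("memory_debugging", lambda: True),
        Stage("supercomputer_verification", lambda: True),
    ]
    ok, passed = run_pipeline(stages)
    print(ok, passed)
```

Ordering the stages from cheapest to most expensive is the design choice that matters here: a failed sanity check ends the run before any supercomputing allocation is consumed.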
Broader Impacts This work helps translate basic research into practical products for reducing risk and improving community resilience. Cloud solutions are the best available approach to tackle this amount of diversity, and continuous development improves the capability of wave propagation codes, moving toward automated continuous verification on supercomputing infrastructure.
Exemplary Figure Figure 2. Weak and strong scalability of EDGE in AWS EC2 on c5.18xlarge instances. We sustained 1.09 Pflop/s in weak scaling on 768 instances. This elastic high-performance cluster contained 27,648 Skylake-SP cores with a peak performance of 5 Pflop/s. The strong-scaling setting on 64 instances reached 53 Tflop/s. This work was published in a technical paper at ISC’2019.
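The figures quoted in the caption can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers stated above; the constant and function names are illustrative. Because the 5 Pflop/s peak is a rounded value, the resulting fraction of peak is approximate.

```python
# Inputs quoted in the figure caption.
INSTANCES_WEAK = 768             # c5.18xlarge instances in the weak-scaling run
TOTAL_CORES = 27_648             # Skylake-SP cores in the cluster
SUSTAINED_WEAK_PFLOPS = 1.09     # sustained weak-scaling performance
PEAK_PFLOPS = 5.0                # quoted (rounded) peak performance
INSTANCES_STRONG = 64            # instances in the strong-scaling setting
SUSTAINED_STRONG_TFLOPS = 53.0   # sustained strong-scaling performance


def derived_metrics() -> dict:
    """Derive per-instance and fraction-of-peak values from the caption."""
    return {
        # 27,648 cores spread evenly over 768 instances
        "cores_per_instance": TOTAL_CORES // INSTANCES_WEAK,
        # sustained / peak, both in Pflop/s
        "fraction_of_peak": SUSTAINED_WEAK_PFLOPS / PEAK_PFLOPS,
        # convert Pflop/s to Tflop/s, then divide by the instance count
        "weak_tflops_per_instance": SUSTAINED_WEAK_PFLOPS * 1000 / INSTANCES_WEAK,
        "strong_tflops_per_instance": SUSTAINED_STRONG_TFLOPS / INSTANCES_STRONG,
    }


if __name__ == "__main__":
    for key, value in derived_metrics().items():
        print(f"{key}: {value:.4g}")
```

Under these inputs, the cluster has 36 cores per instance and the weak-scaling run sustains roughly 22% of the quoted peak.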