K-Athena: a performance portable structured grid finite volume magnetohydrodynamics code

Authors: Philipp Grete, Brian W. O'Shea, Forrest W. Glines

DOI:

Keywords:

Abstract: Large-scale simulations are a key pillar of modern research and require ever-increasing computational resources. Different novel manycore architectures have emerged in recent years on the way towards the exascale era. Performance portability is required to prevent repeated non-trivial refactoring of a code for different architectures. We combine Athena++, an existing magnetohydrodynamics (MHD) CPU code, with Kokkos, a performance portable on-node parallel programming paradigm, into K-Athena to allow efficient simulations on multiple architectures using a single codebase. We present profiling and scaling results for different platforms including Intel Skylake CPUs, Intel Xeon Phis, and NVIDIA GPUs. K-Athena achieves $>10^8$ cell-updates/s on a single V100 GPU for second-order double precision MHD calculations, and a speedup of 30 on up to 24,576 GPUs on Summit (compared to 172,032 CPU cores), reaching $1.94\times10^{12}$ total cell-updates/s at 76% parallel efficiency. Using a roofline analysis we demonstrate that the overall performance is currently limited by DRAM bandwidth and calculate a performance portability metric of 62.8%. Finally, we present the implementation strategies used and the challenges encountered in maximizing performance. This will provide other research groups with a straightforward approach to prepare their own codes for the exascale era. K-Athena is available at this https URL .
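The figures quoted in the abstract can be related to each other with a little arithmetic. The sketch below is illustrative only and is not taken from the K-Athena code. It assumes that the 76% figure is a weak-scaling parallel efficiency measured relative to a single GPU, and that the performance portability metric is the commonly used harmonic mean of per-platform efficiencies, which is zero if any platform is unsupported. The function names and the sample efficiencies are hypothetical.

```python
def performance_portability(efficiencies):
    """Harmonic mean of per-platform efficiencies (fractions in (0, 1]).

    Assumed definition: returns 0.0 if the list is empty or if any
    platform is unsupported (None) or has non-positive efficiency.
    """
    if not efficiencies or any(e is None or e <= 0 for e in efficiencies):
        return 0.0
    return len(efficiencies) / sum(1.0 / e for e in efficiencies)


def per_device_rate(total_rate, n_devices):
    """Average throughput per device, in cell-updates/s."""
    return total_rate / n_devices


def implied_single_device_rate(total_rate, n_devices, parallel_efficiency):
    """Single-device rate implied by a total rate at a given parallel
    efficiency, assuming efficiency = per-device rate at scale divided
    by the single-device rate (weak scaling)."""
    return per_device_rate(total_rate, n_devices) / parallel_efficiency


# Numbers quoted in the abstract.
total = 1.94e12   # total cell-updates/s on Summit
gpus = 24576      # number of GPUs
eff = 0.76        # parallel efficiency

rate_at_scale = per_device_rate(total, gpus)
rate_single = implied_single_device_rate(total, gpus, eff)

# Hypothetical per-platform efficiencies, purely to exercise the metric.
sample_metric = performance_portability([0.5, 1.0])
```

Under these assumptions the per-GPU rate at scale comes out just below $8\times10^7$ cell-updates/s, and the implied single-GPU rate is slightly above $10^8$ cell-updates/s, which is consistent with the single V100 figure in the abstract. The harmonic mean is dominated by the least efficient platform, so a single poorly performing architecture pulls the metric down sharply.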
