A 'cool' way of improving the reliability of HPC machines

Authors: Osman Sarood, Esteban Meneses, Laxmikant V. Kale

Keywords: Computer science, Software, Fault tolerance, Embedded system, Load balancing (computing), Execution time, Energy consumption

Abstract: Soaring energy consumption, accompanied by declining reliability, together loom as the biggest hurdles for the next generation of supercomputers. Recent reports have expressed concern that reliability at exascale level could degrade to the point where failures become a norm rather than an exception. HPC researchers are focusing on improving existing fault tolerance protocols to address these concerns. Research on improving hardware reliability, i.e., machine component reliability, has also been making progress independently. In this paper, we try to bridge this gap and explore the potential of combining both software and hardware aspects towards improving the reliability of HPC machines. Fault rates are known to double for every 10°C rise in core temperature. We leverage this notion to experimentally demonstrate the potential of restraining core temperatures and load balancing to achieve two-fold benefits: improving the reliability of parallel machines and reducing the total execution time required by applications. Our experimental results show that we can improve the reliability of a machine by a factor of 2.3 and reduce the execution time by 12%. In addition, our scheme can also reduce machine energy consumption by as much as 25%. For a 350K socket machine, regular checkpoint/restart fails to make progress (less than 1% efficiency), whereas our validated model predicts an efficiency of 20% by improving the machine reliability by a factor of up to 2.29.
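The abstract's rule of thumb, that fault rates double for every 10°C rise in core temperature, can be turned into a small back-of-the-envelope calculation. The sketch below is illustrative only and is not the paper's validated model: it assumes the doubling rule holds exactly as an exponential, and it treats "reliability" as the inverse of the fault rate. The function names are hypothetical.

```python
import math


def fault_rate_ratio(delta_t_celsius: float, doubling_step: float = 10.0) -> float:
    """Relative fault rate after the core temperature changes by delta_t_celsius.

    Assumption: the rate doubles for every `doubling_step` degrees of rise,
    i.e. ratio = 2 ** (delta_t / doubling_step).
    """
    return 2.0 ** (delta_t_celsius / doubling_step)


def reliability_gain(cooling_celsius: float, doubling_step: float = 10.0) -> float:
    """Factor by which reliability improves when cores run cooler.

    Assumption: reliability is taken as the inverse of the fault rate.
    """
    return 1.0 / fault_rate_ratio(-cooling_celsius, doubling_step)


def cooling_for_gain(gain: float, doubling_step: float = 10.0) -> float:
    """Temperature reduction (in degrees C) implied by a given reliability gain."""
    return doubling_step * math.log2(gain)


if __name__ == "__main__":
    # A 10 degree C reduction halves the fault rate under the assumed rule.
    print(reliability_gain(10.0))
    # Temperature reduction that would correspond to the reported factor of 2.3.
    print(round(cooling_for_gain(2.3), 1))
```

Under these assumptions, the reported 2.3x reliability improvement would correspond to cores running roughly 12°C cooler. This is arithmetic on the stated rule, not a figure taken from the paper.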
