A 'cool' way of improving the reliability of HPC machines

Authors: Osman Sarood, Esteban Meneses, Laxmikant V. Kale

DOI: 10.1145/2503210.2503228

Keywords: Computer science, Software, Fault tolerance, Embedded system, Load balancing (computing), Execution time, Energy consumption

Abstract: Soaring energy consumption, accompanied by declining reliability, together loom as the biggest hurdles for the next generation of supercomputers. Recent reports have expressed concern that reliability at exascale level could degrade to the point where failures become a norm rather than an exception. HPC researchers are focusing on improving existing fault tolerance protocols to address these concerns. Research on hardware reliability, i.e., machine component reliability, has also been making progress independently. In this paper, we try to bridge this gap and explore the potential of combining both software and hardware aspects towards improving the reliability of HPC machines. Fault rates are known to double for every 10°C rise in core temperature. We leverage this notion to experimentally demonstrate the potential of restraining core temperatures and load balancing to achieve two-fold benefits: improving the reliability of parallel machines and reducing the total execution time required by applications. Our experimental results show that we can improve the reliability of a machine by a factor of 2.3 and reduce the execution time by 12%. In addition, our scheme can also reduce machine energy consumption by as much as 25%. For a 350K socket machine, regular checkpoint/restart fails to make progress (less than 1% efficiency), whereas our validated model predicts an efficiency of 20% by improving the machine reliability by a factor of up to 2.29.
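
The abstract's rule of thumb, that fault rates double for every 10°C rise in core temperature, can be turned into a quick back-of-the-envelope calculation. The sketch below is illustrative arithmetic only, not the paper's validated model: the function names are hypothetical, and it assumes that reliability (MTBF) is simply the inverse of the fault rate and that the doubling rule holds exactly across the whole temperature range.

```python
import math


def relative_fault_rate(delta_t_celsius: float, doubling_step: float = 10.0) -> float:
    """Fault-rate multiplier for a core-temperature change of delta_t_celsius.

    Assumption: the rate doubles for every `doubling_step` degrees of rise,
    so a negative change (cooling) gives a multiplier below 1.
    """
    return 2.0 ** (delta_t_celsius / doubling_step)


def mtbf_gain_from_cooling(cooling_celsius: float, doubling_step: float = 10.0) -> float:
    """MTBF improvement factor when cores run `cooling_celsius` degrees cooler.

    Assumption: MTBF is the inverse of the fault rate.
    """
    return 1.0 / relative_fault_rate(-cooling_celsius, doubling_step)


def cooling_for_gain(gain: float, doubling_step: float = 10.0) -> float:
    """Temperature reduction (in °C) that this simple rule needs for a given MTBF gain."""
    return doubling_step * math.log2(gain)


if __name__ == "__main__":
    print(relative_fault_rate(10.0))        # 2.0: a 10°C rise doubles the fault rate
    print(mtbf_gain_from_cooling(10.0))     # 2.0: running 10°C cooler doubles MTBF
    print(round(cooling_for_gain(2.3), 1))  # 12.0: °C of cooling for a 2.3x gain
```

Under these assumptions, a reliability gain of 2.3 corresponds to cores running roughly 12°C cooler. The abstract does not state which temperature reduction produced its reported factor, so this figure is only an order-of-magnitude check on the rule.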
