Authors: Osman Sarood, Esteban Meneses, Laxmikant V. Kale
Keywords: Computer science, Software, Fault tolerance, Embedded system, Load balancing (computing), Execution time, Energy consumption
Abstract: Soaring energy consumption, accompanied by declining reliability, together loom as the biggest hurdles for the next generation of supercomputers. Recent reports have expressed concern that reliability at the exascale level could degrade to the point where failures become a norm rather than an exception. HPC researchers are focusing on improving existing fault tolerance protocols to address these concerns. Research on improving hardware reliability, i.e., machine component reliability, has also been making progress independently. In this paper, we try to bridge this gap and explore the potential of combining both software and hardware aspects towards improving the reliability of HPC machines. Fault rates are known to double for every 10°C rise in core temperature. We leverage this notion to experimentally demonstrate the potential of restraining core temperatures and load balancing to achieve two-fold benefits: improving the reliability of parallel machines and reducing the total execution time required by applications. Our experimental results show that we can improve the reliability of a machine by a factor of 2.3 and reduce the execution time by 12%. In addition, our scheme can also reduce machine energy consumption by as much as 25%. For a 350K socket machine, regular checkpoint/restart fails to make progress (less than 1% efficiency), whereas our validated model predicts an efficiency of 20% by improving the machine reliability by a factor of up to 2.29.
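
The temperature rule stated in the abstract (fault rates double for every 10°C rise in core temperature) can be written as a short calculation. The following is a minimal sketch, not the authors' model: the function names are hypothetical, and it assumes that mean time between failures (MTBF) is inversely proportional to the fault rate and that the doubling rule holds exactly over the whole temperature range of interest.

```python
import math


def fault_rate_multiplier(delta_t_celsius: float) -> float:
    """Relative fault rate after the core temperature changes by delta_t_celsius.

    Assumption: the fault rate doubles for every 10 degree Celsius rise,
    so a negative delta (cooling) gives a multiplier below 1.
    """
    return 2.0 ** (delta_t_celsius / 10.0)


def mtbf_improvement(cooling_celsius: float) -> float:
    """Factor by which MTBF grows when cores run cooling_celsius cooler.

    Assumption: MTBF is inversely proportional to the fault rate.
    """
    return 1.0 / fault_rate_multiplier(-cooling_celsius)


def cooling_for_improvement(factor: float) -> float:
    """Cooling (in degrees Celsius) that the doubling rule alone would need
    to reach a given MTBF improvement factor."""
    return 10.0 * math.log2(factor)


if __name__ == "__main__":
    for cooling in (5.0, 10.0, 20.0):
        print(f"{cooling:4.1f} C cooler -> MTBF x{mtbf_improvement(cooling):.2f}")
    print(f"MTBF x2.3 needs about {cooling_for_improvement(2.3):.1f} C of cooling")
```

Under these assumptions, a reliability gain of the size reported in the abstract (a factor of 2.3) would correspond to roughly 12°C of cooling. That figure is arithmetic on the doubling rule only, not a temperature measurement taken from the paper.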