Programmer-directed partial redundancy for resilient HPC

Authors: Omer Subasi, Javier Arias, Osman Unsal, Jesus Labarta, Adrian Cristal

DOI: 10.1145/2742854.2742903

Abstract: In this work we propose partial task replication and checkpointing for task-parallel HPC applications to mitigate silent data corruption (SDC) errors. Because complete replication of all application tasks can be prohibitive due to resource costs, we introduce a programmer-directed selective mechanism that provides fault tolerance while decreasing those costs. Results show that our scheme detects and corrects around 65% of SDC errors with only 4% overhead on average.
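
The abstract does not describe how replicated tasks detect and correct SDCs, so the following is only a minimal sketch of one plausible design, assuming a simple in-process task model. The `Task` type, its `replicate` flag, `run_task`, and the fault injector are hypothetical names, not the interface of the authors' runtime. A task the programmer marks for protection has its inputs checkpointed and is run twice; if the two outputs disagree, an SDC has been detected, and a third replica is run from the checkpoint so that a majority vote can correct it. Unmarked tasks run once, unprotected.

```python
import copy
from dataclasses import dataclass
from typing import Any, Callable, Optional

# Hypothetical fault injector: (task name, replica index, result) -> possibly corrupted result.
Injector = Callable[[str, int, Any], Any]


@dataclass
class Task:
    """Hypothetical task model; not the paper's runtime API."""
    name: str
    fn: Callable[..., Any]
    args: tuple
    replicate: bool = False  # programmer-directed choice: protect this task?


@dataclass
class Stats:
    executions: int = 0
    detected: int = 0
    corrected: int = 0


def run_task(task: Task, stats: Stats, inject: Optional[Injector] = None) -> Any:
    """Run one task; if it is marked for replication, duplicate, compare, and vote."""

    def execute(replica: int, inputs: tuple) -> Any:
        stats.executions += 1
        result = task.fn(*inputs)
        # The optional injector models a silent data corruption in this replica.
        return inject(task.name, replica, result) if inject else result

    if not task.replicate:
        # Unprotected task: a single execution, so any SDC goes unnoticed.
        return execute(0, task.args)

    # Checkpoint the inputs so every replica starts from the same clean state,
    # even if the task mutates its arguments.
    checkpoint = copy.deepcopy(task.args)
    first = execute(0, copy.deepcopy(checkpoint))
    second = execute(1, copy.deepcopy(checkpoint))
    if first == second:
        return first  # replicas agree: commit the result

    # Mismatch: SDC detected. Restore from the checkpoint and run a third replica.
    stats.detected += 1
    third = execute(2, copy.deepcopy(checkpoint))
    if third == first or third == second:
        stats.corrected += 1
        return third  # majority vote picks the value two replicas agree on
    raise RuntimeError(f"uncorrectable SDC in task {task.name}")


def corrupt_scale_replica_1(name: str, replica: int, result: Any) -> Any:
    # Hypothetical injector: corrupts only replica 1 of the task named "scale".
    return result + 1 if (name == "scale" and replica == 1) else result


stats = Stats()
tasks = [
    Task("sum", lambda xs: sum(xs), ([1, 2, 3],)),
    Task("scale", lambda xs, k: sum(x * k for x in xs), ([1, 2, 3], 2), replicate=True),
]
results = {t.name: run_task(t, stats, corrupt_scale_replica_1) for t in tasks}
print(results, stats)
# prints: {'sum': 6, 'scale': 12} Stats(executions=4, detected=1, corrected=1)
```

Only the protected task pays for extra executions (three here, versus one for the unprotected task), which illustrates the coverage-versus-overhead trade-off that selective, programmer-directed replication targets.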
