Versioning Architectures for Local and Global Memory

Authors: Hajime Fujita, Kamil Iskra, Pavan Balaji, Andrew A. Chien

DOI: 10.1109/ICPADS.2015.71

Abstract: Future supercomputer systems will face serious reliability challenges. Among failure scenarios, latent errors are some of the most concerning. Preserving multiple versions of critical data is a promising approach to dealing with such errors. We are developing the Global View Resilience (GVR) library, with multi-version global arrays as one of its key features. This paper presents three array versioning architectures: flat array, change tracking, and log-structured array. We use a synthetic workload that mimics the memory access patterns of radix sort, N-body simulation, and matrix multiplication, comparing the architectures in terms of runtime performance, memory requirements, and version restoration costs. The experiments show that change tracking is the best architecture for versioning frequencies of 10^-5 ops^-1 or higher, matching the second-best architecture or beating it by up to 23 times, whereas the log-structured array is preferable for low versioning frequencies and for memory usage, since it saves 98% of memory in comparison.
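
The three architectures named in the abstract can be contrasted with a toy single-process sketch. This is an illustration only, not the GVR implementation or its API: the class names, the `BLOCK` size, and the `put`/`snapshot`/`restore` methods are hypothetical, and the change-tracking variant shown here (saving only blocks written since the previous snapshot) is one assumed interpretation of that architecture.

```python
BLOCK = 4  # elements per tracked block (illustrative value, an assumption)


class FlatVersions:
    """Flat array: every snapshot copies the whole array."""

    def __init__(self, size):
        self.data = [0] * size
        self.versions = []

    def put(self, i, v):
        self.data[i] = v

    def snapshot(self):
        self.versions.append(list(self.data))
        return len(self.versions) - 1

    def restore(self, ver):
        return list(self.versions[ver])

    def stored_elements(self):
        return sum(len(v) for v in self.versions)


class TrackedVersions:
    """Change tracking (assumed form): a snapshot saves only dirty blocks."""

    def __init__(self, size):
        self.data = [0] * size
        self.dirty = set()   # indices of blocks written since last snapshot
        self.versions = []   # one {block index: block copy} dict per version

    def put(self, i, v):
        self.data[i] = v
        self.dirty.add(i // BLOCK)

    def snapshot(self):
        saved = {b: self.data[b * BLOCK:(b + 1) * BLOCK] for b in self.dirty}
        self.versions.append(saved)
        self.dirty.clear()
        return len(self.versions) - 1

    def restore(self, ver):
        # Rebuild by applying the saved blocks of versions 0..ver in order.
        out = [0] * len(self.data)
        for saved in self.versions[:ver + 1]:
            for b, block in saved.items():
                out[b * BLOCK:b * BLOCK + len(block)] = block
        return out

    def stored_elements(self):
        return sum(len(blk) for s in self.versions for blk in s.values())


class LogVersions:
    """Log-structured array: writes are appended; a version marks a log position."""

    def __init__(self, size):
        self.size = size
        self.log = []    # (index, value) records in write order
        self.marks = []  # log length at each snapshot

    def put(self, i, v):
        self.log.append((i, v))

    def snapshot(self):
        self.marks.append(len(self.log))
        return len(self.marks) - 1

    def restore(self, ver):
        # Replay the log prefix that belongs to the requested version.
        out = [0] * self.size
        for i, v in self.log[:self.marks[ver]]:
            out[i] = v
        return out

    def stored_elements(self):
        return len(self.log)


def run(store):
    """Apply the same small write/snapshot sequence to any of the stores."""
    store.put(3, 7)
    v0 = store.snapshot()
    store.put(3, 9)
    store.put(12, 1)
    v1 = store.snapshot()
    return store.restore(v0), store.restore(v1), store.stored_elements()
```

Calling `run(FlatVersions(16))`, `run(TrackedVersions(16))`, and `run(LogVersions(16))` restores identical contents for both versions, while the number of stored elements differs (32, 12, and 3 respectively in this toy case). The sketch also hints at the trade-off the abstract describes: the flat copy is the cheapest to restore, whereas the log stores the least but has to replay writes.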

References (32)
Christian Bienia, Kai Li. Benchmarking modern multiprocessors. Princeton University (2011).
Hajime Fujita, Nan Dun, Zachary A. Rubenstein, Andrew A. Chien. Log-structured global array for efficient multi-version snapshots. IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, pp. 281-291 (2015). DOI: 10.1109/CCGRID.2015.80
James S. Plank, Kai Li, Micah Beck, Gerry Kingsley. Libckpt: transparent checkpointing under Unix. USENIX Annual Technical Conference, pp. 18-18 (1995).
Franck Cappello, Al Geist, William Gropp, Sanjay Kale, Bill Kramer, Marc Snir. Toward Exascale Resilience: 2014 Update. Supercomputing Frontiers and Innovations: an International Journal, vol. 1, pp. 5-28 (2014). DOI: 10.14529/JSFI140101
Hajime Fujita, Kamil Iskra, Pavan Balaji, Andrew A. Chien. Empirical Comparison of Three Versioning Architectures. International Conference on Cluster Computing, pp. 456-459 (2015). DOI: 10.1109/CLUSTER.2015.69
E. Strohmaier, Hongzhang Shan. Architecture independent performance characterization and benchmarking for scientific applications. Modeling, Analysis, and Simulation on Computer and Telecommunication Systems, pp. 467-474 (2004). DOI: 10.1109/MASCOT.2004.1348302
Bob Boothe. Efficient algorithms for bidirectional debugging. Proceedings of the ACM SIGPLAN 2000 Conference on Programming Language Design and Implementation (PLDI '00), vol. 35, pp. 299-310 (2000). DOI: 10.1145/349299.349339
Guoming Lu, Ziming Zheng, Andrew A. Chien. When is multi-version checkpointing needed? Proceedings of the 3rd Workshop on Fault-tolerance for HPC at Extreme Scale, pp. 49-56 (2013). DOI: 10.1145/2465813.2465821
Leonardo Bautista-Gomez, Seiji Tsuboi, Dimitri Komatitsch, Franck Cappello, Naoya Maruyama, Satoshi Matsuoka. FTI: high performance fault tolerance interface for hybrid systems. IEEE International Conference on High Performance Computing, Data, and Analytics, vol. 32, pp. 32- (2011). DOI: 10.1145/2063384.2063427
Adam Moody, Greg Bronevetsky, Kathryn Mohror, Bronis R. de Supinski. Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System. IEEE International Conference on High Performance Computing, Data, and Analytics, pp. 1-11 (2010). DOI: 10.1109/SC.2010.18