Adaptive event prediction strategy with dynamic time window for large-scale HPC systems

Authors: Ana Gainaru, Franck Cappello, Joshi Fullop, Stefan Trausan-Matu, William Kramer

DOI: 10.1145/2038633.2038637

Abstract: In this paper, we analyse messages generated by different large-scale HPC systems in order to extract sequences of correlated events, which we later use to predict the normal and faulty behaviour of the system. Our method uses a dynamic window strategy that is able to find frequent sequences of events regardless of the time delay between them. Most current related research narrows correlation extraction to fixed and relatively small time windows that do not reflect the whole behaviour of the system. The generated events are in constant change during the lifetime of the machine. We consider it important to update the sequences at runtime by applying modifications after each prediction phase, according to the forecast's accuracy and the difference between what was expected and what really happened. Our experiments show that our analysing system is able to predict around 60% of events with a precision of around 85%, at a lower event granularity than before.

References (20)
Hans Meuer, E. Strohmaier, J. Dongarra, Horst Simon. Top500 Supercomputer Sites. University of Tennessee (1997).
Ana Gainaru, Franck Cappello, Stefan Trausan-Matu, Bill Kramer. Event log mining tool for large scale HPC systems. International Conference on Parallel Processing, pp. 52-64 (2011). DOI: 10.1007/978-3-642-23400-2_6
B. Schroeder, G.A. Gibson. A large-scale study of failures in high-performance computing systems. Dependable Systems and Networks, pp. 249-258 (2006). DOI: 10.1109/DSN.2006.5
Nezih Yigitbasi, Matthieu Gallet, Derrick Kondo, Alexandru Iosup, Dick Epema. Analysis and modeling of time-correlated failures in large-scale distributed systems. Grid Computing, pp. 65-72 (2010). DOI: 10.1109/GRID.2010.5697961
Jian-Guang Lou, Qiang Fu, Yi Wang, Jiang Li. Mining dependency in distributed systems through unstructured logs analysis. ACM SIGOPS Operating Systems Review, vol. 44, pp. 91-96 (2010). DOI: 10.1145/1740390.1740411
Ziming Zheng, Zhiling Lan, Rinku Gupta, Susan Coghlan, Pete Beckman. A practical failure prediction with location and lead time for Blue Gene/P. Dependable Systems and Networks, pp. 15-22 (2010). DOI: 10.1109/DSNW.2010.5542627
R. K. Sahoo, A. J. Oliner, I. Rish, M. Gupta, J. E. Moreira, S. Ma, R. Vilalta, A. Sivasubramaniam. Critical event prediction for proactive management in large-scale computer clusters. Knowledge Discovery and Data Mining, pp. 426-435 (2003). DOI: 10.1145/956750.956799
A. Gara, M. A. Blumrich, D. Chen, G. L.-T. Chiu, P. Coteus, M. E. Giampapa, R. A. Haring, P. Heidelberger, D. Hoenicke, G. V. Kopcsay, T. A. Liebsch, M. Ohmacht, B. D. Steinmacher-Burow, T. Takken, P. Vranas. Overview of the Blue Gene/L system architecture. IBM Journal of Research and Development, vol. 49, pp. 195-212 (2005). DOI: 10.1147/RD.492.0195
Eric Heien, Derrick Kondo, Ana Gainaru, Dan LaPine, Bill Kramer, Franck Cappello. Modeling and tolerating heterogeneous failures in large parallel systems. IEEE International Conference on High Performance Computing, Data, and Analytics, pp. 45- (2011). DOI: 10.1145/2063384.2063444
Adam Oliner, Jon Stearley. What Supercomputers Say: A Study of Five System Logs. Dependable Systems and Networks, pp. 575-584 (2007). DOI: 10.1109/DSN.2007.103