Exploring event correlation for failure prediction in coalitions of clusters

Authors: Song Fu, Cheng-Zhong Xu

DOI: 10.1145/1362622.1362678

Abstract: In large-scale networked computing systems, component failures become norms instead of exceptions. Failure prediction is a crucial technique for self-managing resource burdens. Failure events in coalition systems exhibit strong correlations in the time and space domains. In this paper, we develop a spherical covariance model with an adjustable timescale parameter to quantify the temporal correlation and a stochastic model to describe spatial correlation. We further utilize information on application allocation to discover more correlations among failure instances. We cluster failure events based on their correlations and predict their future occurrences. We implemented a failure prediction framework, called PREdictor of Failure Events Correlated Temporal-Spatially (hPREFECTs), which explores correlations among failures and forecasts the time-between-failure of future instances. We evaluate the performance of hPREFECTs in both offline prediction, by using the Los Alamos HPC traces, and online prediction in an institute-wide clusters coalition environment. Experimental results show the system achieves more than 76% accuracy in offline prediction and more than 70% accuracy in online prediction during the time from May 2006 to April 2007.
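
As a rough illustration of the temporal model named in the abstract, the sketch below implements the textbook spherical covariance function from geostatistics. The abstract does not give the paper's exact parameterization, so the function name, the `sigma2` scale, and the use of `theta` as the adjustable timescale are assumptions made here for illustration only.

```python
def spherical_covariance(h, theta, sigma2=1.0):
    """Textbook spherical covariance (assumed form, not taken from the paper).

    h      -- time lag between two failure events (any consistent unit)
    theta  -- adjustable timescale; correlation is zero for lags >= theta
    sigma2 -- covariance at lag zero (hypothetical scale parameter)
    """
    if theta <= 0:
        raise ValueError("theta must be positive")
    h = abs(h)
    if h >= theta:
        return 0.0
    r = h / theta
    # Decays smoothly from sigma2 at lag 0 to 0 at lag theta.
    return sigma2 * (1.0 - 1.5 * r + 0.5 * r ** 3)
```

Under this form, two failures separated by half the timescale (for example `spherical_covariance(5, 10)`) keep a covariance of 0.3125 of the lag-zero value, and failures separated by `theta` or more are treated as temporally uncorrelated.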
