A Survey on Failure Prediction of Large-Scale Server Clusters

Authors: Hongmei Yang, Yongquan Liang, Lianshan Liu, Haibin Sun

DOI: 10.1109/SNPD.2007.106

Abstract: As the size and complexity of cluster systems grow, failure rates accelerate dramatically. To reduce the damage caused by failures, it is desirable to identify potential failures ahead of their occurrence. In this paper, we survey the state of the art in failure prediction for cluster systems. The characteristics of failures are addressed, and some statistical results are shown. We explore ways of collecting and preprocessing data for failure prediction, and suggest a procedure for preprocessing the records in automatically generated log files. The main ideas of five prediction methods, based on thresholds, time series analysis, rule-based classification, Bayesian network models, and semi-Markov process models, are analyzed in turn. In addition, with regard to accuracy and practicality, we present metrics for evaluating prediction techniques and compare the techniques using these metrics.
