Online failure prediction in cloud datacenters by real-time message pattern learning

Authors: Yukihiro Watanabe, Hiroshi Otsuka, Masataka Sonoda, Shinji Kikuchi, Yasuhide Matsumoto

DOI: 10.1109/CLOUDCOM.2012.6427566

Abstract: Once failures occur in a cloud datacenter accommodating a large number of virtual resources, they tend to spread rapidly and widely, impacting many users (tenant owners). One of the best ways to prevent a failure from spreading in the system is to identify signs of the failure before its occurrence and deal with it proactively before it causes serious problems. Although several approaches have been proposed to predict failures by analyzing past system message logs and identifying the relationship between the messages and the failures, it is still difficult to predict failures automatically, for reasons such as the various types of log message formats or the time gaps between pattern learning and the application of identified patterns in real systems. Based on this understanding, we propose in this paper a new failure prediction method which automatically learns message patterns as signs of failure by classifying messages by their similarity, without depending on their format, and by re-learning the patterns in frequently-changed configurations. We implemented our approach and evaluated it using actual message data recorded in an actual cloud datacenter. The experimental result shows that our approach predicted failures with 80% precision and covered 90% of failure occurrences.
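
The idea of classifying messages by similarity rather than by log format can be illustrated with a minimal Python sketch. This is not the authors' implementation: the tokenizer, the Jaccard similarity measure, the 0.5 threshold, the sample log lines, and all function names are assumptions chosen for illustration only. The sketch also assumes that "precision" and "covered ... failure occurrences" correspond to the usual precision and recall definitions.

```python
import re


def tokenize(message):
    """Reduce a raw log line to a set of lowercase alphabetic words.

    Digits and punctuation are ignored, so timestamps, host numbers and
    device indices do not affect the comparison (an assumed heuristic).
    """
    return frozenset(re.findall(r"[a-z]+", message.lower()))


def jaccard(a, b):
    """Jaccard similarity between two token sets (1.0 if both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def classify(messages, threshold=0.5):
    """Give each message the id of the first group it is similar enough to.

    The first message of a group acts as that group's representative.
    No knowledge of the log format is used, only word overlap.
    """
    representatives = []
    labels = []
    for message in messages:
        tokens = tokenize(message)
        for group_id, representative in enumerate(representatives):
            if jaccard(tokens, representative) >= threshold:
                labels.append(group_id)
                break
        else:
            # No existing group is close enough: start a new one.
            representatives.append(tokens)
            labels.append(len(representatives) - 1)
    return labels


def precision_recall(predicted, actual):
    """Precision and recall of a set of predicted failures against actual ones."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical log lines in two different formats.
SAMPLE_LOGS = [
    "2012-01-03 10:00:01 host12 kernel: I/O error on device sda1",
    "Jan 3 10:00:05 host7 kernel: I/O error on device sdb2",
    "[WARN] vm-42 migration timeout after 30 s",
    "[WARN] vm-7 migration timeout after 45 s",
]

if __name__ == "__main__":
    print(classify(SAMPLE_LOGS))  # [0, 0, 1, 1]
```

In this sketch the two disk-error lines fall into the same group even though their timestamp formats differ, because only word overlap is compared. A sequence of such group ids observed before a failure is the kind of input a pattern learner could then work with.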
