Authors: Song Fu, Cheng-Zhong Xu
Keywords:
Abstract: In large-scale networked computing systems, component failures become norms instead of exceptions. Failure prediction is a crucial technique for self-managing resource burdens. Failure events in coalition systems exhibit strong correlations in the time and space domains. In this paper, we develop a spherical covariance model with an adjustable timescale parameter to quantify the temporal correlation and a stochastic model to describe spatial correlation. We further utilize information about application allocation to discover more correlations among failure instances. We cluster failure events based on their correlations and predict their future occurrences. We implemented a failure prediction framework, called PREdictor of Failure Events Correlated Temporal-Spatially (hPREFECTs), which explores correlations among failures and forecasts the time-between-failure of future instances. We evaluate the performance of hPREFECTs in both offline prediction, by using the Los Alamos HPC traces, and online prediction in an institute-wide clusters coalition environment. Experimental results show that the system achieves more than 76% accuracy in offline prediction and more than 70% accuracy in online prediction during the period from May 2006 to April 2007.