Towards intelligent incident management: why we need it and how we make it

Authors: Zhuangbin Chen, Yu Kang, Liqun Li, Xu Zhang, Hongyu Zhang

DOI: 10.1145/3368089.3417055

Abstract: The management of cloud service incidents (unplanned interruptions or outages of a service/product) greatly affects customer satisfaction and business revenue. After years of effort, cloud enterprises are able to resolve most incidents automatically and in a timely manner. In practice, however, we still observe critical service incidents that occur in an unexpected manner and that the orchestrated diagnosis workflow fails to mitigate. To accelerate the understanding of unprecedented incidents and provide actionable recommendations, modern incident management systems employ the strategy of AIOps (Artificial Intelligence for IT Operations). In this paper, to provide a broad view of industrial incident management and to understand the modern incident management system, we conduct a comprehensive empirical study spanning over two years of incident management practices at Microsoft. In particular, we identify two critical challenges (namely, incomplete service/resource dependencies and imprecise resource health assessment) and investigate the underlying reasons from the perspective of cloud system design and operations. We also present IcM BRAIN, our AIOps framework towards intelligent incident management, and show its practical benefits conveyed to the cloud services of Microsoft.
