Authors: Zhuangbin Chen, Yu Kang, Liqun Li, Xu Zhang, Hongyu Zhang
Keywords:
Abstract: The management of cloud service incidents (unplanned interruptions or outages of a service/product) greatly affects customer satisfaction and business revenue. After years of effort, cloud enterprises are able to resolve most incidents automatically and in a timely manner. However, in practice, we still observe critical service incidents that occur in an unexpected manner and that the orchestrated diagnosis workflow fails to mitigate. To accelerate the understanding of unprecedented incidents and provide actionable recommendations, modern incident management systems employ the strategy of AIOps (Artificial Intelligence for IT Operations). In this paper, to provide a broad view of industrial incident management and to understand the modern incident management system, we conduct a comprehensive empirical study spanning over two years of incident management practices at Microsoft. In particular, we identify two critical challenges (namely, incomplete service/resource dependencies and imprecise resource health assessment) and investigate the underlying reasons from the perspective of cloud system design and operations. We also present IcM BRAIN, our AIOps framework towards intelligent incident management, and show the practical benefits it has conveyed to the cloud services of Microsoft.