Scouts: Improving the Diagnosis Process Through Domain-customized Incident Routing

Authors: Jiaqi Gao, Nofel Yaseen, Robert MacDavid, Felipe Vieira Frujeri, Vincent Liu

DOI: 10.1145/3387514.3405867

Keywords: Domain (software engineering), Data center, Service level objective, Routing (electronic design automation), Production (economics), Cloud computing, Process (computing), Computer network, Service (systems architecture), Computer science

Abstract: Incident routing is critical for maintaining service level objectives in the cloud: the time-to-diagnosis can increase by 10x due to mis-routings. Properly routing incidents is challenging because of the complexity of today's data center (DC) applications and their dependencies. For instance, an application running on a VM might rely on a functioning host-server, a remote-storage service, and virtual and physical network components. It is hard for any one team, rule-based system, or even machine learning solution to fully learn the complexity and solve the incident routing problem. We propose a different approach using per-team Scouts. Each team's Scout acts as its gate-keeper: it routes relevant incidents to the team and routes away unrelated ones. We solve the problem through a collection of these Scouts. Our PhyNet Scout alone, currently deployed in production, reduces the time-to-mitigation of 65% of mis-routed incidents in our dataset.
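
The gate-keeper idea in the abstract can be illustrated with a minimal Python sketch. It only shows the routing shape (one Scout per team, each independently accepting or routing away an incident). The `Incident` record, the `keyword_scout` helper, the `route` function, and the keyword lists are all hypothetical placeholders introduced here; the abstract does not say how a real Scout makes its decision, and a keyword match is not the paper's method.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Incident:
    """Hypothetical incident record; the real schema is not given in the abstract."""
    incident_id: str
    description: str


# In this sketch a Scout is a function answering: "is my team likely responsible?"
Scout = Callable[[Incident], bool]


def keyword_scout(keywords: List[str]) -> Scout:
    """Toy stand-in for a per-team Scout (an assumption, not the paper's model)."""
    lowered = [k.lower() for k in keywords]

    def scout(incident: Incident) -> bool:
        text = incident.description.lower()
        return any(k in text for k in lowered)

    return scout


def route(incident: Incident, scouts: Dict[str, Scout]) -> List[str]:
    """Return the teams whose Scout accepts the incident; all others route it away."""
    return [team for team, scout in scouts.items() if scout(incident)]


# Hypothetical team names and keywords, chosen only for illustration.
scouts = {
    "PhyNet": keyword_scout(["switch", "packet loss", "link"]),
    "Storage": keyword_scout(["disk", "blob", "storage"]),
}

incident = Incident("inc-1", "VM reports packet loss to a remote storage service")
candidates = route(incident, scouts)
print(candidates)
```

Because each Scout decides only for its own team, a team could add or replace its Scout without changing the others, which is the property the per-team design relies on.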
