Authors: Jiaqi Gao, Nofel Yaseen, Robert MacDavid, Felipe Vieira Frujeri, Vincent Liu
Keywords: Domain (software engineering), Data center, Service level objective, Routing (electronic design automation), Production (economics), Cloud computing, Process (computing), Computer network, Service (systems architecture), Computer science
Abstract: Incident routing is critical for maintaining service level objectives in the cloud: the time-to-diagnosis can increase by 10x due to mis-routings. Properly routing incidents is challenging because of the complexity of today's data center (DC) applications and their dependencies. For instance, an application running on a VM might rely on a functioning host-server, remote-storage service, and virtual and physical network components. It is hard for any one team, rule-based system, or even machine learning solution to fully learn the complexity and solve the incident routing problem. We propose a different approach using per-team Scouts. Each team's Scout acts as its gate-keeper: it routes relevant incidents to the team and routes away unrelated ones. We solve the problem through a collection of these Scouts. Our PhyNet Scout alone, currently deployed in production, reduces the time-to-mitigation of 65% of mis-routed incidents in our dataset.