Authors: Tarek F. Abdelzaher, Mohammad Maifi Hasan Khan
DOI:
Keywords:
Abstract: The term “interactive complexity” was introduced by Charles Perrow in his famous book Normal Accidents: Living with High-Risk Technologies [1]. He used the term to describe the tendency of the components of large systems to interact with one another. He argued that, in systems with a large number of components, multiple failures often interact in some unexpected way, leading to catastrophic failures in systems such as planes or nuclear power plants. He also suggested that, with increasing interactive complexity and tight coupling, such interactions are bound to happen. Indeed, with the proliferation of cheap, Internet-enabled embedded devices with built-in sensors and actuators (e.g., smart phones and smart appliances), the physical world is increasingly becoming an integral part of logical computation. As computing systems become much more responsive to their surrounding environments, it becomes difficult to test them to their full extent before deployment in the real world. Hence, due to the increased coupling between the physical and logical worlds, systems may fail or perform poorly once deployed in real life. Unintended interactions among various system components across different environments are to blame for this problem. With this growing trend, bugs that arise from the interaction of different components of distributed systems or nodes are likely to get worse and will significantly affect system reliability. This calls for new tools and techniques to troubleshoot future software systems.

In this dissertation, we address the significant challenge of troubleshooting emerging cyber-physical systems using data mining techniques. More specifically, we applied a discriminative sequence mining algorithm to system logs to isolate chains of events (not necessarily contiguous) that are causally correlated with failure. In the first part of the thesis, using our tool, we successfully identified an interactive complexity bug in a multi-channel MAC (medium access control) layer protocol for wireless sensor networks [2], a kernel-level race condition bug in the LiteOS operating system, and a corner-case design flaw in directed diffusion [3]. Next, we extended the approach to identify “symbolic” patterns, in which absolute values are replaced by abstract symbols whenever appropriate so that subtle patterns can be captured, and we examined its applicability to identifying harmful interactions that may arise from the poor integration of adaptive components in server clusters. We further extended the approach to identify “cyclic” patterns in data center applications, which can highlight potential self-reinforcing loops.
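Finally, to complement our work on interactive complexity, we address the problem of diagnosing the occasional “lack of interaction” in the system. Such a lack of interaction may be caused by unresponsive nodes. We develop powertracer, an in-situ tele-diagnostic tool that uses external power measurements to determine the internal health of a host and the most likely cause of its failure. Using powertracer, we distinguish several categories of failure behavior, including energy depletion, antenna damage, radio disconnection, system crashes, and anomalous reboots. To the best of our knowledge, we are the first to present a diagnostic tool that diagnoses unresponsive nodes remotely.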
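The discriminative sequence mining step summarized in the abstract can be illustrated with a small sketch. It assumes that each log is a list of event names (the events `tx`, `beacon`, `chan_switch`, `timeout` and the functions `subsequences`, `support`, and `discriminative_patterns` are all hypothetical) and that a pattern is scored by the fraction of failed runs containing it minus the fraction of successful runs containing it. That scoring rule is an illustrative assumption, not necessarily the dissertation's exact algorithm.

```python
from collections import Counter
from itertools import combinations


def subsequences(log, max_len):
    """Return every ordered, not-necessarily-contiguous subsequence of
    `log` with length 1..max_len, deduplicated per log."""
    found = set()
    for k in range(1, max_len + 1):
        # combinations() yields index tuples in increasing order, so each
        # result keeps the original event order while allowing gaps.
        for idx in combinations(range(len(log)), k):
            found.add(tuple(log[i] for i in idx))
    return found


def support(logs, max_len):
    """Fraction of logs containing each subsequence at least once.
    Assumes `logs` is non-empty."""
    counts = Counter()
    for log in logs:
        counts.update(subsequences(log, max_len))
    return {pat: c / len(logs) for pat, c in counts.items()}


def discriminative_patterns(good_logs, bad_logs, max_len=2, top=3):
    """Rank patterns by how much more often they occur in failed runs
    than in successful ones (hypothetical scoring: bad - good support)."""
    good = support(good_logs, max_len)
    bad = support(bad_logs, max_len)
    scored = [(bad[p] - good.get(p, 0.0), p) for p in bad]
    # Highest score first; among ties prefer longer chains, then
    # lexicographic order so the ranking is deterministic.
    scored.sort(key=lambda sp: (-sp[0], -len(sp[1]), sp[1]))
    return scored[:top]


# Hypothetical logs from successful and failed runs.
good_logs = [["tx", "ack"], ["chan_switch", "tx", "ack"]]
bad_logs = [["tx", "chan_switch", "timeout"],
            ["tx", "beacon", "chan_switch", "timeout"]]

print(discriminative_patterns(good_logs, bad_logs))
# [(1.0, ('chan_switch', 'timeout')), (1.0, ('tx', 'chan_switch')), (1.0, ('tx', 'timeout'))]
```

The pattern `("tx", "chan_switch")` is reported even though a `beacon` event separates the two events in the second failed log, which is what “not necessarily contiguous” means. Brute-force enumeration grows combinatorially with log length, so a practical miner would prune candidates (for example, with Apriori-style support thresholds); that refinement is omitted here.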