Fault-Tolerance and High Availability in Data Stream Management Systems.

Authors: Magdalena Balazinska, Mehul A. Shah, Jeong-Hyon Hwang

Abstract: Definition. Just like any other software system, a data stream management system (DSMS) can experience failures of its different components. Failures are especially common in distributed DSMSs, where query operators are spread across multiple processing nodes, i.e., independent processes typically running on different physical machines in a local-area network (LAN) or a wide-area network (WAN). Failures of processing nodes or of the underlying communication network can cause continuous queries (CQ) in a DSMS to stall or produce erroneous results. These failures can adversely affect critical client applications relying on these queries. Traditionally, availability has been defined as the fraction of time that a system remains operational and properly servicing requests. In DSMSs, however, availability often also incorporates end-to-end latencies, as applications need to quickly react to real-time events and thus can tolerate only small delays. A DSMS can handle failures using a variety of techniques that offer different levels of availability depending on application needs. All fault-tolerance methods rely on some form of replication, where the volatile state is stored in multiple, independent locations to protect against failures. This article describes several such methods for DSMSs that offer different trade-offs between availability and runtime overhead while maintaining consistency. For cases of network partitions, it outlines techniques that avoid stalling the query at the cost of temporary inconsistency, thereby providing the highest availability. This article focuses on failures within a DSMS and does not discuss failures of the data sources or client applications.
