Bimodal content defined chunking for backup streams

Authors: Erik Kruus, Cristian Ungureanu, Cezary Dubnicki

DOI: 10.5555/1855511.1855529

Keywords: Redundancy (engineering), Computer science, Duplicate content, Distributed data store, Backup, Scalability, Data mining, Standard algorithms, Data deduplication, Data stream

Abstract: Data deduplication has become a popular technology for reducing the amount of storage space necessary for backup and archival data. Content defined chunking (CDC) techniques are well established methods of separating a data stream into variable-size chunks such that duplicate content has a good chance of being discovered irrespective of its position in the stream. Requirements for CDC include fast and scalable operation, as well as achieving good duplicate elimination. While the latter can be achieved by using chunks of small average size, this also increases the metadata needed to store the relatively more numerous chunks, which negatively impacts the system's performance. We propose a new approach that achieves comparable duplicate elimination while using chunks of larger average size. It involves two chunk size targets, and mechanisms to dynamically switch between the two based on querying data already stored; we use small chunks in limited regions of transition from duplicate to nonduplicate data, and elsewhere we use large chunks. The algorithms rely on the block store's ability to quickly deliver a high-quality reply to existence queries for already-stored blocks. A chunking decision is made with limited lookahead and a small number of queries. We present results of running these algorithms on actual backup data, as well as four sets of source code archives. Our algorithms typically achieve duplicate elimination similar to that of standard CDC algorithms while using chunks 2-4 times as large. Such approaches may be particularly interesting for distributed storage systems using redundancy (such as error-correcting codes) requiring multiple chunk fragments, for which the overheads per stored chunk are high. We find that algorithm variants with more flexibility in the location of chunking decisions yield better duplicate elimination, at the cost of a higher number of existence queries.
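The switching mechanism the abstract describes can be pictured concretely. Below is a minimal Python sketch of a "break-apart" style bimodal chunker, assuming a placeholder boundary function in place of a real Rabin rolling fingerprint and a caller-supplied `exists()` existence query; the function names, chunk-size constants, and neighbor-based transition test are illustrative assumptions, not the paper's exact algorithm (which decides with limited lookahead over a window of queries).

```python
import hashlib

SMALL_AVG = 4 * 1024   # small-chunk target size (illustrative value)
LARGE_AVG = 16 * 1024  # large-chunk target size (illustrative value)

def cdc_cutpoints(data, avg_size):
    """Stand-in for a content-defined chunker.

    A real implementation would slide a Rabin fingerprint over the data
    and cut wherever (fingerprint mod avg_size) hits a fixed value.  To
    keep the sketch self-contained, we instead derive a deterministic,
    content-dependent step from a hash of the upcoming window.
    """
    cuts, pos = [], 0
    while pos < len(data):
        window = data[pos:pos + 2 * avg_size]
        h = int.from_bytes(hashlib.sha1(window).digest()[:4], "big")
        pos = min(pos + avg_size // 2 + h % avg_size, len(data))
        cuts.append(pos)
    return cuts

def chunk(data, avg_size):
    """Split `data` into chunks at content-defined cut points."""
    chunks, prev = [], 0
    for cut in cdc_cutpoints(data, avg_size):
        chunks.append(data[prev:cut])
        prev = cut
    return chunks

def bimodal_chunk(data, exists):
    """Break-apart bimodal chunking (sketch of one variant).

    `exists(chunk)` is the block store's existence query.  Large chunks
    already present in the store are emitted whole; a nonduplicate large
    chunk adjacent to a duplicate one marks a duplicate/nonduplicate
    transition region and is re-chunked at the small target; chunks deep
    inside fresh data also stay large.
    """
    big = chunk(data, LARGE_AVG)
    dup = [exists(c) for c in big]
    out = []
    for i, c in enumerate(big):
        if dup[i]:
            out.append(c)                    # duplicate region: keep large
        elif (i > 0 and dup[i - 1]) or (i + 1 < len(dup) and dup[i + 1]):
            out.extend(chunk(c, SMALL_AVG))  # transition region: emit small
        else:
            out.append(c)                    # fresh data: keep large
    return out
```

With a real block store, `exists` would be backed by the store's existence query (for example, `lambda c: hashlib.sha256(c).digest() in stored_hashes` over previously written chunks). The transition test is what confines small chunks to duplicate/nonduplicate boundaries, so long runs of either fresh or duplicate data keep the large target size and its lower per-chunk metadata overhead.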

References (34)
Jerzy Szczepkowski, Michal Welnicki, Lukasz Heldt, Wojciech Kilian, Cristian Ungureanu, Michal Kaczmarczyk, Przemyslaw Strzelczak, Cezary Dubnicki, Leszek Gryz, HYDRAstor: a Scalable Secondary Storage. File and Storage Technologies, pp. 197-210 (2009)
Kave Eshghi, Mark Lillibridge, Deepavali Bhagwat, Peter Camble, Vinay Deolalikar, Greg Trezise, Sparse indexing: large scale, inline deduplication using sampling and locality. File and Storage Technologies, pp. 111-123 (2009)
Fred Douglis, Arun Iyengar, Application-specific Delta-encoding via Resemblance Detection. USENIX Annual Technical Conference, pp. 113-126 (2003)
Timothy E. Denehy, Windsor W. Hsu, Duplicate Management for Reference Data (2004)
Christos T. Karamanolis, Lawrence You, Evaluation of Efficient Archival Storage Techniques. MSST, pp. 227-232 (2004)
Kai Li, Hugo Patterson, Benjamin Zhu, Avoiding the disk bottleneck in the Data Domain deduplication file system. File and Storage Technologies, pp. 18- (2008)
Paul Mackerras, Andrew Tridgell, The rsync algorithm. The Australian National University (1996)
Calicrates Policroniades, Ian Pratt, Alternatives for detecting redundancy in storage systems data. USENIX Annual Technical Conference, pp. 6-6 (2004)
Hanping Lufei, Weisong Shi, Lucia Zamorano, On the Effects of Bandwidth Reduction Techniques in Distributed Applications. Embedded and Ubiquitous Computing, pp. 796-806 (2004), 10.1007/978-3-540-30121-9_76
Andrei Z. Broder, Some applications of Rabin's fingerprinting method. Springer, New York, NY, pp. 143-152 (1993), 10.1007/978-1-4613-9323-8_11