Authors: Erik Kruus, Cristian Ungureanu, Cezary Dubnicki
Keywords: Redundancy (engineering), Computer science, Duplicate content, Distributed data store, Backup, Scalability, Data mining, Standard algorithms, Data deduplication, Data stream
Abstract: Data deduplication has become a popular technology for reducing the amount of storage space necessary for backup and archival data. Content defined chunking (CDC) techniques are well-established methods of separating a data stream into variable-size chunks such that duplicate content has a good chance of being discovered irrespective of its position in the stream. Requirements for CDC include fast and scalable operation, as well as achieving good duplicate elimination. While the latter can be achieved by using chunks of small average size, this also increases the amount of metadata needed to store the relatively more numerous chunks, which impacts the system's performance negatively. We propose a new approach that achieves comparable duplicate elimination while using chunks of larger average size. It involves two chunk size targets and mechanisms to dynamically switch between them based on querying data already stored; we use small chunks in limited regions of transition from duplicate to nonduplicate data, and large chunks elsewhere. The algorithms rely on the block store's ability to quickly deliver a high-quality reply to existence queries for already-stored blocks. A chunking decision is made with limited lookahead and a limited number of queries. We present results of running these algorithms on actual backup data, as well as on four sets of source code archives. Our algorithms typically achieve duplicate elimination similar to that of standard algorithms while using chunks 2-4 times as large. Such approaches may be particularly interesting for distributed storage systems that use redundancy techniques (such as error-correcting codes) requiring multiple chunk fragments, for which the overheads per stored chunk are high. We find that algorithm variants with more flexibility in the location of chunk boundaries yield better duplicate elimination, at the cost of a higher number of existence queries.
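To illustrate the general idea described in the abstract, the sketch below shows a minimal bimodal content-defined chunker in Python. It is an assumption-laden illustration, not the authors' implementation: it uses a windowed CRC as the cut-point test (a real system would use a proper rolling hash such as a Rabin fingerprint), an in-memory set of fingerprints as a stand-in for the block store's existence queries, and illustrative parameter names and sizes. Big chunks are emitted by default; a new (not yet stored) big chunk that borders already-stored data is treated as a transition region and re-chunked at the small target size.

```python
# Sketch of bimodal content-defined chunking (illustrative only).
import hashlib
import zlib

WINDOW = 48          # bytes hashed at each candidate cut point (assumed)
SMALL_DIV = 1 << 10  # expected small-chunk size ~1 KiB (assumed)
BIG_DIV = 1 << 12    # expected big-chunk size ~4 KiB; a multiple of SMALL_DIV,
                     # so every big cut point is also a small cut point

def fingerprint(chunk: bytes) -> bytes:
    return hashlib.sha1(chunk).digest()

def cut_points(data: bytes, divisor: int) -> list:
    """Content-defined cut points: positions where the CRC of the trailing
    window hits the divisor's magic value, plus the end of the stream.
    (Recomputing the CRC per position is O(n*W); a rolling hash avoids this.)"""
    cuts = [i for i in range(WINDOW, len(data))
            if zlib.crc32(data[i - WINDOW:i]) % divisor == divisor - 1]
    cuts.append(len(data))
    return cuts

def chunk_bimodal(data: bytes, store: set) -> list:
    """Chunk at the big target size, but re-chunk a big chunk at the small
    target when it is new and borders already-stored data (a transition
    region). `store` holds fingerprints of previously stored chunks and
    stands in for the block store's existence query."""
    big_cuts = cut_points(data, BIG_DIV)
    small_cuts = cut_points(data, SMALL_DIV)
    bounds = [0] + big_cuts
    big_chunks = [data[a:b] for a, b in zip(bounds, bounds[1:])]
    known = [fingerprint(c) in store for c in big_chunks]  # one query per big chunk

    chunks = []
    for i, chunk in enumerate(big_chunks):
        at_transition = (not known[i]) and (
            (i > 0 and known[i - 1]) or (i + 1 < len(known) and known[i + 1]))
        if at_transition:
            # Use the small-granularity cut points that fall inside this region.
            lo, hi = bounds[i], bounds[i + 1]
            inner = [p for p in small_cuts if lo < p < hi]
            for a, b in zip([lo] + inner, inner + [hi]):
                chunks.append(data[a:b])
        else:
            chunks.append(chunk)
    for c in chunks:
        store.add(fingerprint(c))
    return chunks
```

Choosing BIG_DIV as a multiple of SMALL_DIV with the magic value `divisor - 1` guarantees that every big cut point is also a small cut point, so re-chunking a transition region never produces boundaries that conflict with the big-chunk boundaries around it. The abstract's lookahead and query budget are simplified here to one existence query per big chunk and a one-chunk neighborhood.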