Efficient Similarity Search in Very Large String Sets

作者： Dandy Fenz , Dustin Lange , Astrid Rheinländer , Felix Naumann , Ulf Leser

关键词:

摘要: String similarity search is required by many real-life applications, such as spell checking, data cleansing, fuzzy keyword search, or comparison of DNA sequences. Given a very large string set and query string, the problem to efficiently find all strings in that are similar string. Similarity defined using (or distance) measure, edit distance Hamming distance. In this paper, we introduce State Set Index (SSI) an efficient solution for problem. SSI based on trie (prefix index) interpreted nondeterministic finite automaton. SSI implements novel state labeling strategy making index highly space-efficient. Furthermore, SSI's space consumption can be gracefully traded against time. We evaluated different sets person names with up 170 million from social network compared it other state-of-the-art methods. We show majority cases, significantly faster than tools requires less space.

springer.com 本地加速

uni-trier.de 本地加速

springer.com 本地加速

springerlink.com 本地加速

doi.org 本地加速

sci-hub.st HTML 下载加速

参考文章(30)

Gösta Grahne, Jianfei Zhu, Efficiently Using Prefix-trees in Mining Frequent Itemsets. FIMI. ,(2003)

Dan Gusfield, Algorithms on Strings, Trees, and Sequences: Suffix Trees and Their Uses ,(1997) , 10.1017/CBO9780511574931

Erkki Sutinen, Ricardo A. Baeza-Yates, Jorma Tarhio, Gonzalo Navarro, Indexing methods for approximate string matching IEEE Data(base) Engineering Bulletin. ,vol. 24, pp. 19- 27 ,(2001)

Maxime Crochemore, Costas Iliopoulos, Marcin Kubica, Jakub Radoszewski, Wojciech Rytter, Tomasz Waleń, Extracting powers and periods in a string from its runs structure string processing and information retrieval. ,vol. 6393, pp. 258- 269 ,(2010) , 10.1007/978-3-642-16321-0_27

Ravindranath Jampani, Vikram Pudi, Using prefix-trees for efficiently computing set joins database systems for advanced applications. pp. 761- 772 ,(2005) , 10.1007/11408079_69

Astrid Rheinländer, Martin Knobloch, Nicky Hochmuth, Ulf Leser, Prefix tree indexing for similarity search and similarity joins on genomic data statistical and scientific database management. pp. 519- 536 ,(2010) , 10.1007/978-3-642-13818-8_36

Astrid Rheinländer, Ulf Leser, Scalable sequence similarity search and join in main memory on multi-cores international conference on parallel processing. pp. 13- 22 ,(2011) , 10.1007/978-3-642-29740-3_3

S. Alireza Aghili, Divyakant Agrawal, Amr El Abbadi, BFT: Bit Filtration Technique for Approximate String Join in Biological Databases string processing and information retrieval. pp. 326- 340 ,(2003) , 10.1007/978-3-540-39984-1_25

Scientific and Statistical Database Management Lecture Notes in Computer Science. ,vol. 5566, ,(2009) , 10.1007/978-3-642-02279-1

10.

V. I. Levenshtein, Binary codes capable of correcting deletions, insertions, and reversals Soviet physics. Doklady. ,vol. 10, pp. 707- 710 ,(1966)

Efficient Similarity Search in Very Large String Sets

来源期刊

我的账户

Efficient Similarity Search in Very Large String Sets

来源期刊

相似文章 10

我的账户