Efficient Similarity Search in Very Large String Sets

作者: Dandy Fenz , Dustin Lange , Astrid Rheinländer , Felix Naumann , Ulf Leser

DOI: 10.1007/978-3-642-31235-9_18

关键词:

摘要: String similarity search is required by many real-life applications, such as spell checking, data cleansing, fuzzy keyword search, or comparison of DNA sequences. Given a very large string set and query string, the problem to efficiently find all strings in that are similar string. Similarity defined using (or distance) measure, edit distance Hamming distance. In this paper, we introduce State Set Index (SSI) an efficient solution for problem. SSI based on trie (prefix index) interpreted nondeterministic finite automaton. SSI implements novel state labeling strategy making index highly space-efficient. Furthermore, SSI's space consumption can be gracefully traded against time. We evaluated different sets person names with up 170 million from social network compared it other state-of-the-art methods. We show majority cases, significantly faster than tools requires less space.

参考文章(30)
Gösta Grahne, Jianfei Zhu, Efficiently Using Prefix-trees in Mining Frequent Itemsets. FIMI. ,(2003)
Erkki Sutinen, Ricardo A. Baeza-Yates, Jorma Tarhio, Gonzalo Navarro, Indexing methods for approximate string matching IEEE Data(base) Engineering Bulletin. ,vol. 24, pp. 19- 27 ,(2001)
Maxime Crochemore, Costas Iliopoulos, Marcin Kubica, Jakub Radoszewski, Wojciech Rytter, Tomasz Waleń, Extracting powers and periods in a string from its runs structure string processing and information retrieval. ,vol. 6393, pp. 258- 269 ,(2010) , 10.1007/978-3-642-16321-0_27
Ravindranath Jampani, Vikram Pudi, Using prefix-trees for efficiently computing set joins database systems for advanced applications. pp. 761- 772 ,(2005) , 10.1007/11408079_69
Astrid Rheinländer, Martin Knobloch, Nicky Hochmuth, Ulf Leser, Prefix tree indexing for similarity search and similarity joins on genomic data statistical and scientific database management. pp. 519- 536 ,(2010) , 10.1007/978-3-642-13818-8_36
Astrid Rheinländer, Ulf Leser, Scalable sequence similarity search and join in main memory on multi-cores international conference on parallel processing. pp. 13- 22 ,(2011) , 10.1007/978-3-642-29740-3_3
S. Alireza Aghili, Divyakant Agrawal, Amr El Abbadi, BFT: Bit Filtration Technique for Approximate String Join in Biological Databases string processing and information retrieval. pp. 326- 340 ,(2003) , 10.1007/978-3-540-39984-1_25
Scientific and Statistical Database Management Lecture Notes in Computer Science. ,vol. 5566, ,(2009) , 10.1007/978-3-642-02279-1
V. I. Levenshtein, Binary codes capable of correcting deletions, insertions, and reversals Soviet physics. Doklady. ,vol. 10, pp. 707- 710 ,(1966)