作者: Dandy Fenz , Dustin Lange , Astrid Rheinländer , Felix Naumann , Ulf Leser
DOI: 10.1007/978-3-642-31235-9_18
关键词:
摘要: String similarity search is required by many real-life applications, such as spell checking, data cleansing, fuzzy keyword search, or comparison of DNA sequences. Given a very large string set and query string, the problem to efficiently find all strings in that are similar string. Similarity defined using (or distance) measure, edit distance Hamming distance. In this paper, we introduce State Set Index (SSI) an efficient solution for problem. SSI based on trie (prefix index) interpreted nondeterministic finite automaton. SSI implements novel state labeling strategy making index highly space-efficient. Furthermore, SSI's space consumption can be gracefully traded against time. We evaluated different sets person names with up 170 million from social network compared it other state-of-the-art methods. We show majority cases, significantly faster than tools requires less space.