MIST: Top-k Approximate Sub-string Mining Using Triplet Statistical Significance

作者: Sourav Dutta

DOI: 10.1007/978-3-319-16354-3_31

关键词:

摘要: Efficient extraction of strings or sub-strings similar to an input query string forms a necessity in applications like instant search, record linkage, etc., where the similarity between two is usually quantified by edit distance. This paper proposes novel top-k approximate sub-string matching algorithm, MIST, for given query, based on Chi-squared statistical significance triplets, thereby avoiding expensive distance computation. Experiments with real-life data validate run-time effectiveness and accuracy our algorithm.

参考文章(22)
Silviu Cucerzan, Eric Brill, Spelling Correction as an Iterative Process that Exploits the Collective Knowledge of Web Users. empirical methods in natural language processing. pp. 293- 300 ,(2004)
Dandy Fenz, Dustin Lange, Astrid Rheinländer, Felix Naumann, Ulf Leser, Efficient Similarity Search in Very Large String Sets Lecture Notes in Computer Science. pp. 262- 279 ,(2012) , 10.1007/978-3-642-31235-9_18
Berthier Ribeiro-Neto, Ricardo Baeza-Yates, Modern Information Retrieval: The Concepts and Technology Behind Search ,(2011)
Sourav Dutta, Arnab Bhattacharya, Most significant substring mining based on chi-square measure knowledge discovery and data mining. pp. 319- 327 ,(2010) , 10.1007/978-3-642-13657-3_35
Masaru Kitsuregawa, Zhenglu Yang, Jianjun Yu, Fast algorithms for top-k approximate string matching national conference on artificial intelligence. pp. 1467- 1473 ,(2010)
Scientific and Statistical Database Management Lecture Notes in Computer Science. ,vol. 5566, ,(2009) , 10.1007/978-3-642-02279-1
V. I. Levenshtein, Binary codes capable of correcting deletions, insertions, and reversals Soviet physics. Doklady. ,vol. 10, pp. 707- 710 ,(1966)
Eamonn Keogh, Stefano Lonardi, Bill 'Yuan-chi' Chiu, Finding surprising patterns in a time series database in linear time and space Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining - KDD '02. pp. 550- 556 ,(2002) , 10.1145/775047.775128
Gonzalo Navarro, A guided tour to approximate string matching ACM Computing Surveys. ,vol. 33, pp. 31- 88 ,(2001) , 10.1145/375360.375365
Karen Kukich, Techniques for automatically correcting words in text ACM Computing Surveys. ,vol. 24, pp. 377- 439 ,(1992) , 10.1145/146370.146380