MIST: Top-k Approximate Sub-string Mining Using Triplet Statistical Significance

关键词:

摘要: Efficient extraction of strings or sub-strings similar to an input query string forms a necessity in applications like instant search, record linkage, etc., where the similarity between two is usually quantified by edit distance. This paper proposes novel top-k approximate sub-string matching algorithm, MIST, for given query, based on Chi-squared statistical significance triplets, thereby avoiding expensive distance computation. Experiments with real-life data validate run-time effectiveness and accuracy our algorithm.

uni-trier.de 本地加速

参考文章(22)

Silviu Cucerzan, Eric Brill, Spelling Correction as an Iterative Process that Exploits the Collective Knowledge of Web Users. empirical methods in natural language processing. pp. 293- 300 ,(2004)

Dandy Fenz, Dustin Lange, Astrid Rheinländer, Felix Naumann, Ulf Leser, Efficient Similarity Search in Very Large String Sets Lecture Notes in Computer Science. pp. 262- 279 ,(2012) , 10.1007/978-3-642-31235-9_18

Berthier Ribeiro-Neto, Ricardo Baeza-Yates, Modern Information Retrieval: The Concepts and Technology Behind Search ,(2011)

Sourav Dutta, Arnab Bhattacharya, Most significant substring mining based on chi-square measure knowledge discovery and data mining. pp. 319- 327 ,(2010) , 10.1007/978-3-642-13657-3_35

Masaru Kitsuregawa, Zhenglu Yang, Jianjun Yu, Fast algorithms for top-k approximate string matching national conference on artificial intelligence. pp. 1467- 1473 ,(2010)

Scientific and Statistical Database Management Lecture Notes in Computer Science. ,vol. 5566, ,(2009) , 10.1007/978-3-642-02279-1

V. I. Levenshtein, Binary codes capable of correcting deletions, insertions, and reversals Soviet physics. Doklady. ,vol. 10, pp. 707- 710 ,(1966)

Eamonn Keogh, Stefano Lonardi, Bill 'Yuan-chi' Chiu, Finding surprising patterns in a time series database in linear time and space Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining - KDD '02. pp. 550- 556 ,(2002) , 10.1145/775047.775128

Gonzalo Navarro, A guided tour to approximate string matching ACM Computing Surveys. ,vol. 33, pp. 31- 88 ,(2001) , 10.1145/375360.375365

10.

Karen Kukich, Techniques for automatically correcting words in text ACM Computing Surveys. ,vol. 24, pp. 377- 439 ,(1992) , 10.1145/146370.146380

MIST: Top-k Approximate Sub-string Mining Using Triplet Statistical Significance

来源期刊

我的账户

MIST: Top-k Approximate Sub-string Mining Using Triplet Statistical Significance

来源期刊

相似文章 1

Neighbor-Aware Search for Approximate Labeled Graph Matching using the Chi-Square Statistics

我的账户