Mining statistically significant substrings using the chi-square statistic

作者: Mayank Sachan , Arnab Bhattacharya

DOI: 10.14778/2336664.2336677

关键词: StatisticMathematicsRandomnessString (computer science)Set (abstract data type)StatisticsEmpirical distribution functionPearson's chi-squared testSubstring indexSubstring

摘要: The problem of identification statistically significant patterns in a sequence data has been applied to many domains such as intrusion detection systems, financial models, web-click records, automated monitoring computational biology, cryptology, and text analysis. An observed pattern events is deemed be if it unlikely have occurred due randomness or chance alone. We use the chi-square statistic quantitative measure statistical significance. Given string characters generated from memoryless Bernoulli model, identify substring for which empirical distribution single letters deviates most expected generative model. This deviation captured using measure. (MSS) thus defined having highest value. Till date, best our knowledge, there does not exist any algorithm find MSS better than O(n2) time, where n denotes length string. In this paper, we propose an substring, whose running time O(n3/2) with high probability. also study some variants finding top-t set, all substrings greater fixed threshold among given length. experimentally demonstrate asymptotic behavior on varying size alphabet size. describe applications cryptology real world finance sports. Finally, compare technique existing heuristics MSS.

参考文章(48)
Heikki Mannila, A. Inkeri Verkamo, Hannu Toivonen, Discovering Frequent Episodes in Sequences. knowledge discovery and data mining. pp. 210- 215 ,(1995)
Rohitha Goonatilake, Susantha Herath, Jayantha Herath, Suvineetha Herath, Ajantha Herath, Intrusion detection using the chi-square goodness-of-fit test for information assurance, network, forensics and software security Journal of Computing Sciences in Colleges. ,vol. 23, pp. 255- 263 ,(2007)
A. V. Goldberg, Finding a Maximum Density Subgraph University of California at Berkeley. ,(1984)
Guy Kortsarz, David Peleg, On Choosing a Dense Subgraph (Extended Abstract) foundations of computer science. pp. 692- 701 ,(1993)
Harvey Frommer, Frederic J. Frommer, Red Sox vs. Yankees: The Great Rivalry ,(2004)
Sourav Dutta, Arnab Bhattacharya, Most significant substring mining based on chi-square measure knowledge discovery and data mining. pp. 319- 327 ,(2010) , 10.1007/978-3-642-13657-3_35
Thierry Lecroq, Maxime Crochemore, Christophe Hancart, Algorithms on Strings ,(2007)
M. Atallah, R. Gwadera, W. Szpankowski, Detection of significant sets of episodes in event sequences international conference on data mining. pp. 3- 10 ,(2004) , 10.1109/ICDM.2004.10090
Alain Denise, Mireille Régnier, Mathias Vandenbogaert, Assessing the Statistical Significance of Overrepresented Oligonucleotides workshop on algorithms in bioinformatics. pp. 85- 97 ,(2001) , 10.1007/3-540-44696-6_7
Moses Charikar, Greedy approximation algorithms for finding dense components in a graph Lecture Notes in Computer Science. pp. 84- 95 ,(2000) , 10.1007/3-540-44436-X_10