Mining statistically significant substrings using the chi-square statistic

作者： Mayank Sachan , Arnab Bhattacharya

关键词: Statistic 、 Mathematics 、 Randomness 、 String (computer science) 、 Set (abstract data type) 、 Statistics 、 Empirical distribution function 、 Pearson's chi-squared test 、 Substring index 、 Substring

摘要: The problem of identification statistically significant patterns in a sequence data has been applied to many domains such as intrusion detection systems, financial models, web-click records, automated monitoring computational biology, cryptology, and text analysis. An observed pattern events is deemed be if it unlikely have occurred due randomness or chance alone. We use the chi-square statistic quantitative measure statistical significance. Given string characters generated from memoryless Bernoulli model, identify substring for which empirical distribution single letters deviates most expected generative model. This deviation captured using measure. (MSS) thus defined having highest value. Till date, best our knowledge, there does not exist any algorithm find MSS better than O(n2) time, where n denotes length string. In this paper, we propose an substring, whose running time O(n3/2) with high probability. also study some variants finding top-t set, all substrings greater fixed threshold among given length. experimentally demonstrate asymptotic behavior on varying size alphabet size. describe applications cryptology real world finance sports. Finally, compare technique existing heuristics MSS.

参考文章(48)

Heikki Mannila, A. Inkeri Verkamo, Hannu Toivonen, Discovering Frequent Episodes in Sequences. knowledge discovery and data mining. pp. 210- 215 ,(1995)

Rohitha Goonatilake, Susantha Herath, Jayantha Herath, Suvineetha Herath, Ajantha Herath, Intrusion detection using the chi-square goodness-of-fit test for information assurance, network, forensics and software security Journal of Computing Sciences in Colleges. ,vol. 23, pp. 255- 263 ,(2007)

A. V. Goldberg, Finding a Maximum Density Subgraph University of California at Berkeley. ,(1984)

Guy Kortsarz, David Peleg, On Choosing a Dense Subgraph (Extended Abstract) foundations of computer science. pp. 692- 701 ,(1993)

Harvey Frommer, Frederic J. Frommer, Red Sox vs. Yankees: The Great Rivalry ,(2004)

Sourav Dutta, Arnab Bhattacharya, Most significant substring mining based on chi-square measure knowledge discovery and data mining. pp. 319- 327 ,(2010) , 10.1007/978-3-642-13657-3_35

Thierry Lecroq, Maxime Crochemore, Christophe Hancart, Algorithms on Strings ,(2007)

M. Atallah, R. Gwadera, W. Szpankowski, Detection of significant sets of episodes in event sequences international conference on data mining. pp. 3- 10 ,(2004) , 10.1109/ICDM.2004.10090

Alain Denise, Mireille Régnier, Mathias Vandenbogaert, Assessing the Statistical Significance of Overrepresented Oligonucleotides workshop on algorithms in bioinformatics. pp. 85- 97 ,(2001) , 10.1007/3-540-44696-6_7

10.

Moses Charikar, Greedy approximation algorithms for finding dense components in a graph Lecture Notes in Computer Science. pp. 84- 95 ,(2000) , 10.1007/3-540-44436-X_10

Mining statistically significant substrings using the chi-square statistic

来源期刊

我的账户

Mining statistically significant substrings using the chi-square statistic

来源期刊

相似文章 7

Mining statistically significant connected subgraphs in vertex labeled graphs

Oracle Workload Intelligence

Impact of Inadequate Urban Planning on Municipal Solid Waste Management in the Niger Delta Region of Nigeria

Finding Surprisingly Frequent Patterns of Variable Lengths in Sequence Data

Neighbor-Aware Search for Approximate Labeled Graph Matching using the Chi-Square Statistics

RAQ: Relationship-Aware Graph Querying in Large Networks

RAQ: Relationship-Aware Graph Querying in Large Networks.

我的账户