Most significant substring mining based on chi-square measure

作者: Sourav Dutta , Arnab Bhattacharya

DOI: 10.1007/978-3-642-13657-3_35

关键词: HeuristicsData miningString (computer science)Measure (mathematics)Data sequencesTheoretical computer scienceSubstringMathematicsIntrusion detection systemChi-square test

摘要: Given the vast reservoirs of sequence data stored worldwide, efficient mining string databases such as intrusion detection systems, player statistics, texts, proteins, etc. has emerged a great challenge. Searching for an unusual pattern within long strings requirement diverse applications. string, problem then is to identify substrings that differs most from expected or normal behavior, i.e., are statistically significant (i.e., less likely occur due chance alone). To this end, we use chi-square measure and propose two heuristics retrieving top-k with largest measure. We show algorithms outperform other competing in runtime, while maintaining high approximation ratio more than 0.96.

参考文章(11)
Noel Cressie, Timothy R. C. Read, Pearson's X 2 and the Loglikelihood Ratio Statistic G 2 : A Comparative Review International Statistical Review / Revue Internationale de Statistique. ,vol. 57, pp. 19- 43 ,(1989) , 10.2307/1403582
Alain Denise, Mireille Régnier, Mathias Vandenbogaert, Assessing the Statistical Significance of Overrepresented Oligonucleotides workshop on algorithms in bioinformatics. pp. 85- 97 ,(2001) , 10.1007/3-540-44696-6_7
Sven Rahmann, Dynamic programming algorithms for two statistical problems in computational biology. workshop on algorithms in bioinformatics. ,vol. 2812, pp. 151- 164 ,(2003) , 10.1007/978-3-540-39763-2_12
Sourav Dutta, Arnab Bhattacharya, Mining Statistically Significant Substrings Based on the Chi-Square Measure arXiv: Databases. pp. 1599- 1608 ,(2010) , 10.4018/978-1-4666-3604-0.CH083
Nong Ye, Qiang Chen, An anomaly detection technique based on a chi‐square statistic for detecting intrusions into information systems Quality and Reliability Engineering International. ,vol. 17, pp. 105- 112 ,(2001) , 10.1002/QRE.392
MIREILLE RÉGNIER, MATHIAS VANDENBOGAERT, COMPARISON OF STATISTICAL SIGNIFICANCE CRITERIA Journal of Bioinformatics and Computational Biology. ,vol. 4, pp. 537- 551 ,(2006) , 10.1142/S0219720006002028
Eamonn Keogh, Stefano Lonardi, Bill 'Yuan-chi' Chiu, Finding surprising patterns in a time series database in linear time and space Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining - KDD '02. pp. 550- 556 ,(2002) , 10.1145/775047.775128
Gill Bejerano, Nir Friedman, Naftali Tishby, Efficient exact p-value computation for small sample, sparse, and surprising categorical data. Journal of Computational Biology. ,vol. 11, pp. 867- 886 ,(2004) , 10.1089/CMB.2004.11.867
Noel A. C. Cressie, Timothy R. C. Read, Goodness-of-Fit Statistics for Discrete Multivariate Data ,(1988)
H. Hotelling, Multivariate Quality Control Techniques of Statistical Analysis. ,(1947)