Mining Statistically Significant Substrings Based on the Chi-Square Measure

作者: Sourav Dutta , Arnab Bhattacharya

DOI: 10.4018/978-1-4666-3604-0.CH083

关键词: Theoretical computer scienceSubstringMeasure (mathematics)Data miningChi-square testString (computer science)HeuristicsComputer sciencePoint (typography)

摘要: Given the vast reservoirs of data stored worldwide, efficient mining from a large information store has emerged as great challenge. Many databases like that intrusion detection systems, web-click records, player statistics, texts, proteins etc., strings or sequences. Searching for an unusual pattern within such long requirement diverse applications. string, problem then is to identify substrings differs most expected normal behavior, i.e., are statistically significant. In other words, these less likely occur due chance alone and may point some interesting phenomenon warrants further exploration. To this end, we use chi-square measure. We propose two heuristics retrieving top-k with largest show algorithms outperform competing in runtime, while maintaining high approximation ratio more than 0.96.

参考文章(11)
Noel Cressie, Timothy R. C. Read, Pearson's X 2 and the Loglikelihood Ratio Statistic G 2 : A Comparative Review International Statistical Review / Revue Internationale de Statistique. ,vol. 57, pp. 19- 43 ,(1989) , 10.2307/1403582
Sourav Dutta, Arnab Bhattacharya, Most significant substring mining based on chi-square measure knowledge discovery and data mining. pp. 319- 327 ,(2010) , 10.1007/978-3-642-13657-3_35
Alain Denise, Mireille Régnier, Mathias Vandenbogaert, Assessing the Statistical Significance of Overrepresented Oligonucleotides workshop on algorithms in bioinformatics. pp. 85- 97 ,(2001) , 10.1007/3-540-44696-6_7
Sven Rahmann, Dynamic programming algorithms for two statistical problems in computational biology. workshop on algorithms in bioinformatics. ,vol. 2812, pp. 151- 164 ,(2003) , 10.1007/978-3-540-39763-2_12
Nong Ye, Qiang Chen, An anomaly detection technique based on a chi‐square statistic for detecting intrusions into information systems Quality and Reliability Engineering International. ,vol. 17, pp. 105- 112 ,(2001) , 10.1002/QRE.392
MIREILLE RÉGNIER, MATHIAS VANDENBOGAERT, COMPARISON OF STATISTICAL SIGNIFICANCE CRITERIA Journal of Bioinformatics and Computational Biology. ,vol. 4, pp. 537- 551 ,(2006) , 10.1142/S0219720006002028
D. Bianchi, B. Tirozzi, Identifying short motifs by means of extreme value analysis EPL. ,vol. 84, pp. 18001- ,(2008) , 10.1209/0295-5075/84/18001
Eamonn Keogh, Stefano Lonardi, Bill 'Yuan-chi' Chiu, Finding surprising patterns in a time series database in linear time and space Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining - KDD '02. pp. 550- 556 ,(2002) , 10.1145/775047.775128
Gill Bejerano, Nir Friedman, Naftali Tishby, Efficient exact p-value computation for small sample, sparse, and surprising categorical data. Journal of Computational Biology. ,vol. 11, pp. 867- 886 ,(2004) , 10.1089/CMB.2004.11.867
Noel A. C. Cressie, Timothy R. C. Read, Goodness-of-Fit Statistics for Discrete Multivariate Data ,(1988)