Mining Statistically Significant Substrings Based on the Chi-Square Measure

作者： Sourav Dutta , Arnab Bhattacharya

关键词: Theoretical computer science 、 Substring 、 Measure (mathematics) 、 Data mining 、 Chi-square test 、 String (computer science) 、 Heuristics 、 Computer science 、 Point (typography)

摘要: Given the vast reservoirs of data stored worldwide, efficient mining from a large information store has emerged as great challenge. Many databases like that intrusion detection systems, web-click records, player statistics, texts, proteins etc., strings or sequences. Searching for an unusual pattern within such long requirement diverse applications. string, problem then is to identify substrings differs most expected normal behavior, i.e., are statistically significant. In other words, these less likely occur due chance alone and may point some interesting phenomenon warrants further exploration. To this end, we use chi-square measure. We propose two heuristics retrieving top-k with largest show algorithms outperform competing in runtime, while maintaining high approximation ratio more than 0.96.

参考文章(11)

Noel Cressie, Timothy R. C. Read, Pearson's X 2 and the Loglikelihood Ratio Statistic G 2 : A Comparative Review International Statistical Review / Revue Internationale de Statistique. ,vol. 57, pp. 19- 43 ,(1989) , 10.2307/1403582

Sourav Dutta, Arnab Bhattacharya, Most significant substring mining based on chi-square measure knowledge discovery and data mining. pp. 319- 327 ,(2010) , 10.1007/978-3-642-13657-3_35

Alain Denise, Mireille Régnier, Mathias Vandenbogaert, Assessing the Statistical Significance of Overrepresented Oligonucleotides workshop on algorithms in bioinformatics. pp. 85- 97 ,(2001) , 10.1007/3-540-44696-6_7

Sven Rahmann, Dynamic programming algorithms for two statistical problems in computational biology. workshop on algorithms in bioinformatics. ,vol. 2812, pp. 151- 164 ,(2003) , 10.1007/978-3-540-39763-2_12

Nong Ye, Qiang Chen, An anomaly detection technique based on a chi‐square statistic for detecting intrusions into information systems Quality and Reliability Engineering International. ,vol. 17, pp. 105- 112 ,(2001) , 10.1002/QRE.392

MIREILLE RÉGNIER, MATHIAS VANDENBOGAERT, COMPARISON OF STATISTICAL SIGNIFICANCE CRITERIA Journal of Bioinformatics and Computational Biology. ,vol. 4, pp. 537- 551 ,(2006) , 10.1142/S0219720006002028

D. Bianchi, B. Tirozzi, Identifying short motifs by means of extreme value analysis EPL. ,vol. 84, pp. 18001- ,(2008) , 10.1209/0295-5075/84/18001

Eamonn Keogh, Stefano Lonardi, Bill 'Yuan-chi' Chiu, Finding surprising patterns in a time series database in linear time and space Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining - KDD '02. pp. 550- 556 ,(2002) , 10.1145/775047.775128

Gill Bejerano, Nir Friedman, Naftali Tishby, Efficient exact p-value computation for small sample, sparse, and surprising categorical data. Journal of Computational Biology. ,vol. 11, pp. 867- 886 ,(2004) , 10.1089/CMB.2004.11.867

10.

Noel A. C. Cressie, Timothy R. C. Read, Goodness-of-Fit Statistics for Discrete Multivariate Data ,(1988)

Mining Statistically Significant Substrings Based on the Chi-Square Measure

来源期刊

我的账户

Mining Statistically Significant Substrings Based on the Chi-Square Measure

来源期刊

相似文章 3

Most significant substring mining based on chi-square measure

Impact of Inadequate Urban Planning on Municipal Solid Waste Management in the Niger Delta Region of Nigeria

Mining statistically significant substrings using the chi-square statistic

我的账户