Most significant substring mining based on chi-square measure

关键词: Heuristics 、 Data mining 、 String (computer science) 、 Measure (mathematics) 、 Data sequences 、 Theoretical computer science 、 Substring 、 Mathematics 、 Intrusion detection system 、 Chi-square test

摘要: Given the vast reservoirs of sequence data stored worldwide, efficient mining string databases such as intrusion detection systems, player statistics, texts, proteins, etc. has emerged a great challenge. Searching for an unusual pattern within long strings requirement diverse applications. string, problem then is to identify substrings that differs most from expected or normal behavior, i.e., are statistically significant (i.e., less likely occur due chance alone). To this end, we use chi-square measure and propose two heuristics retrieving top-k with largest measure. We show algorithms outperform other competing in runtime, while maintaining high approximation ratio more than 0.96.

uni-trier.de 本地加速

springer.com 本地加速

researchgate.net LINK 下载加速

sci-hub.st HTML 下载加速

参考文章(11)

Noel Cressie, Timothy R. C. Read, Pearson's X 2 and the Loglikelihood Ratio Statistic G 2 : A Comparative Review International Statistical Review / Revue Internationale de Statistique. ,vol. 57, pp. 19- 43 ,(1989) , 10.2307/1403582

Alain Denise, Mireille Régnier, Mathias Vandenbogaert, Assessing the Statistical Significance of Overrepresented Oligonucleotides workshop on algorithms in bioinformatics. pp. 85- 97 ,(2001) , 10.1007/3-540-44696-6_7

Sven Rahmann, Dynamic programming algorithms for two statistical problems in computational biology. workshop on algorithms in bioinformatics. ,vol. 2812, pp. 151- 164 ,(2003) , 10.1007/978-3-540-39763-2_12

Sourav Dutta, Arnab Bhattacharya, Mining Statistically Significant Substrings Based on the Chi-Square Measure arXiv: Databases. pp. 1599- 1608 ,(2010) , 10.4018/978-1-4666-3604-0.CH083

Nong Ye, Qiang Chen, An anomaly detection technique based on a chi‐square statistic for detecting intrusions into information systems Quality and Reliability Engineering International. ,vol. 17, pp. 105- 112 ,(2001) , 10.1002/QRE.392

MIREILLE RÉGNIER, MATHIAS VANDENBOGAERT, COMPARISON OF STATISTICAL SIGNIFICANCE CRITERIA Journal of Bioinformatics and Computational Biology. ,vol. 4, pp. 537- 551 ,(2006) , 10.1142/S0219720006002028

Eamonn Keogh, Stefano Lonardi, Bill 'Yuan-chi' Chiu, Finding surprising patterns in a time series database in linear time and space Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining - KDD '02. pp. 550- 556 ,(2002) , 10.1145/775047.775128

Gill Bejerano, Nir Friedman, Naftali Tishby, Efficient exact p-value computation for small sample, sparse, and surprising categorical data. Journal of Computational Biology. ,vol. 11, pp. 867- 886 ,(2004) , 10.1089/CMB.2004.11.867

Noel A. C. Cressie, Timothy R. C. Read, Goodness-of-Fit Statistics for Discrete Multivariate Data ,(1988)

10.

H. Hotelling, Multivariate Quality Control Techniques of Statistical Analysis. ,(1947)

Most significant substring mining based on chi-square measure

来源期刊

我的账户

Most significant substring mining based on chi-square measure

来源期刊

相似文章 7

Mining Statistically Significant Substrings Based on the Chi-Square Measure

MIST: Top-k Approximate Sub-string Mining Using Triplet Statistical Significance

Impact of Inadequate Urban Planning on Municipal Solid Waste Management in the Niger Delta Region of Nigeria

Mining statistically significant substrings using the chi-square statistic

Neighbor-Aware Search for Approximate Labeled Graph Matching using the Chi-Square Statistics

Quality and Effectiveness of ERP Software: Data Mining Perspective

Most significant substring mining based on chi-square measure

我的账户