Authors: Sourav Dutta, Arnab Bhattacharya
DOI: 10.1007/978-3-642-13657-3_35
Keywords: Heuristics, Data mining, String (computer science), Measure (mathematics), Data sequences, Theoretical computer science, Substring, Mathematics, Intrusion detection system, Chi-square test
Abstract: Given the vast reservoirs of sequence data stored worldwide, efficient mining of string databases such as intrusion detection systems, player statistics, texts, proteins, etc. has emerged as a great challenge. Searching for an unusual pattern within long strings of data is a requirement for diverse applications. Given a string, the problem then is to identify the substrings that differ most from the expected or normal behavior, i.e., the substrings that are statistically significant (less likely to occur due to chance alone). To this end, we use the chi-square measure and propose two heuristics for retrieving the top-k substrings with the largest chi-square measure. We show that the algorithms outperform other competing algorithms in runtime, while maintaining a high approximation ratio of more than 0.96.
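
The problem statement in the abstract can be illustrated with a minimal sketch. It is not the authors' method: the abstract does not define the measure or the two heuristics, so the code below assumes the standard Pearson chi-square statistic, X² = Σ (O_c − E_c)² / E_c, where O_c is the observed count of character c in a substring of length L and E_c = L · p_c under an assumed null model with fixed character probabilities p_c. The function names `chi_square` and `top_k_substrings` and the `probs` parameter are hypothetical. The search is an exhaustive baseline that scores every substring, which is the kind of cost a heuristic would aim to avoid.

```python
from collections import Counter
import heapq


def chi_square(substring, probs):
    """Pearson chi-square of a substring's character counts.

    `probs` maps each alphabet character to its assumed null-model
    probability (an assumption of this sketch, not taken from the paper).
    """
    length = len(substring)
    counts = Counter(substring)
    return sum(
        (counts.get(ch, 0) - length * p) ** 2 / (length * p)
        for ch, p in probs.items()
    )


def top_k_substrings(text, probs, k):
    """Exhaustive baseline: score all substrings, return the k largest.

    Returns (score, start, end) tuples with `end` exclusive. Every
    character of `text` must appear as a key in `probs`.
    """
    alphabet = list(probs)
    scored = []
    for start in range(len(text)):
        # Running counts let each extension of the substring reuse prior work.
        counts = dict.fromkeys(alphabet, 0)
        for end in range(start, len(text)):
            counts[text[end]] += 1
            length = end - start + 1
            score = sum(
                (counts[ch] - length * probs[ch]) ** 2 / (length * probs[ch])
                for ch in alphabet
            )
            scored.append((score, start, end + 1))
    return heapq.nlargest(k, scored)


if __name__ == "__main__":
    uniform = {"a": 0.5, "b": 0.5}
    for score, start, end in top_k_substrings("abaaaab", uniform, 3):
        print(f"{score:.3f}  [{start}:{end}]  {'abaaaab'[start:end]}")
```

Under a uniform two-letter model, a run of identical characters deviates most from the expected counts, so the run `aaaa` in `abaaaab` ranks first. The baseline scores on the order of n² substrings for a string of length n, which is impractical for long sequences.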