Authors: Sourav Dutta, Arnab Bhattacharya
DOI: 10.4018/978-1-4666-3604-0.CH083
Keywords: Theoretical computer science, Substring, Measure (mathematics), Data mining, Chi-square test, String (computer science), Heuristics, Computer science, Point (typography)
Abstract: Given the vast reservoirs of data stored worldwide, efficient mining of data from a large information store has emerged as a great challenge. Many databases, such as those of intrusion detection systems, web-click records, player statistics, texts, and proteins, store strings or sequences. Searching for an unusual pattern within such long strings of data has emerged as a requirement for diverse applications. Given a string, the problem then is to identify the substrings that differ the most from the expected or normal behavior, i.e., the substrings that are statistically significant. In other words, these substrings are less likely to occur due to chance alone and may point to some interesting phenomenon that warrants further exploration. To this end, we use the chi-square measure. We propose two heuristics for retrieving the top-k substrings with the largest chi-square measure. We show that the algorithms outperform other competing algorithms in runtime, while maintaining a high approximation ratio of more than 0.96.
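To make the problem statement concrete, the following is a minimal sketch of an exhaustive baseline, not the two heuristics proposed by the authors. It assumes a null model in which each character is drawn independently with a fixed probability, and it scores a substring with Pearson's chi-square statistic, the sum over symbols of (observed - expected)^2 / expected. The function names `chi_square` and `top_k_substrings`, the `probs` dictionary, and the example string are all hypothetical and chosen only for illustration.

```python
import heapq
from collections import Counter


def chi_square(substring, probs):
    """Pearson chi-square of a substring against an assumed i.i.d. null model.

    probs maps each alphabet symbol to its assumed probability (all > 0).
    """
    length = len(substring)
    counts = Counter(substring)
    return sum(
        (counts.get(symbol, 0) - length * p) ** 2 / (length * p)
        for symbol, p in probs.items()
    )


def top_k_substrings(text, probs, k):
    """Exhaustive O(n^2 * |alphabet|) scan over all substrings.

    Assumes every character of text is a key of probs.
    Returns up to k tuples (score, start, end), best first, where the
    substring is text[start:end].
    """
    symbols = list(probs)
    heap = []  # min-heap holding the k best (score, start, end) seen so far
    for start in range(len(text)):
        counts = dict.fromkeys(symbols, 0)
        for end in range(start, len(text)):
            counts[text[end]] += 1  # extend the window by one character
            length = end - start + 1
            score = sum(
                (counts[s] - length * probs[s]) ** 2 / (length * probs[s])
                for s in symbols
            )
            item = (score, start, end + 1)
            if len(heap) < k:
                heapq.heappush(heap, item)
            elif item > heap[0]:
                heapq.heapreplace(heap, item)
    return sorted(heap, reverse=True)


if __name__ == "__main__":
    uniform = {"a": 0.5, "b": 0.5}  # assumed null model for the example
    for score, start, end in top_k_substrings("abaaaab", uniform, 3):
        print(f"{score:.2f}  text[{start}:{end}] = {'abaaaab'[start:end]!r}")
```

Under the assumed uniform model, the run of four consecutive `a` characters is the most surprising substring of `"abaaaab"`. A quadratic scan of this kind becomes expensive on long strings, which is the runtime cost the abstract says the proposed heuristics reduce.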