作者: Mayank Sachan , Arnab Bhattacharya
关键词: Statistic 、 Mathematics 、 Randomness 、 String (computer science) 、 Set (abstract data type) 、 Statistics 、 Empirical distribution function 、 Pearson's chi-squared test 、 Substring index 、 Substring
摘要: The problem of identification statistically significant patterns in a sequence data has been applied to many domains such as intrusion detection systems, financial models, web-click records, automated monitoring computational biology, cryptology, and text analysis. An observed pattern events is deemed be if it unlikely have occurred due randomness or chance alone. We use the chi-square statistic quantitative measure statistical significance. Given string characters generated from memoryless Bernoulli model, identify substring for which empirical distribution single letters deviates most expected generative model. This deviation captured using measure. (MSS) thus defined having highest value. Till date, best our knowledge, there does not exist any algorithm find MSS better than O(n2) time, where n denotes length string. In this paper, we propose an substring, whose running time O(n3/2) with high probability. also study some variants finding top-t set, all substrings greater fixed threshold among given length. experimentally demonstrate asymptotic behavior on varying size alphabet size. describe applications cryptology real world finance sports. Finally, compare technique existing heuristics MSS.