Author: Jefrey Lijffijt
DOI: 10.1007/978-3-642-40988-2_25
Keywords:
Abstract: We consider the problem of mining subsequences with surprising event counts. When mining patterns, we often test a very large number of potentially present patterns, leading to a high likelihood of finding spurious results. Typically, this problem grows as the size of the data increases. Existing methods for statistical testing are not usable for mining patterns in big data, because they are either computationally too demanding, or fail to take into account the dependency structure between patterns, leading to true findings going unnoticed. We propose a new method to compute the significance of event frequencies in subsequences of a long data sequence. The method is based on analyzing the joint distribution of the patterns, omitting the need for randomization. We argue that computing the p-values exactly is computationally costly, but that an upper bound is easy to compute. We investigate the tightness of the upper bound and compare the power of the test with the alternative of post-hoc correction. We demonstrate the utility of the method on two types of data: text and DNA. We show that the proposed method is easy to implement and can be computed quickly. Moreover, we conclude that the upper bound is sufficiently tight and that meaningful results can be obtained in practice.