A Fast and Simple Method for Mining Subsequences with Surprising Event Counts

Author: Jefrey Lijffijt

DOI: 10.1007/978-3-642-40988-2_25

Abstract: We consider the problem of mining subsequences with surprising event counts. When mining patterns, we often test a very large number of potentially present patterns, leading to a high likelihood of finding spurious results. Typically, this problem grows as the size of the data increases. Existing methods for statistical testing are not usable for mining patterns in big data, because they are either computationally too demanding, or fail to take into account the dependency structure between patterns, leading to true findings going unnoticed. We propose a new method to compute the significance of event frequencies in subsequences of a long data sequence. The method is based on analyzing the joint distribution of the patterns, omitting the need for randomization. We argue that computing the p-values exactly is computationally costly, but that an upper bound is easy to compute. We investigate the tightness of the upper bound and compare the power of the test with the alternative of post-hoc correction. We demonstrate the utility of the method on two types of data: text and DNA. We show that the proposed method is easy to implement and can be computed quickly. Moreover, we conclude that the upper bound is sufficiently tight and that meaningful results can be obtained in practice.
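To make the setting concrete, the following is a minimal sketch of the post-hoc correction baseline that the abstract names as the alternative, not the proposed method itself. It assumes events occur independently with a fixed rate estimated from the whole sequence (a binomial model), scores every sliding window with an upper-tail p-value, and applies a Bonferroni correction over the number of windows. The function names `binomial_upper_tail` and `surprising_windows`, the window length, and the threshold `alpha` are all illustrative assumptions and do not come from the paper.

```python
from math import comb


def binomial_upper_tail(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def surprising_windows(seq, event, window, alpha=0.05):
    """Find windows whose count of `event` is surprisingly high.

    Assumption: events are independent with rate p estimated from `seq`.
    Each window is tested separately and the p-value is multiplied by the
    number of windows (Bonferroni), which ignores the overlap between
    neighbouring windows and is therefore conservative.
    """
    n = len(seq)
    p = seq.count(event) / n
    num_tests = n - window + 1
    results = []
    # Count for the first window, then update incrementally while sliding.
    count = sum(1 for x in seq[:window] if x == event)
    for start in range(num_tests):
        if start > 0:
            count += (seq[start + window - 1] == event) - (seq[start - 1] == event)
        raw = binomial_upper_tail(count, window, p)
        adjusted = min(1.0, raw * num_tests)
        if adjusted <= alpha:
            results.append((start, count, adjusted))
    return results


if __name__ == "__main__":
    # Toy sequence: the event "a" is concentrated at the end.
    toy = "b" * 90 + "a" * 10
    for start, count, adj_p in surprising_windows(toy, "a", window=10):
        print(start, count, adj_p)
```

Because overlapping windows share most of their events, their test statistics are strongly dependent, and a correction of this kind can discard true findings. This is the loss of power that the abstract attributes to methods ignoring the dependency structure between patterns.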

References (18)
Sami Hanhijärvi. Multiple hypothesis testing in pattern discovery. Discovery Science, pp. 122-134 (2011). DOI: 10.1007/978-3-642-24477-3_12
Jefrey Lijffijt, Panagiotis Papapetrou, Kai Puolamäki. Size Matters: Finding the Most Informative Set of Window Lengths. Machine Learning and Knowledge Discovery in Databases, vol. 7524, pp. 451-466 (2012). DOI: 10.1007/978-3-642-33486-3_29
Heikki Mannila. Local and Global Methods in Data Mining: Basic Techniques and Open Problems. International Colloquium on Automata, Languages and Programming, pp. 57-68 (2002). DOI: 10.1007/3-540-45465-9_6
Aristides Gionis, Heikki Mannila, Taneli Mielikäinen, Panayiotis Tsaparas. Assessing data mining results via swap randomization. ACM Transactions on Knowledge Discovery from Data, vol. 1, pp. 14- (2007). DOI: 10.1145/1297332.1297338
Jefrey Lijffijt, Panagiotis Papapetrou, Kai Puolamäki. A statistical significance testing approach to mining the most informative set of patterns. Data Mining and Knowledge Discovery, vol. 28, pp. 238-263 (2014). DOI: 10.1007/S10618-012-0298-2
Tijl De Bie. Maximum entropy models and subjective interestingness: an application to tiles in binary databases. Data Mining and Knowledge Discovery, vol. 23, pp. 407-446 (2011). DOI: 10.1007/S10618-010-0209-3
Eduardo G. Altmann, Janet B. Pierrehumbert, Adilson E. Motter. Beyond Word Frequency: Bursts, Lulls, and Scaling in the Temporal Distributions of Words. PLoS ONE, vol. 4, pp. e7678- (2009). DOI: 10.1371/JOURNAL.PONE.0007678
Sanat K. Sarkar, Chung-Kuei Chang. The Simes Method for Multiple Hypothesis Testing with Positively Dependent Test Statistics. Journal of the American Statistical Association, vol. 92, pp. 1601-1608 (1997). DOI: 10.1080/01621459.1997.10473682