A Fast and Simple Method for Mining Subsequences with Surprising Event Counts


Abstract: We consider the problem of mining subsequences with surprising event counts. When mining patterns, we often test a very large number of potentially present patterns, leading to a high likelihood of finding spurious results. Typically, this problem grows as the size of the data increases. Existing methods for statistical testing are not usable for mining patterns in big data, because they are either computationally too demanding, or fail to take into account the dependency structure between patterns, leading to true findings going unnoticed. We propose a new method to compute the significance of event frequencies in subsequences of a long data sequence. The method is based on analyzing the joint distribution of the patterns, omitting the need for randomization. We argue that computing the p-values exactly is computationally costly, but that an upper bound is easy to compute. We investigate the tightness of the upper bound and compare the power of the test with the alternative of post-hoc correction. We demonstrate the utility of the method on two types of data: text and DNA. We show that the proposed method is easy to implement and can be computed quickly. Moreover, we conclude that the upper bound is sufficiently tight and that meaningful results can be obtained in practice.
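The abstract does not state how the upper bound is constructed, so the following is only a generic sketch of how a cheap upper bound on a p-value can arise when many overlapping subsequences are tested at once. It assumes a null model of independent Bernoulli events with a known rate `p`, fixed-length sliding windows, and a plain union bound over windows; none of these assumptions, and none of the function names, come from the paper.

```python
from math import comb


def binom_upper_tail(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    if k <= 0:
        return 1.0
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def window_counts(events, w):
    """Event counts of every length-w window of a 0/1 sequence (sliding sum)."""
    count = sum(events[:w])
    counts = [count]
    for i in range(w, len(events)):
        count += events[i] - events[i - w]
        counts.append(count)
    return counts


def pvalue_upper_bound(events, w, p):
    """Upper-bound the p-value of the largest window count.

    Assumption (not from the paper): under the null, events are i.i.d.
    Bernoulli(p). The union bound then gives
    P(max count >= k) <= (number of windows) * P(Binomial(w, p) >= k).
    """
    counts = window_counts(events, w)
    k_max = max(counts)
    bound = min(1.0, len(counts) * binom_upper_tail(k_max, w, p))
    return k_max, bound


# Hypothetical example: a burst of 8 events inside an otherwise empty sequence.
events = [0] * 20 + [1] * 8 + [0] * 20
k_max, bound = pvalue_upper_bound(events, w=8, p=0.1)
print(k_max, bound)
```

The bound needs one pass over the sequence and a single tail computation, with no randomization. It ignores the overlap between windows, so it is loose whenever neighbouring windows are strongly dependent.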

