Significance testing of word frequencies in corpora

Authors: Jefrey Lijffijt, Terttu Nevalainen, Tanja Säily, Panagiotis Papapetrou, Kai Puolamäki

DOI: 10.1093/LLC/FQU064

Keywords: Text corpus; Context (language use); Wilcoxon signed-rank test; Pragmatics; British National Corpus; Natural language processing; Corpus linguistics; Word lists by frequency; Theoretical linguistics; Artificial intelligence; Computer science; Linguistics

Abstract: … Because the bias that we observe does not solely depend on word frequency, we cannot simply use higher cut-off values in the χ² or log-likelihood ratio tests to correct the bias. Notably…

References (48)
Jefrey Lijffijt, A Fast and Simple Method for Mining Subsequences with Surprising Event Counts. European Conference on Machine Learning, vol. 8188, pp. 385-400 (2013). DOI: 10.1007/978-3-642-40988-2_25
G. Berry, P. Armitage, Mid-P confidence intervals: a brief review. The Statistician, vol. 44, pp. 417-423 (1995). DOI: 10.2307/2348891
S Evert, S Hoffmann, N Smith, Y Berglund-Prytz, D Lee, Corpus Linguistics with BNCweb - a Practical Guide (2008)
Frank Wilcoxon, Individual Comparisons by Ranking Methods. Springer Series in Statistics, vol. 1, pp. 196-202 (1992). DOI: 10.1007/978-1-4612-4380-9_16
Mike Scott, Wordsmith Tools version 3. Oxford University Press (1999)
Pierre L'Ecuyer, Richard Simard, TestU01: A C library for empirical testing of random number generators. ACM Transactions on Mathematical Software, vol. 33, pp. 22- (2007). DOI: 10.1145/1268776.1268777
Sandrine Dudoit, Juliet Popper Shaffer, Jennifer C. Block, Multiple Hypothesis Testing in Microarray Experiments. Statistical Science, vol. 18, pp. 71-103 (2003). DOI: 10.1214/SS/1056397487
Frances Rock, Policy and practice in the anonymisation of linguistic data. International Journal of Corpus Linguistics, vol. 6, pp. 1-26 (2001). DOI: 10.1075/IJCL.6.1.01ROC