Fishing for Exactness

Author: Ted Pedersen

Abstract: Statistical methods for automatically identifying dependent word pairs (i.e. dependent bigrams) in a corpus of natural language text have traditionally been performed using asymptotic tests of significance. This paper suggests that Fisher's exact test is a more appropriate test due to the skewed and sparse data samples typical of this problem. Both theoretical and experimental comparisons between Fisher's exact test and a variety of asymptotic tests (the t-test, Pearson's chi-square test, and the likelihood-ratio chi-square test) are presented. These comparisons show that Fisher's exact test is more reliable in identifying dependent word pairs. The usefulness of Fisher's exact test extends to other problems in statistical natural language processing, as skewed and sparse data appears to be the rule in natural language. The experiment presented in this paper was performed using PROC FREQ of the SAS System.
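The comparison described in the abstract can be illustrated with a minimal sketch, assuming a bigram is summarized as a 2x2 contingency table (n11 = both words together, n12 = first word without the second, n21 = second word without the first, n22 = neither). The function names and the example counts below are hypothetical and are not taken from the paper, which used PROC FREQ of the SAS System rather than Python. The sketch computes a one-sided Fisher's exact p-value from the hypergeometric distribution alongside the Pearson chi-square and likelihood-ratio statistics.

```python
from math import comb, log


def fisher_right_tail(n11, n12, n21, n22):
    """One-sided (right-tail) Fisher's exact p-value for a 2x2 table.

    Sums the hypergeometric probabilities of every table with the same
    margins whose top-left cell is at least as large as the observed n11.
    """
    row1 = n11 + n12
    col1 = n11 + n21
    n = n11 + n12 + n21 + n22
    upper = min(row1, col1)
    numerator = sum(
        comb(row1, k) * comb(n - row1, col1 - k) for k in range(n11, upper + 1)
    )
    return numerator / comb(n, col1)


def expected_counts(n11, n12, n21, n22):
    """Expected cell counts under independence, in the order n11, n12, n21, n22."""
    n = n11 + n12 + n21 + n22
    r1, r2 = n11 + n12, n21 + n22
    c1, c2 = n11 + n21, n12 + n22
    return [r1 * c1 / n, r1 * c2 / n, r2 * c1 / n, r2 * c2 / n]


def pearson_chi2(n11, n12, n21, n22):
    """Pearson's chi-square statistic (asymptotic test)."""
    obs = [n11, n12, n21, n22]
    exp = expected_counts(n11, n12, n21, n22)
    return sum((o - e) ** 2 / e for o, e in zip(obs, exp))


def likelihood_ratio_g2(n11, n12, n21, n22):
    """Likelihood-ratio chi-square statistic G^2 (asymptotic test)."""
    obs = [n11, n12, n21, n22]
    exp = expected_counts(n11, n12, n21, n22)
    # Cells with zero observed count contribute nothing to the sum.
    return 2 * sum(o * log(o / e) for o, e in zip(obs, exp) if o > 0)


if __name__ == "__main__":
    # Hypothetical sparse bigram table: the pair co-occurs 3 times in 1000 bigrams.
    table = (3, 2, 1, 994)
    print("Fisher right-tail p:", fisher_right_tail(*table))
    print("Pearson chi-square :", pearson_chi2(*table))
    print("Likelihood-ratio G2:", likelihood_ratio_g2(*table))
```

The exact test needs no large-sample approximation, which is why it is attractive for tables with very small cell counts. The two asymptotic statistics would still have to be compared against a chi-square distribution with one degree of freedom, and that approximation is the step that becomes unreliable for skewed, sparse tables.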

References (5)
K. W. Church, "Using Statistics in Lexical Analysis", in Lexical Acquisition: Exploiting On-line Resources to Build a Lexicon, pp. 115-164 (1991)
Ronald Fisher, The Design of Experiments (1935)
Mitch Marcus, Beatrice Santorini, Mary Ann Marcinkiewicz, "Building a Large Annotated Corpus of English: The Penn Treebank", Computational Linguistics, vol. 19, pp. 313-330 (1993), doi:10.21236/ADA273556
Ted Dunning, "Accurate Methods for the Statistics of Surprise and Coincidence", Computational Linguistics, vol. 19, pp. 61-74 (1993)
G. K. Zipf, Miles A. Tinker, The Psycho-Biology of Language (1935)