Author: Ted Pedersen
DOI:
Keywords:
Abstract: Statistical methods for automatically identifying dependent word pairs (i.e., dependent bigrams) in a corpus of natural language text have traditionally been performed using asymptotic tests of significance. This paper suggests that Fisher's exact test is a more appropriate test due to the skewed and sparse data samples typical of this problem. Both theoretical and experimental comparisons between Fisher's exact test and a variety of asymptotic tests (the t-test, Pearson's chi-square test, and the likelihood-ratio test) are presented. These comparisons show that Fisher's exact test is more reliable in identifying dependent word pairs. The usefulness of Fisher's exact test extends to other problems in statistical natural language processing, as skewed and sparse data appears to be the rule in natural language. The experiment presented in this paper was performed using PROC FREQ of the SAS System.
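
To make the central idea concrete, the following is a minimal Python sketch of a one-sided (right-tail) Fisher's exact test on a 2x2 bigram contingency table. It is an illustration only and not the paper's procedure, which used PROC FREQ of the SAS System. The function name `fisher_exact_right`, the cell layout, and the example counts are all assumptions made for this sketch.

```python
from math import comb


def fisher_exact_right(n11, n12, n21, n22):
    """Right-tail Fisher's exact test for a 2x2 table (hypothetical helper).

    Assumed cell layout for a candidate bigram (w1, w2):
        n11 = count of (w1, w2)        n12 = count of (w1, not w2)
        n21 = count of (not w1, w2)    n22 = count of (not w1, not w2)

    Returns the probability, under independence with all marginal totals
    fixed, of observing a joint count of n11 or larger.
    """
    row1 = n11 + n12            # occurrences of w1 in first position
    col1 = n11 + n21            # occurrences of w2 in second position
    total = n11 + n12 + n21 + n22

    # Number of ways to choose which bigrams have w2 in second position.
    denom = comb(total, col1)

    # Sum the hypergeometric probabilities of every table at least as
    # extreme as the observed one, using exact integer arithmetic.
    numer = 0
    for k in range(n11, min(row1, col1) + 1):
        numer += comb(row1, k) * comb(total - row1, col1 - k)

    return numer / denom


# Made-up counts for a sparse, skewed table of the kind the abstract describes.
p_value = fisher_exact_right(3, 2, 4, 991)
print(f"right-tail p-value: {p_value:.3e}")
```

A small p-value suggests the two words co-occur more often than independence would predict. Because the probability is computed exactly from the hypergeometric distribution, the result does not rely on the large-sample approximations behind the asymptotic tests.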