Using Word Frequency Lists to Measure Corpus Homogeneity and Similarity between Corpora

Author: Adam Kilgarriff

Abstract: How similar are two corpora? A measure of corpus similarity would be very useful for lexicography and language engineering. Word frequency lists are cheap and easy to generate, so a measure based on them would be of use as a quick guide in many circumstances; for example, to judge how a newly available corpus is related to existing resources, or how easy it might be to port an NLP system designed to work with one text type to work with another. We show that corpus similarity can only be interpreted in the light of corpus homogeneity. The paper presents a measure, based on the χ² statistic, for measuring both corpus similarity and corpus homogeneity. The measure is compared with a rank-based measure and shown to outperform it. Some results are presented. A method for evaluating the accuracy of the measure is introduced, and some results of using the measure are presented.
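As a rough illustration of the kind of measure the abstract describes, the sketch below computes a χ² statistic over the most frequent words of two tokenized corpora and reuses the same statistic on two halves of one corpus as a homogeneity score. The function names, the `top_n` cutoff, the use of pooled frequencies to derive expected counts, and the alternating-document split are all assumptions made for this example; the record above does not specify the paper's exact procedure.

```python
from collections import Counter


def chi_square_distance(tokens_a, tokens_b, top_n=500):
    """Chi-square statistic over the top_n most frequent words of the pooled corpora.

    Hypothetical helper: lower values suggest more similar word frequency lists.
    """
    if not tokens_a or not tokens_b:
        raise ValueError("both corpora must contain at least one token")
    freq_a, freq_b = Counter(tokens_a), Counter(tokens_b)
    size_a, size_b = len(tokens_a), len(tokens_b)
    total = size_a + size_b
    pooled = freq_a + freq_b
    stat = 0.0
    for word, joint in pooled.most_common(top_n):
        # Expected counts if the word were spread in proportion to corpus size.
        exp_a = joint * size_a / total
        exp_b = joint * size_b / total
        stat += (freq_a[word] - exp_a) ** 2 / exp_a
        stat += (freq_b[word] - exp_b) ** 2 / exp_b
    return stat


def homogeneity(documents, top_n=500):
    """Within-corpus score: compare two halves built from alternating documents.

    Assumption for this sketch: `documents` is a list of token lists.
    """
    half_a = [tok for doc in documents[0::2] for tok in doc]
    half_b = [tok for doc in documents[1::2] for tok in doc]
    return chi_square_distance(half_a, half_b, top_n)
```

Under these assumptions, a between-corpus score is only informative when read against the within-corpus scores of each corpus, which is the point the abstract makes about interpreting similarity in the light of homogeneity.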
