Using Word Frequency Lists to Measure Corpus Homogeneity and Similarity between Corpora

Author: Adam Kilgarriff

Abstract: How similar are two corpora? A measure of corpus similarity would be very useful for lexicography and language engineering. Word frequency lists are cheap and easy to generate, so a measure based on them would be of use as a quick guide in many circumstances; for example, to judge how a newly available corpus is related to existing resources, or how easy it might be to port an NLP system designed to work with one text type to work with another. We show that corpus similarity can only be interpreted in the light of corpus homogeneity. The paper presents a measure, based on the χ² statistic, for measuring both corpus similarity and corpus homogeneity. The measure is compared with a rank-based measure and shown to outperform it. Some results are presented. A method for evaluating the accuracy of the measure is introduced, and some results of using it are presented.
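The abstract does not spell out the computation, so the following is only a minimal sketch of how a χ² comparison of two word frequency lists might look. It assumes the statistic is summed over the most frequent words of the combined corpora, with expected counts proportional to corpus size, and that homogeneity is estimated by comparing random halves of a single corpus. The function names, the `top_n` cutoff, and the number of random splits are hypothetical choices for illustration, not values taken from the paper.

```python
import random
from collections import Counter


def chi_square_similarity(tokens_a, tokens_b, top_n=500):
    """Sum a chi-square statistic over the top_n most frequent joint words.

    Assumption: expected counts are the joint count scaled by each
    corpus's share of the total tokens. Lower values mean the two
    frequency profiles look more alike.
    """
    size_a, size_b = len(tokens_a), len(tokens_b)
    if size_a == 0 or size_b == 0:
        raise ValueError("both corpora must be non-empty")
    freq_a, freq_b = Counter(tokens_a), Counter(tokens_b)
    joint = freq_a + freq_b
    total = size_a + size_b
    chi2 = 0.0
    for word, joint_count in joint.most_common(top_n):
        for observed, size in ((freq_a[word], size_a), (freq_b[word], size_b)):
            expected = joint_count * size / total
            chi2 += (observed - expected) ** 2 / expected
    return chi2


def homogeneity(documents, top_n=500, n_splits=10, seed=0):
    """Average the same statistic over random half-vs-half splits.

    Assumption: `documents` is a list of token lists, and splitting is
    done at document level. Lower values mean a more homogeneous corpus.
    """
    docs = list(documents)
    if len(docs) < 2:
        raise ValueError("need at least two documents to split")
    rng = random.Random(seed)
    scores = []
    for _ in range(n_splits):
        rng.shuffle(docs)
        half = len(docs) // 2
        first = [tok for doc in docs[:half] for tok in doc]
        second = [tok for doc in docs[half:] for tok in doc]
        scores.append(chi_square_similarity(first, second, top_n))
    return sum(scores) / len(scores)
```

Under these assumptions, a similarity score between two corpora is only meaningful relative to the homogeneity scores of each corpus on its own: a between-corpus value no larger than the within-corpus values gives little evidence that the corpora differ.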
