Author: Adam Kilgarriff
DOI:
Keywords:
Abstract: How similar are two corpora? A measure of corpus similarity would be very useful for lexicography and language engineering. Word frequency lists are cheap and easy to generate, so a measure based on them would be of use as a quick guide in many circumstances; for example, to judge how a newly available corpus is related to existing resources, or how easy it might be to port an NLP system designed to work with one text type to work with another. We show that corpus similarity can only be interpreted in the light of corpus homogeneity. The paper presents a measure, based on the χ² statistic, for measuring both corpus similarity and corpus homogeneity. The measure is compared with a rank-based measure and shown to outperform it. Some results are presented. A method for evaluating the accuracy of the measure is introduced, and some results of using the measure are presented.
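
The abstract names a χ² statistic computed over word frequency lists but gives no formula. The following is a minimal Python sketch of one way such a measure could be computed. It is not taken from the paper: the function name `chi_square_distance`, the `top_n` cutoff, and the choice of expected counts proportional to corpus size are all assumptions made for illustration.

```python
from collections import Counter


def chi_square_distance(tokens_a, tokens_b, top_n=500):
    """Chi-square statistic over the top_n most frequent words of the
    two corpora combined. Lower values mean more similar profiles.

    Hypothetical helper: the cutoff and the expected-count formula are
    assumptions, not details stated in the abstract.
    """
    size_a, size_b = len(tokens_a), len(tokens_b)
    if size_a == 0 or size_b == 0:
        raise ValueError("both corpora must contain at least one token")

    freq_a, freq_b = Counter(tokens_a), Counter(tokens_b)
    joint = freq_a + freq_b
    total = size_a + size_b

    chi2 = 0.0
    for word, joint_count in joint.most_common(top_n):
        # Expected counts if the word were spread across the two
        # corpora in proportion to their sizes.
        expected_a = joint_count * size_a / total
        expected_b = joint_count * size_b / total
        chi2 += (freq_a[word] - expected_a) ** 2 / expected_a
        chi2 += (freq_b[word] - expected_b) ** 2 / expected_b
    return chi2


corpus_a = "the cat sat on the mat".split()
corpus_b = "the dog ran in the park".split()
same = chi_square_distance(corpus_a, corpus_a)
different = chi_square_distance(corpus_a, corpus_b)
```

Under the same assumptions, a homogeneity score for a single corpus could be approximated by splitting it into two halves and passing both halves to the same function, which would make similarity and homogeneity values directly comparable.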