Statistical Corpus and Language Comparison on Comparable Corpora

Authors: Thomas Eckart, Uwe Quasthoff

DOI: 10.1007/978-3-642-20128-8_8

Abstract: With the wide availability of textual data in various languages, domains and registers, it is easy to create text corpora for a variety of applications. These include, among many others, the field of Natural Language Processing. The Leipzig Corpora Collection has created and used such corpora for more than fifteen years. However, the work on preprocessing distributed resources to ensure homogeneity, and thus comparability, is a steady process. As a result, corpora are created in identical formats that allow the use of different statistical methods to generate data for manual or automatic analysis. These are the basis for applications in intra- and inter-language comparison and in quality assurance of corpus stocks.

References (8)
Peter Grzybek. History and Methodology of Word Length Studies. Springer, Dordrecht, pp. 15-90 (2007). DOI: 10.1007/978-1-4020-4068-9_2
Adam Kilgarriff. Using Word Frequency Lists to Measure Corpus Homogeneity and Similarity between Corpora. Journal of Visual Languages and Computing (1997).
Duncan J. Watts, Steven H. Strogatz. Collective dynamics of small-world networks. Nature, vol. 393, pp. 440-442 (1998). DOI: 10.1038/30918
Ted Dunning. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, vol. 19, pp. 61-74 (1993).
Y. Li, D. McLean, Z.A. Bandar, J.D. O'Shea, K. Crockett. Sentence similarity based on semantic nets and corpus statistics. IEEE Transactions on Knowledge and Data Engineering, vol. 18, pp. 1138-1150 (2006). DOI: 10.1109/TKDE.2006.130
Yuen Ren Chao, George Kingsley Zipf. Human Behavior and the Principle of Least Effort: An Introduction to Human Ecology. Language, vol. 26, pp. 394- (1950). DOI: 10.2307/409735
W. Detmar Meurers, Markus Dickinson. Detecting annotation errors in spoken language corpora. Copenhagen Studies in Language, pp. 53-66 (2006).
Christian Biemann, Uwe Quasthoff, Matthias Richter. Corpus Portal for Search in Monolingual Corpora. Language Resources and Evaluation, pp. 1799-1802 (2006).