Authors: Adam Kilgarriff, Vít Suchomel, Pavel Rychlý, Vojtěch Kovář, Miloš Jakubíček
DOI:
Keywords:
Abstract: Everyone working on general language would like their corpus to be bigger, wider-coverage, cleaner, duplicate-free, and with richer metadata. In this paper we describe our programme to build ever better corpora along these lines for all of the world’s major languages (plus some others). Baroni and Kilgarriff (2006), Sharoff et al (2009), (2010) present the case for web corpora and the programmes in which a number of them have been developed. The TenTens are a development from these: a new family of corpora of the order of 10 billion words. We describe how we are building them, what we have built so far, and how we shall continue maintaining them and keeping them up to date in the years ahead. While, as yet, they have very little metadata, we are working to gather and add metadata attribute by attribute. The corpora are available for research at http://www.sketchengine.co.uk.