The TenTen Corpus Family

Authors: Adam Kilgarriff, Vít Suchomel, Pavel Rychlý, Vojtěch Kovář, Miloš Jakubíček

DOI:

Keywords:

Abstract: Everyone working on general language would like their corpus to be bigger, wider-coverage, cleaner, duplicate-free, and with richer metadata. In this paper we describe our programme to build ever better corpora along these lines for all of the world’s major languages (plus some others). Baroni and Kilgarriff (2006), Sharoff et al (2009), (2010) present the case for web corpora and programmes in which a number of them have been developed. The TenTens are a development from these: a new family of corpora of the order of 10 billion words. We describe how we are building them, what we have built so far, and how we shall continue maintaining them and keeping them up to date in the years ahead. While, as yet, they have very little metadata, we are working to gather and add metadata attribute by attribute. The corpora are available for research at http://www.sketchengine.co.uk.

References
Adam Kilgarriff, Jan Pomikálek, Siva Reddy, Avinesh PVS (2010). A Corpus Factory for Many Languages. Language Resources and Evaluation.
Marco Baroni, Adam Kilgarriff (2006). Large linguistically-processed web corpora for multiple languages. Conference of the European Chapter of the Association for Computational Linguistics, pp. 87-90. DOI: 10.3115/1608974.1608976
Marco Baroni, Silvia Bernardini, Adriano Ferraresi, Eros Zanchetta (2009). The WaCky wide web: a collection of very large linguistically processed web-crawled corpora. Language Resources and Evaluation, vol. 43, pp. 209-226. DOI: 10.1007/S10579-009-9081-4
Adam Kilgarriff (2009). Simple Maths for Keywords. Proceedings of the Corpus Linguistics Conference 2009 (CL2009), p. 171.
Adam Kilgarriff, Pavel Rychlý, Milos Husak, Michael Rundell, Katy McAdam (2008). GDEX: Automatically Finding Good Dictionary Examples in a Corpus. Proceedings of the XIII EURALEX International Congress (Barcelona, 15-19 July 2008), ISBN 978-84-96742-67-3, pp. 425-432.
Vít Suchomel, Jan Pomikálek (2012). Efficient Web Crawling for Large Text Corpora. pp. 39-43.
Jan Pomikálek (2011). Removing Boilerplate and Duplicate Content from Web Corpora. Masarykova univerzita.
Jeremy Clear (1986). Trawling the language: Monitor corpora. Proceedings of the 2nd EURALEX International Congress, pp. 383-389.
Adam Kilgarriff, Pavel Rychlý, Pavel Smrž, David Tugwell (2004). The Sketch Engine. Proceedings of the 11th EURALEX International Congress, pp. 105-115.