A Web Corpus and Word Sketches for Japanese

Authors: Irena Srdanovic Erjavec, Tomaz Erjavec, Adam Kilgarriff

DOI: 10.5715/JNLP.15.2_137

Keywords: Word lists by frequency, Grammar, Artificial intelligence, Sketch, Resource (project management), Natural language processing, Word (computer architecture), Corpus linguistics, Thesaurus (information retrieval), Computer science, Encoding (semiotics)

Abstract: Of all the major world languages, Japanese is lagging behind in terms of publicly accessible and searchable corpora. In this paper we describe the development of JpWaC (Japanese Web as Corpus), a large corpus of 400 million words of Japanese web text, and its encoding for the Sketch Engine. The Sketch Engine is a web-based corpus query tool that supports fast concordancing, grammatical processing, ‘word sketching’ (one-page summaries of a word’s grammatical and collocational behaviour), a distributional thesaurus, and robot use. We describe the steps taken to gather and process the corpus and to establish its validity, in terms of the kinds of language it contains. We then describe the development of a shallow grammar for Japanese to enable word sketching. We believe that the Japanese web corpus as loaded into the Sketch Engine will be a useful resource for a wide number of Japanese researchers, learners, and NLP developers.
