Authors: Irena Srdanovic Erjavec, Tomaz Erjavec, Adam Kilgarriff
Keywords: Word lists by frequency, Grammar, Artificial intelligence, Sketch, Resource (project management), Natural language processing, Word (computer architecture), Corpus linguistics, Thesaurus (information retrieval), Computer science, Encoding (semiotics)
Abstract: Of all the major world languages, Japanese is lagging behind in terms of publicly accessible and searchable corpora. In this paper we describe the development of JpWaC (Japanese Web as Corpus), a large corpus of 400 million words of Japanese web text, and its encoding for the Sketch Engine. The Sketch Engine is a web-based corpus query tool that supports fast concordancing, grammatical processing, ‘word sketching’ (one-page summaries of a word’s grammatical and collocational behaviour), a distributional thesaurus, and robot use. We describe the steps taken to gather and process the corpus and to establish its validity, in terms of the kinds of language it contains. We then describe the development of a shallow grammar for Japanese to enable word sketching. We believe that the Japanese web corpus, as loaded into the Sketch Engine, will be a useful resource for a wide number of Japanese researchers, learners, and NLP developers.