Authors: Kristina Hmeljak Sangawa, Yoshiko Kawamura, Tomaž Erjavec
DOI:
Keywords: Heuristics, Lemma (morphology), Sentence, Computer science, Japanese language, Linguistics, Sentence length, Artificial intelligence, English language, Proficiency test, Natural language processing, Word (group theory)
Abstract: The corpus contains over 300 million words, with annotations of words and sentences describing their difficulty levels. Words are assigned levels according to the Japanese Language Proficiency Test Content Specifications (2004). Sentence level is computed using various heuristics, based on the difficulty of the words, sentence length, etc. The corpus was collected from the Web with WaCkY tools, then part-of-speech tagged and lemmatised with Chasen. Chasen tags have also been converted to English language tags. The corpora are made available in vertical format. Structural attributes mark texts and sentences. Each text gives its @url and @domain. Sentences have the @level attribute, which describes the sentence level. The positional attributes are: 1. token, as it appears in the text; 2. lemma of the word; 3. tag, translated to English; 4. original tag in Japanese; 5. level of the word. The complete corpus is split into sub-corpora of the same …
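To make the layout described in the abstract concrete, the following is a minimal sketch of how such a vertical file might be read. It rests on several assumptions that the record does not confirm: that the structural elements are named `text` and `s`, that the five positional attributes are tab-separated in the order listed, and that levels are stored as plain strings. The `Token` type, the `read_vertical` function, and the sample data (including its tags and level values) are invented for illustration only.

```python
import re
from typing import Dict, Iterable, Iterator, List, NamedTuple, Optional, Tuple


class Token(NamedTuple):
    """One token line; field order follows the list in the abstract."""
    token: str       # 1. token, as it appears in the text
    lemma: str       # 2. lemma of the word
    tag_en: str      # 3. tag, translated to English
    tag_ja: str      # 4. original tag in Japanese
    word_level: str  # 5. level of the word


# Assumed element names "text" and "s"; the real corpus may differ.
TAG_RE = re.compile(r'^<(/?)(text|s)\b([^>]*)>$')
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')

Sentence = Tuple[Dict[str, str], Optional[str], List[Token]]


def read_vertical(lines: Iterable[str]) -> Iterator[Sentence]:
    """Yield (text attributes, sentence level, tokens) for each sentence."""
    text_attrs: Dict[str, str] = {}
    level: Optional[str] = None
    tokens: List[Token] = []
    for raw in lines:
        line = raw.rstrip("\n")
        match = TAG_RE.match(line)
        if match:
            closing, name, attrs = match.groups()
            if name == "text" and not closing:
                text_attrs = dict(ATTR_RE.findall(attrs))
            elif name == "s" and not closing:
                level = dict(ATTR_RE.findall(attrs)).get("level")
                tokens = []
            elif name == "s" and closing:
                yield text_attrs, level, tokens
                tokens = []
            continue
        fields = line.split("\t")
        if len(fields) == 5:  # skip lines that do not fit the assumed layout
            tokens.append(Token(*fields))


# Invented sample; the tags and level values are placeholders, not corpus data.
SAMPLE = (
    '<text url="http://example.com/a" domain="example.com">\n'
    '<s level="4">\n'
    '猫\t猫\tN\t名詞\t4\n'
    'です\tです\tAux\t助動詞\t4\n'
    '</s>\n'
    '</text>\n'
)

if __name__ == "__main__":
    for attrs, sent_level, toks in read_vertical(SAMPLE.splitlines()):
        print(attrs["domain"], sent_level, [t.lemma for t in toks])
```

Yielding one sentence at a time keeps memory use flat, which matters for a corpus of this size; filtering sentences by `@level` then becomes a simple condition on the second element of each yielded tuple.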