Japanese web corpus with difficulty levels jpWaC-L 1.0

Authors: Kristina Hmeljak Sangawa, Yoshiko Kawamura, Tomaž Erjavec

DOI:

Keywords: Heuristics, Lemma (morphology), Sentence, Computer science, Japanese language, Linguistics, Sentence length, Artificial intelligence, English language, Proficiency test, Natural language processing, Word (group theory)

Abstract: The corpus contains over 300 million words, with annotations of words and sentences describing their difficulty levels. Words are assigned levels according to the Japanese Language Proficiency Test Content Specifications (2004). Sentence level is computed using various heuristics, based on (difficulty of) words, sentence length, etc. The corpus was collected from the Web with the WaCkY tools, and part-of-speech tagged and lemmatised with Chasen. The Chasen tags have also been converted to English language tags. The corpora are made available in vertical format. Structural attributes mark texts and sentences. Each text gives its @url and @domain. Sentences have the @level attribute, which describes the sentence level. The positional attributes are: 1. token, as it appears in the text; 2. lemma of the word; 3. tag, translated to English; 4. original tag in Japanese; 5. level of the word. The complete corpus is split into sub-corpora of the same level.
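
The vertical format described in the abstract can be read with a short script. The following is a minimal sketch, not the distribution's own tooling. It assumes the usual vertical-file conventions: one token per line, five tab-separated positional attributes in the order listed in the abstract, and structural elements on their own lines. The element names `<text>` and `<s>`, the sample tags, and the level values are hypothetical and should be checked against the actual files.

```python
import re
from typing import Dict, Iterable, Iterator, List, NamedTuple, Optional, Tuple


class Token(NamedTuple):
    token: str   # 1. token, as it appears in the text
    lemma: str   # 2. lemma of the word
    tag_en: str  # 3. tag, translated to English
    tag_ja: str  # 4. original tag in Japanese
    level: str   # 5. level of the word


# Matches name="value" pairs inside a structural tag line.
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')

Sentence = Tuple[Dict[str, str], Optional[str], List[Token]]


def parse_vertical(lines: Iterable[str]) -> Iterator[Sentence]:
    """Yield (text attributes, sentence level, tokens) for each sentence."""
    text_attrs: Dict[str, str] = {}
    sent_level: Optional[str] = None
    tokens: Optional[List[Token]] = None  # None while outside a sentence

    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        if line.startswith("<text"):        # assumed element name
            text_attrs = dict(ATTR_RE.findall(line))
        elif line.startswith("</text"):
            text_attrs = {}
        elif line.startswith("<s ") or line == "<s>":  # assumed element name
            sent_level = dict(ATTR_RE.findall(line)).get("level")
            tokens = []
        elif line == "</s>":
            if tokens is not None:
                yield text_attrs, sent_level, tokens
            tokens = None
        elif tokens is not None:
            fields = line.split("\t")
            if len(fields) == 5:            # skip malformed token lines
                tokens.append(Token(*fields))


# Hypothetical sample; tags and level values are invented for illustration.
SAMPLE = (
    '<text url="http://example.org/a" domain="example.org">\n'
    '<s level="4">\n'
    "猫\t猫\tN.g\t名詞-一般\t4\n"
    "です\tです\tAux\t助動詞\t4\n"
    "</s>\n"
    "</text>\n"
)

for attrs, level, toks in parse_vertical(SAMPLE.splitlines()):
    print(attrs["domain"], level, [t.lemma for t in toks])
```

Yielding one sentence at a time keeps memory use flat, which matters for a corpus of this size. The sentence-level value is carried alongside the tokens so that sentences can be filtered by difficulty without a second pass.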

clarin.si
