Japanese web corpus with difficulty levels jpWaC-L 1.0

Authors: Kristina Hmeljak Sangawa, Yoshiko Kawamura, Tomaž Erjavec

DOI:

Keywords: Heuristics; Lemma (morphology); Sentence; Computer science; Japanese language; Linguistics; Sentence length; Artificial intelligence; English language; Proficiency test; Natural language processing; Word (group theory)

Abstract: The corpus contains over 300 million words, with annotations of words and sentences describing their difficulty levels. Words are assigned levels according to the Japanese Language Proficiency Test Content Specifications (2004). Sentence level is computed using various heuristics, based on (difficulty of) words, sentence length, etc. The corpus was collected from the Web with WaCkY tools, and part-of-speech tagged and lemmatised with Chasen. The Chasen tags have also been converted to English language tags. The corpora are made available in the vertical format. Structural attributes mark each text and each sentence. Each text gives its @url and @domain. Sentences have the @level attribute, which describes the difficulty level. The positional attributes are:
1. token, as it appears in the text
2. lemma of the word
3. tag, translated to English
4. original tag in Japanese
5. level of the word
The complete corpus is split into sub-corpora with the same ...
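The vertical format described in the abstract can be read with a few lines of Python. The following is a minimal sketch, not the distributors' tooling. It assumes that the structural elements are written as `<text ...>` and `<s ...>` tags on their own lines, that the five positional attributes are tab-separated in the order listed above, and that levels are stored as plain strings. The element names, the sample URL, the tag values, and the level values are all hypothetical and only illustrate the shape of the data.

```python
import re
from typing import Dict, Iterable, Iterator, List, NamedTuple, Tuple


class Token(NamedTuple):
    """One token line: the five positional attributes listed in the abstract."""
    token: str   # 1. token, as it appears in the text
    lemma: str   # 2. lemma of the word
    tag_en: str  # 3. tag, translated to English
    tag_ja: str  # 4. original tag in Japanese
    level: str   # 5. level of the word


# Matches name="value" pairs inside a structural tag line.
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')

Sentence = Tuple[Dict[str, str], Dict[str, str], List[Token]]


def read_vertical(lines: Iterable[str]) -> Iterator[Sentence]:
    """Yield (text attributes, sentence attributes, tokens) per sentence.

    Assumes <text ...> and <s ...> as element names (an assumption, not
    confirmed by the abstract) and tab-separated token lines.
    """
    text_attrs: Dict[str, str] = {}
    sent_attrs: Dict[str, str] = {}
    tokens: List[Token] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        if line.startswith("<text"):
            text_attrs = dict(ATTR_RE.findall(line))
        elif line.startswith("<s ") or line == "<s>":
            sent_attrs = dict(ATTR_RE.findall(line))
            tokens = []
        elif line == "</s>":
            yield text_attrs, sent_attrs, tokens
        elif line.startswith("</"):
            continue  # other closing tags, e.g. </text>
        else:
            fields = line.split("\t")
            if len(fields) == 5:  # skip lines that do not fit the layout
                tokens.append(Token(*fields))


# Invented sample data: tags and levels are placeholders, not real corpus values.
SAMPLE = (
    '<text url="http://example.org/page" domain="example.org">\n'
    '<s level="4">\n'
    '猫\t猫\tN\t名詞\t3\n'
    'です\tです\tAux\t助動詞\t4\n'
    '</s>\n'
    '</text>\n'
)

for text_attrs, sent_attrs, tokens in read_vertical(SAMPLE.splitlines()):
    print(text_attrs.get("domain"), sent_attrs.get("level"),
          [t.lemma for t in tokens])
```

Yielding one sentence at a time keeps memory use flat, which matters for a corpus of this size; a real file would be passed as an open file handle instead of `SAMPLE.splitlines()`.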
