Corpus of Slovenian school texts SBSJ 1.0

Authors: Kozma Ahačič, Simon Atelšek, Tomaž Erjavec, Peter Holozan, Nataša Jakop

DOI:

Keywords:

Abstract: The Corpus of Slovenian school texts (SBSJ 1.0) is a lemmatized and POS-tagged specialized corpus of 428 short school texts, written primarily by primary-school students in grades 1 to 5 between 2017 and 2020. The corpus contains approximately 95,000 tokens and was designed as one of the resources for compiling The School Dictionary of the Slovenian Language, which is being created as part of the project Franček Web Portal, Language Counselling for Slovene Teachers and School Dictionary of the Slovene Language. The corpus was lemmatized and POS-tagged with the Obeliks tagger (http://oznacevalnik.slovenscina.eu/Vsebine/Sl/ProgramskaOprema/Navodila.aspx) using JOS morphosyntactic descriptions. It is encoded in XML and complies with the TEI specifications as given in the CLARIN.SI customisation (https://github.com/clarinsi/TEI-schema). Note that the corpus is …
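Because the corpus is distributed as TEI XML with lemmas and morphosyntactic tags, a reader may want to see how such annotation can be read programmatically. The following is only a minimal sketch under stated assumptions: the sample fragment is hypothetical and not taken from SBSJ, and it assumes that tokens are encoded as TEI `w` elements carrying a `lemma` attribute and an `ana` attribute for the morphosyntactic description. The actual element and attribute names should be checked against the corpus files and the CLARIN.SI schema.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Standard TEI namespace.
TEI_NS = "http://www.tei-c.org/ns/1.0"

# Hypothetical fragment for illustration only; it is NOT taken from the corpus.
# Assumption: word tokens are <w> elements with "lemma" and "ana" attributes.
# The tag values below are illustrative placeholders.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body><p><s>
    <w lemma="pes" ana="mte:Ncmsn">Pes</w>
    <w lemma="lajati" ana="mte:Vmpr3s">laja</w>
    <pc ana="mte:Z">.</pc>
  </s></p></body></text>
</TEI>"""


def read_tokens(xml_text):
    """Return (form, lemma, tag) triples for every <w> element."""
    root = ET.fromstring(xml_text)
    word_tag = f"{{{TEI_NS}}}w"
    tokens = []
    for element in root.iter():
        if element.tag == word_tag:
            tokens.append((element.text, element.get("lemma"), element.get("ana")))
    return tokens


tokens = read_tokens(SAMPLE)
# Count how often each lemma occurs, e.g. as a first step toward a word list.
lemma_counts = Counter(lemma for _, lemma, _ in tokens)
```

Punctuation (`pc`) is skipped here, so the counts cover word tokens only. For a real corpus file, `ET.parse(path).getroot()` would replace `ET.fromstring`.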
