Authors: NATAŠA LOGAR BERGINC, SIMON KREK
DOI:
Keywords:
Abstract: The paper presents three publicly available corpora of contemporary Slovene: a) Gigafida, a monolingual dynamic corpus of written language (1 billion words); b) KRES, a balanced subcorpus of written language (100 million words); c) GOS, a reference corpus of spoken Slovene (1 million words). The spoken and written data have been compiled since 2008. The billion-word corpus has already been compiled; it is lemmatized and morpho-syntactically tagged, as well as partly syntactically annotated. A wide range of linguistic information can be retrieved from it, including syntactic and semantic information as well as phraseology. Moreover, the corpus constitutes the basis for a lexical database and a modern corpus-based grammar, both of which are being developed within the project. The larger corpus is also the foundation of the balanced subcorpus of written language. The paper compares the main features of the two …