Authors: Peter Holozan, Miha Grčar, Tomaž Erjavec, Nataša Logar, Simon Krek
DOI:
Keywords: Encoding (semiotics), Newspaper, Artificial intelligence, Natural language processing, Corpus linguistics, XML, Computer science
Abstract: The ccKres corpus consists of 9,376 documents, each with information about its source (e.g. newspapers, magazines), year of publication, text type (e.g. fiction, newspaper), and title and author where known. The corpus is POS-tagged and lemmatised, and is encoded in XML following the TEI format (Text Encoding Initiative P5). It contains approximately 9% of the Kres corpus, a balanced corpus of Slovene: http://eng.slovenscina.eu/korpusi/kres.