Authors: Peter Holozan, Miha Grčar, Tomaž Erjavec, Nataša Logar, Simon Krek
DOI:
Keywords:
Abstract: The ccGigafida corpus consists of paragraph samples from 31,722 documents, each with information about the source (e.g. newspapers, magazines), year of publication, text type (fiction, newspaper), and the title and author if they are known. The corpus is annotated with morphosyntactic descriptions (PoS-tagged) and lemmatised. It is encoded in XML TEI format (Text Encoding Initiative P5). The corpus contains approximately 9% of the Gigafida corpus, a reference corpus of Slovene: http://eng.slovenscina.eu/korpusi/gigafida. The corpus is available in TEI and in a simpler, smaller TEI-like vertical format, used by various concordancers. The TEI file has PoS (MSD) tags in Slovenian only, while the vertical file has both Slovenian and English tags. The corpus is also available as plain text, one file per text.
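
To illustrate the vertical format mentioned in the abstract, here is a minimal Python sketch of how such a file might be read. It assumes the common concordancer convention of one token per line with tab-separated columns, with structural tags on their own lines. The column order (word, lemma, MSD), the `Token` type, the `read_vertical` function, and the sample lines are hypothetical and not taken from the corpus; check the actual distribution for the real layout.

```python
from typing import Iterable, Iterator, NamedTuple


class Token(NamedTuple):
    """One token line; the column order is an assumption."""
    word: str
    lemma: str
    msd: str


def read_vertical(lines: Iterable[str]) -> Iterator[Token]:
    """Yield tokens from vertical-format lines.

    Lines with fewer than three tab-separated fields (for example
    structural tags such as <p> or <s>, and blank lines) are skipped.
    """
    for raw in lines:
        fields = raw.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue
        yield Token(fields[0], fields[1], fields[2])


# Illustrative sample only, not real corpus content.
SAMPLE = (
    "<p>\n"
    "<s>\n"
    "To\tta\tPd-nsn\n"
    "je\tbiti\tVa-r3s-n\n"
    "korpus\tkorpus\tNcmsn\n"
    ".\t.\tZ\n"
    "</s>\n"
    "</p>\n"
)

tokens = list(read_vertical(SAMPLE.splitlines()))
lemmas = [t.lemma for t in tokens]
```

Since the vertical file is described as carrying both Slovenian and English tags, the real files probably have more columns than this sketch reads; extra columns are ignored here rather than interpreted.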