Czech Text Document Corpus v 2.0

Authors: Ladislav Lenc, Pavel Král

DOI:

Keywords: Layer (object-oriented design), Czech, Information retrieval, Text document, Document classification, Agency (sociology), Computer science, Order (business)

Abstract: This paper introduces "Czech Text Document Corpus v 2.0", a collection of text documents for automatic document classification in the Czech language. It is composed of text documents provided by the Czech News Agency and is freely available for research purposes at this http URL. The corpus was created in order to facilitate a straightforward comparison of document classification approaches on Czech data. It is particularly dedicated to the evaluation of multi-label document classification approaches, because one document is usually labelled with more than one label. Besides the information about the document classes, the corpus is also annotated at the morphological layer. The paper further shows the results of selected state-of-the-art methods on this corpus, to offer the possibility of an easy comparison with these approaches.

References (4)
David Martin Ward Powers, Evaluation: from Precision, Recall and F-measure to ROC, Informedness, Markedness and Correlation. arXiv: Learning, vol. 2, pp. 37-63 (2011)
Tomáš Brychcín, Pavel Král, Novel Unsupervised Features for Czech Multi-label Document Classification. Mexican International Conference on Artificial Intelligence, pp. 70-79 (2014), 10.1007/978-3-319-13647-9_8
Milan Straka, Jana Straková, Tokenizing, POS Tagging, Lemmatizing and Parsing UD 2.0 with UDPipe. Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, August 3-4, 2017, Vancouver, Canada, ISBN 978-1-945626-70-8, pp. 88-99 (2017), 10.18653/V1/K17-3009
Ladislav Lenc, Pavel Král, Deep Neural Networks for Czech Multi-label Document Classification. arXiv: Computation and Language (2017), 10.1007/978-3-319-75487-1_36