The Icelandic Parsed Historical Corpus (IcePaHC)

Authors: Anton Karl Ingason, Joel Wallenberg, Eiríkur Rögnvaldsson, Einar Freyr Sigurðsson

Abstract: The Icelandic Parsed Historical Corpus (IcePaHC) is a manually corrected treebank, parsed according to the annotation guidelines of the Penn Parsed Corpora of Historical English (PPCHE), with minor modifications that are specific to Icelandic. It consists of about 1 million words from the 12th century to the 21st. The samples in the corpus are close to being evenly distributed over this period. Most of the texts are narratives and religious material, but some other genres are also included. The file format is labeled bracketing, as in the Penn Treebank, with UTF-8 encoding. The corpus is released under a CC BY 4.0 license.
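The labeled-bracketing format mentioned in the abstract can be read with a small recursive parser. The following is a minimal sketch, not the corpus's own tooling: the sample tree is an invented toy string, and its labels (`IP-MAT`, `NP-SBJ`, `PRO-N`, `VBDI`, `ADVP`, `ADV`) are assumptions chosen only to resemble Penn-style tags. The function names `tokenize`, `parse`, and `leaves` are hypothetical.

```python
from typing import List, Tuple

# Invented toy tree for illustration only; it is NOT taken from the corpus.
SAMPLE = "(IP-MAT (NP-SBJ (PRO-N hún)) (VBDI kom) (ADVP (ADV heim)))"


def tokenize(text: str) -> List[str]:
    # Pad parentheses with spaces so split() separates them from labels and words.
    return text.replace("(", " ( ").replace(")", " ) ").split()


def parse(tokens: List[str], pos: int = 0) -> Tuple[list, int]:
    # Parse one bracketed node starting at tokens[pos].
    # Returns the node as a nested list and the index just past its closing ")".
    if tokens[pos] != "(":
        raise ValueError(f"expected '(' at token {pos}")
    pos += 1
    node: list = []
    while tokens[pos] != ")":
        if tokens[pos] == "(":
            child, pos = parse(tokens, pos)
            node.append(child)
        else:
            node.append(tokens[pos])
            pos += 1
    return node, pos + 1


def leaves(node: list) -> List[str]:
    # A [label, word] pair is a terminal; return its word.
    if len(node) == 2 and all(isinstance(x, str) for x in node):
        return [node[1]]
    # Otherwise recurse into child nodes and skip the label string.
    words: List[str] = []
    for child in node:
        if isinstance(child, list):
            words.extend(leaves(child))
    return words


tokens = tokenize(SAMPLE)
tree, end = parse(tokens)
print(tree[0])       # root label
print(leaves(tree))  # terminal words in order
```

When reading real corpus files, passing `encoding="utf-8"` to Python's built-in `open` matches the encoding stated in the abstract. The sketch assumes well-balanced brackets and would raise an `IndexError` on truncated input.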
