Authors: Nikola Ljubešić, Željko Agić, Danijela Merkler
DOI:
Keywords:
Abstract: We investigate state-of-the-art statistical models for lemmatization and morphosyntactic tagging of Croatian and Serbian. The models stem from a new manually annotated SETIMES.HR corpus of Croatian, based on the SETimes parallel corpus. We train models on Croatian text and evaluate them on samples of Croatian and Serbian from the SETimes corpus and the two Wikipedias. Lemmatization accuracy for the two languages reaches 97.87% and 96.30%, while full morphosyntactic tagging accuracy using a 600-tag tagset peaks at 87.72% and 85.56%, respectively. Part-of-speech tagging accuracies reach 97.13% and 96.46%. Results indicate that more complex methods of Croatian-to-Serbian annotation projection are not required on such dataset sizes for these particular tasks. The SETIMES.HR corpus, its resulting models and test sets are all made freely available.