Authors: Riyaz Ahmad Bhat, Dipti Misra Sharma
DOI:
Keywords:
Abstract: In this paper we describe a treebanking effort, currently underway, for Urdu, a South Asian language. The treebank is built from a newspaper corpus and uses a Karaka-based grammatical framework inspired by Paninian theory. Thus far, 3366 sentences (0.1M words) have been annotated with linguistic information at the morpho-syntactic (morphological, part-of-speech and chunk information) and syntactico-semantic (dependency) levels. This work also aims to evaluate the correctness or reliability of this manually annotated dependency treebank. Evaluation is done by measuring inter-annotator agreement on a manually annotated data set of 196 sentences (5600 words) annotated by two annotators. We present a qualitative analysis of the agreement statistics and identify possible reasons for disagreement between the annotators. We also show the syntactic annotation of some constructions specific to Urdu, like Ezafe, and discuss the problem of word segmentation (tokenization).