Building Computational Resources: The URDU.KON-TB Treebank and the Urdu Parser

Author: Qaiser Abbas

Keywords: Phrase structure rules, Natural language processing, Treebank, Parsing, Artificial intelligence, Computer science, Part of speech, Context-free grammar, Grammar, Language identification, Dependency grammar

Abstract: This work presents the development of the URDU.KON-TB treebank, its annotation evaluation and guidelines, and the construction of an Urdu parser for the South Asian language Urdu. Urdu is a comparatively under-resourced language, and a reliable treebank will have a significant impact on the state of the art in automatic Urdu language processing. The work includes a raw corpus containing 1400 sentences collected from Wikipedia and the Jang newspaper. The corpus contains text from local and international news, social stories, sports, culture, finance, religion, traveling, etc. The hierarchical annotation scheme adopted combines a phrase structure with a hyper dependency structure. A semi-semantic part of speech tag set, a syntactic tag set and a functional tag set are proposed, which were further revised during the annotation of the corpus. The annotation was performed manually. Due to the addition of morphology, part of speech, syntactical, semantical, clausal, grammatical and miscellaneous features, the annotation is linguistically rich. It resulted in a treebank for Urdu, called URDU.KON-TB, which is presented in Chapter 3. For the evaluation of the annotation scheme, Krippendorff’s α coefficient, a statistical measure of inter-annotator agreement, was selected. 100 randomly selected sentences were given to five trained annotators for annotation, and the annotated sentences were then evaluated using the α coefficient. The α values of agreement obtained for part of speech, syntactical and functional annotation are 0.964, 0.817 and 0.806, respectively. The evaluation is presented in Chapter 4. All three values lie in the range of perfect agreement. The annotation guidelines were devised after this evaluation; the updated version is presented in Chapter 2. For the parser, the treebank is divided into 80% training data and 20% test data. A context-free grammar is extracted from the training data and used for parser development. A held-out set of 10%, 140 sentences with an average length of 13.73 words per sentence, is used with the parser. The parser is an extension of the dynamic programming algorithm known as the Earley parsing algorithm. The extensions made are discussed in Chapter 5, along with the issues faced. Items that can occur in normal text are considered, e.g., punctuation, null elements, diacritics, headings, regard titles, Hadees (the statements of prophets), anaphora within a sentence, and others.
PARSEVAL measures are used to evaluate the results. By applying a sufficiently rich grammar model, the parser gives an f-score of 87% and outperforms the multi-path-shift-reduce parser, the two stage Hindi parser and a simple parser, with 4.8%, 12.48% and 22% increases in recall, respectively. This work contributes to the overall computational resources for Urdu. By-products include the tagset, the annotation guidelines and, with sufficient encoded information, a morphologically tagged corpus that can be used to train taggers. These resources can enhance natural language processing tasks such as probabilistic parsing, training of POS taggers, disambiguation of spoken sentences, grammar development, language identification, sources for linguistic inquiry and psychological modeling, or pattern matching.
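As an illustration of the PARSEVAL measures mentioned in the abstract, the following is a minimal Python sketch of labeled bracket precision, recall and f-score for a single pair of trees. It is not the evaluation code used in the thesis. The tuple-based tree encoding, the function names `labeled_spans` and `parseval`, the toy labels and the placeholder words `w1`..`w3` are all assumptions made for this example, and preterminal (part of speech) nodes are left out of the bracket count, as is common in PARSEVAL-style scoring.

```python
from collections import Counter


def labeled_spans(tree, start=0):
    """Collect (label, start, end) brackets from a tree.

    A tree is a tuple (label, child, child, ...); a leaf is a plain string.
    Nodes whose children are all leaves (preterminals) are not counted.
    Returns (list_of_brackets, end_position).
    """
    label, *children = tree
    spans = []
    pos = start
    for child in children:
        if isinstance(child, str):
            pos += 1  # a word advances the position by one
        else:
            child_spans, pos = labeled_spans(child, pos)
            spans.extend(child_spans)
    if not all(isinstance(c, str) for c in children):
        spans.append((label, start, pos))
    return spans, pos


def parseval(gold, predicted):
    """Return (precision, recall, f_score) over labeled brackets."""
    gold_spans = Counter(labeled_spans(gold)[0])
    pred_spans = Counter(labeled_spans(predicted)[0])
    matched = sum((gold_spans & pred_spans).values())  # multiset intersection
    n_pred = sum(pred_spans.values())
    n_gold = sum(gold_spans.values())
    precision = matched / n_pred if n_pred else 0.0
    recall = matched / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical toy trees over three placeholder words.
gold = ("S", ("NP", ("N", "w1")), ("VP", ("NP", ("N", "w2")), ("V", "w3")))
pred = ("S", ("NP", ("N", "w1"), ("N", "w2")), ("VP", ("V", "w3")))

p, r, f = parseval(gold, pred)
print(f"precision={p:.3f} recall={r:.3f} f-score={f:.3f}")
```

In this toy pair only the top-level `S` bracket matches, so precision is 1/3 (one of three predicted brackets) and recall is 1/4 (one of four gold brackets). A corpus-level score would normally sum matched, predicted and gold bracket counts over all test sentences before dividing, rather than averaging per-sentence scores.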
