Author: Qaiser Abbas
DOI:
Keywords: Phrase structure rules, Natural language processing, Treebank, Parsing, Artificial intelligence, Computer science, Part of speech, Context-free grammar, Grammar, Language identification, Dependency grammar
Abstract: This work presents the development of the URDU.KON-TB treebank, its annotation evaluation and guidelines, and the construction of an Urdu parser for the South Asian language Urdu. Urdu is comparatively an under-resourced language, and a reliable treebank will have a significant impact on the state of the art in automatic Urdu language processing. The work includes a raw corpus containing 1400 sentences collected from Wikipedia and the Jang newspaper. The corpus contains text from local and international news, social stories, sports, culture, finance, religion, traveling, etc. The hierarchical annotation scheme adopted has a combination of phrase structure and hyper dependency structure. A semi-semantic part-of-speech tag set, a syntactic tag set and a functional tag set are proposed, which were further revised during annotation of the corpus. The annotation was performed manually. Due to the addition of morphological, part-of-speech, syntactic, semantic, clausal, grammatical and miscellaneous features, the annotation is linguistically rich. It resulted in a treebank for Urdu, called URDU.KON-TB, which is presented in Chapter 3. For the evaluation of the annotation scheme, Krippendorff's α coefficient was selected. It is a statistical measure used to evaluate inter-annotator agreement. One hundred randomly selected sentences were given to five trained annotators for annotation. The annotated sentences were then evaluated using the α coefficient. The α values of agreement obtained for the part-of-speech, syntactic and functional annotation are 0.964, 0.817 and 0.806, respectively. The evaluation is presented in Chapter 4. All three values lie in the range of perfect agreement. The annotation guidelines were devised after this evaluation, and an updated version is presented in Chapter 2. For the parser, the treebank was divided into 80% training data and 20% test data. A context-free grammar was extracted from the training data for parser development. A 10% held-out set of 140 sentences, with an average length of 13.73 words per sentence, was used to evaluate the parser. The parser is an extended version of the dynamic programming algorithm known as the Earley parsing algorithm. The extensions made are discussed in Chapter 5, along with the issues faced. Items that can occur in normal text are considered, e.g., punctuation, null elements, diacritics, headings, regard titles, Hadees (the statements of prophets), anaphora within a sentence, and others.
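PARSEVAL measures are used to evaluate the results. By applying a sufficiently rich model, the Urdu parser gives an f-score of 87% and outperforms the multi-path shift-reduce parser, the two-stage Hindi parser and the simple Hindi parser with increases in recall of 4.8%, 12.48% and 22%, respectively. The overall contribution of this work is computational resources for Urdu. By-products are the tagset, the annotation guidelines, and a morphologically tagged corpus with sufficient encoded information, which can be used to train taggers. These resources can be used in enhanced natural language processing, such as probabilistic parsing, training of POS taggers, disambiguation of spoken sentences, grammar development, language identification, sources of linguistic inquiry and psychological modeling, or pattern matching.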
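Note: The PARSEVAL measures mentioned in the abstract compare the labeled constituents of a parser's output with those of the gold-standard tree. The following is a minimal sketch of that computation, not the evaluation code used in the work. The function name `parseval`, the `(label, start, end)` span encoding and the toy spans are assumptions made for illustration only.

```python
from collections import Counter


def parseval(gold_brackets, pred_brackets):
    """Return (precision, recall, f1) over labeled constituent spans.

    Each bracket is a (label, start, end) tuple; this encoding is an
    assumption of the sketch, not something specified in the abstract.
    """
    gold = Counter(gold_brackets)
    pred = Counter(pred_brackets)
    # Multiset intersection counts each span at most as often as it
    # occurs in both the gold and the predicted tree.
    matched = sum((gold & pred).values())
    precision = matched / sum(pred.values()) if pred else 0.0
    recall = matched / sum(gold.values()) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical toy example: the parser mislabels one of four constituents.
gold = [("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("NP", 3, 5)]
pred = [("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("PP", 3, 5)]
precision, recall, f1 = parseval(gold, pred)
print(precision, recall, f1)
```

In this toy case three of the four predicted spans match the gold spans, so precision, recall and f-score are all 0.75. Scores over a test set are usually computed by summing matched, predicted and gold counts across all sentences before dividing.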