Detecting complex predicates in Hindi using POS projection across parallel corpora

Authors: Amitabha Mukerjee, Ankit Soni, Achla M. Raina

DOI: 10.3115/1613692.1613699

Keywords: Natural language processing, Artificial intelligence, Process (engineering), Hindi, Parallel corpora, Verb, Speech recognition, Sequence, Parsing, Computer science, Projection (relational algebra), Identification (information)

Abstract: Complex Predicates or CPs are multiword complexes functioning as single verbal units. CPs are particularly pervasive in Hindi and other Indo-Aryan languages, but a usage account driven by corpus-based identification of these constructs has not been possible, since single-language systems based on rules and statistical approaches require reliable tools (POS taggers, parsers, etc.) that are unavailable for Hindi. This paper highlights the development of the first such database, based on the simple idea of projecting POS tags across an English-Hindi parallel corpus. The CP types considered include adjective-verb (AV), noun-verb (NV), adverb-verb (Adv-V), and verb-verb (VV) composites. CPs are hypothesized where a verb in English is projected onto a multi-word sequence in Hindi. While this process misses some CPs, those that are detected appear to be more reliable (83% precision, 46% recall). The resulting database lists usage instances of 1439 CPs in 4400 sentences.
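The projection idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `hypothesize_cps`, the alignment format (a set of English-index/Hindi-index pairs, as a word aligner might produce), the use of Penn Treebank verb tags, and the requirement that the Hindi span be contiguous are all assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple

# Penn Treebank verb tags (assumed tagset for the English side).
VERB_TAGS = {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}


def hypothesize_cps(
    en_tags: List[str],
    hi_tokens: List[str],
    alignment: Set[Tuple[int, int]],
) -> List[Tuple[int, List[str]]]:
    """Return (english_index, hindi_tokens) for every English verb whose
    alignment links cover a contiguous multi-word Hindi sequence."""
    # Group the Hindi positions linked to each English position.
    targets: Dict[int, List[int]] = {}
    for en_i, hi_j in alignment:
        targets.setdefault(en_i, []).append(hi_j)

    candidates = []
    for en_i in sorted(targets):
        if en_tags[en_i] not in VERB_TAGS:
            continue  # only English verbs can project a CP hypothesis
        span = sorted(set(targets[en_i]))
        is_multiword = len(span) >= 2
        # Contiguity is an assumption of this sketch, not a claim about the paper.
        is_contiguous = span[-1] - span[0] == len(span) - 1
        if is_multiword and is_contiguous:
            candidates.append((en_i, [hi_tokens[j] for j in span]))
    return candidates


# Hypothetical sentence pair: "He helped me" / "usne meri madad ki".
en_tags = ["PRP", "VBD", "PRP"]
hi_tokens = ["usne", "meri", "madad", "ki"]
alignment = {(0, 0), (2, 1), (1, 2), (1, 3)}
cps = hypothesize_cps(en_tags, hi_tokens, alignment)
```

In this made-up example the English verb "helped" is linked to the two Hindi words "madad ki" (a noun-verb composite), so it is the only candidate returned. Classifying a candidate as AV, NV, Adv-V, or VV would additionally need the POS tags projected onto the Hindi side, which this sketch leaves out.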

References (9)
Manindra K. Verma. Complex predicates in South Asian languages. Manohar (1993).
Ivan A. Sag, Timothy Baldwin, Francis Bond, Ann Copestake, Dan Flickinger. Multiword Expressions: A Pain in the Neck for NLP. International Conference on Computational Linguistics, pp. 1-15 (2002). DOI: 10.1007/3-540-45715-1_1
Mitch Marcus, Beatrice Santorini, Mary Ann Marcinkiewicz. Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, vol. 19, pp. 313-330 (1993). DOI: 10.21236/ADA273556
Franz Josef Och, Hermann Ney. Improved statistical alignment models. Proceedings of the 38th Annual Meeting on Association for Computational Linguistics - ACL '00, pp. 440-447 (2000). DOI: 10.3115/1075218.1075274
David Yarowsky, Grace Ngai. Inducing multilingual POS taggers and NP bracketers via robust projection across aligned corpora. Second meeting of the North American Chapter of the Association for Computational Linguistics on Language technologies 2001 - NAACL '01, pp. 1-8 (2001). DOI: 10.3115/1073336.1073362
Beatrice Santorini, Mary Ann Marcinkiewicz, Mitchell P. Marcus. Building a Large Annotated Corpus of English: The Penn Treebank. Computational Linguistics (1993). DOI: 10.5555/972470.972475
Dekang Lin. Automatic Identification of Non-compositional Phrases. Meeting of the Association for Computational Linguistics, pp. 317-324 (1999). DOI: 10.3115/1034678.1034730
Miriam Butt, Wilhelm Geuder. Light verbs in Urdu and grammaticalization. Words in Time, pp. 295-350 (2003). DOI: 10.1515/9783110899979.295
Eric Brill. Some Advances in Transformation-Based Part of Speech Tagging. arXiv: Computation and Language (1994).