Using Cross-Lingual Projections to Generate Semantic Role Labeled Annotated Corpus for Urdu - A Resource Poor Language

Authors: Smruthi Mukund, Rohini Srihari, Debanjan Ghosh

DOI:

Keywords: Natural language processing, Syntax, Word (computer architecture), Cross lingual, Artificial intelligence, PropBank, Resource poor, Annotation, Urdu, Scale (map), Computer science

Abstract: In this paper we explore the possibility of using cross-lingual projections to automatically induce role-semantic annotations in the PropBank paradigm for Urdu, a resource-poor language. This technique provides annotation projections based on word alignments. It is relatively inexpensive and has the potential to reduce the human effort involved in creating semantic role resources. The projection model exploits lexical as well as syntactic information from an English-Urdu parallel corpus. We show that our method generates reasonably good annotations, with an accuracy of 92% on short structured sentences. Using the automatically generated annotated corpus, we conduct preliminary experiments to create a semantic role labeler for Urdu. The results, though modest, are promising and indicate the potential to generate large-scale annotated corpora for Urdu.
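The core idea of projecting annotations across word alignments can be illustrated with a minimal sketch. This is not the authors' projection model, which also uses lexical and syntactic information; it only shows label transfer through alignment links. The function name `project_roles`, the `"O"` placeholder for unlabeled tokens, the tie-breaking rule, and the example sentence pair are all assumptions made for illustration.

```python
from collections import defaultdict


def project_roles(source_roles, alignments, target_length):
    """Copy role labels from source tokens to aligned target tokens.

    source_roles:  dict mapping source token index -> role label (e.g. "ARG0")
    alignments:    iterable of (source_index, target_index) pairs
    target_length: number of tokens in the target sentence
    """
    # Collect every label that reaches each target token through an alignment link.
    votes = defaultdict(list)
    for src, tgt in alignments:
        label = source_roles.get(src)
        if label is not None:
            votes[tgt].append(label)

    # Unaligned target tokens keep the placeholder "O" (an assumed convention).
    projected = ["O"] * target_length
    for tgt, labels in votes.items():
        # If several source tokens align to one target token, keep the most
        # frequent label; ties go to the label seen first.
        projected[tgt] = max(labels, key=labels.count)
    return projected


# Hypothetical example: English "Ali bought a book" and romanized Urdu
# "Ali ne kitab kharidi" (subject-object-verb order).
english_roles = {0: "ARG0", 1: "PRED", 2: "ARG1", 3: "ARG1"}
links = [(0, 0), (1, 3), (3, 2)]  # Ali-Ali, bought-kharidi, book-kitab
urdu_roles = project_roles(english_roles, links, target_length=4)
```

In this toy case the case marker "ne" has no alignment link and therefore receives no role label, which hints at why a purely alignment-based transfer benefits from additional syntactic information.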
