Developing language technology tools and resources for a resource-poor language: Sindhi

Author: Raveesh Motlani

DOI: 10.18653/V1/N16-2008

Keywords: Resource poor, Computer science, World Wide Web, Sindhi, Artificial intelligence, Transliteration, Natural language processing, Language industry, Language technology

Abstract: Sindhi, an Indo-Aryan language with more than 75 million native speakers, is a resource-poor language in terms of the availability of language technology tools and resources. In this thesis, we discuss the approaches taken to develop resources and tools for a resource-poor language, with special focus on Sindhi. The major contributions of this work include raw and annotated datasets, a POS Tagger, a Morphological Analyser, a Transliteration System, and a Machine Translation System.

References (26)
Chris Callison-Burch, David Yarowsky, Ann Irvine, Alexandre Klementiev. Toward Statistical Machine Translation without Parallel Corpora. Conference of the European Chapter of the Association for Computational Linguistics, pp. 130-140 (2012).
Riyaz Ahmad Bhat, Dipti Misra Sharma. Dependency Treebank of Urdu and its Evaluation. Linguistic Annotation Workshop, pp. 157-165 (2012).
Chris Callison-Burch, Ann Irvine. Combining Bilingual and Comparable Corpora for Low Resource Machine Translation. Workshop on Statistical Machine Translation, pp. 262-270 (2013).
Sittichai Jiampojamarn, Grzegorz Kondrak, Shane Bergsma, Qing Dou, Kenneth Dwyer, Aditya Bhargava, Mi-Young Kim. Transliteration Generation and Mining with Limited Training Resources. Meeting of the Association for Computational Linguistics, pp. 39-47 (2010).
Mitesh M. Khapra, A Kumaran, Haizhou Li. Report of NEWS 2010 Transliteration Mining Shared Task. Meeting of the Association for Computational Linguistics, pp. 21-28 (2010).
Jakob Uszkoreit, Oscar Täckström, Ryan McDonald. Cross-lingual Word Clusters for Direct Transfer of Linguistic Structure. North American Chapter of the Association for Computational Linguistics, pp. 477-487 (2012).
Franz Josef Och, Hermann Ney. Improved statistical alignment models. Proceedings of the 38th Annual Meeting on Association for Computational Linguistics - ACL '00, pp. 440-447 (2000). DOI: 10.3115/1075218.1075274
Aarne Ranta. GF: A Multilingual Grammar Formalism. Language and Linguistics Compass, vol. 3, pp. 1242-1265 (2009). DOI: 10.1111/J.1749-818X.2009.00155.X
David Yarowsky, Grace Ngai, Richard Wicentowski. Inducing multilingual text analysis tools via robust projection across aligned corpora. Proceedings of the first international conference on Human language technology research - HLT '01, pp. 1-8 (2001). DOI: 10.3115/1072133.1072187
Javed Ahmed Mahar, Ghulam Qadir Memon. Rule Based Part of Speech Tagging of Sindhi Language. International Conference on Signal Acquisition and Processing, pp. 101-106 (2010). DOI: 10.1109/ICSAP.2010.27