Maximum Entropy Based Restoration of Arabic Diacritics

Authors: Imed Zitouni, Jeffrey S. Sorensen, Ruhi Sarikaya

DOI: 10.3115/1220175.1220248

Abstract: Short vowels and other diacritics are not part of written Arabic scripts. Exceptions are made for important political and religious texts and in scripts for beginning students of Arabic. Script without diacritics has considerable ambiguity because many words with different diacritic patterns appear identical in a diacritic-less setting. We propose in this paper a maximum entropy approach for restoring diacritics in a document. The approach can easily integrate and make effective use of diverse types of information; the model we propose integrates a wide array of lexical, segment-based and part-of-speech tag features. The combination of these feature types leads to a state-of-the-art diacritization model. Using a publicly available corpus (LDC's Arabic Treebank Part 3), we achieve a diacritic error rate of 5.1%, a segment error rate of 8.5%, and a word error rate of 17.3%. In the case-ending-less setting, we obtain a diacritic error rate of 2.2%, a segment error rate of 4.0%, and a word error rate of 7.2%.
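
The ambiguity and the error rates mentioned in the abstract can be illustrated with a small Python sketch. It is not the authors' code: the function names, the Unicode range used for diacritics (U+064B to U+0652), and the metric definitions (diacritic error rate as the fraction of base letters whose attached marks differ from the reference, word error rate as the fraction of whitespace-delimited words with at least one such error) are assumptions made for illustration. The segment error rate is left out because it requires a morphological segmentation that this record does not describe.

```python
# Assumption: the diacritics of interest are the combining marks U+064B..U+0652
# (tanween, short vowels, shadda, sukun).
DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0653)}

def strip_diacritics(text: str) -> str:
    """Remove every diacritic mark, leaving only the base letters."""
    return "".join(ch for ch in text if ch not in DIACRITICS)

def split_letters(word: str) -> list[tuple[str, str]]:
    """Pair each base letter with the (possibly empty) marks that follow it."""
    pairs: list[tuple[str, str]] = []
    for ch in word:
        if ch in DIACRITICS and pairs:
            base, marks = pairs[-1]
            pairs[-1] = (base, marks + ch)
        else:
            pairs.append((ch, ""))
    return pairs

def error_rates(reference: str, hypothesis: str) -> tuple[float, float]:
    """Return (diacritic error rate, word error rate) under the assumed definitions."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    if len(ref_words) != len(hyp_words):
        raise ValueError("reference and hypothesis must have the same word count")
    letter_total = letter_errors = word_errors = 0
    for ref, hyp in zip(ref_words, hyp_words):
        ref_pairs, hyp_pairs = split_letters(ref), split_letters(hyp)
        if [b for b, _ in ref_pairs] != [b for b, _ in hyp_pairs]:
            raise ValueError("base letters differ; only diacritics may change")
        wrong = sum(rm != hm for (_, rm), (_, hm) in zip(ref_pairs, hyp_pairs))
        letter_total += len(ref_pairs)
        letter_errors += wrong
        word_errors += wrong > 0
    return letter_errors / letter_total, word_errors / len(ref_words)

# Two diacritized forms of the same consonant skeleton k-t-b:
KATABA = "\u0643\u064E\u062A\u064E\u0628\u064E"  # "he wrote"
KUTIBA = "\u0643\u064F\u062A\u0650\u0628\u064E"  # "it was written"

# Without diacritics the two words are indistinguishable.
same_skeleton = strip_diacritics(KATABA) == strip_diacritics(KUTIBA)

# Scoring KUTIBA against KATABA: two of three letters carry the wrong mark.
der, wer = error_rates(KATABA, KUTIBA)
```

Here `same_skeleton` is true, which is the ambiguity a restoration model has to resolve from context, and the single-word comparison gives a diacritic error rate of 2/3 and a word error rate of 1.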
