Authors: Imed Zitouni, Jeffrey S. Sorensen, Ruhi Sarikaya
Keywords:
Abstract: Short vowels and other diacritics are not part of written Arabic scripts. Exceptions are made for important political and religious texts and for scripts intended for beginning students of Arabic. Script without diacritics has considerable ambiguity because many words with different diacritic patterns appear identical in a diacritic-less setting. In this paper we propose a maximum entropy approach for restoring diacritics in a document. The approach can easily integrate and make effective use of diverse types of information; the model we propose integrates a wide array of lexical, segment-based, and part-of-speech tag features. The combination of these feature types leads to a state-of-the-art diacritization model. Using a publicly available corpus (LDC's Arabic Treebank Part 3), we achieve a diacritic error rate of 5.1%, a segment error rate of 8.5%, and a word error rate of 17.3%. In the case-ending-less setting, we obtain a diacritic error rate of 2.2%, a segment error rate of 4.0%, and a word error rate of 7.2%.