A discriminative approach to filter out noisy sentence pairs from bilingual corpora

Authors: Kaveh Taghipour, Nasim Afhami, Shahram Khadivi, Saeed Shiry

DOI: 10.1109/ISTEL.2010.5734083

Abstract: Parallel corpora are essential for training statistical machine translation models. Since parallel sentence-aligned corpora are usually noisy due to inexact automatic methods when generated from parallel or comparable documents, we need clean corpora. In this paper, new features are introduced to assess the correctness of a sentence pair. Also, the impact of the new features in combination with state-of-the-art features in the literature is systematically evaluated. Statistical machine translation models have been used for feature extraction, and therefore the approach is independent of language. In order to better understand the problem characteristics, four supervised classification algorithms are used to classify sentence pairs as noise or parallel. Evaluating the models by taking accuracy and F-measure into account shows that, using the system for cleaning a Farsi-English corpus, the maximum entropy model performs better than the other models. Compared with the main filtering techniques in the literature, the approach in this paper shows significant improvement over two other systems.

References (25)
William A. Gale, Kenneth W. Church, A program for aligning sentences in bilingual corpora, Computational Linguistics, vol. 19, pp. 75-102 (1993), DOI: 10.5555/972450.972455
I. Dan Melamed, Bitext maps and alignment via pattern recognition, Computational Linguistics, vol. 25, pp. 107-130 (1999)
Shahram Khadivi, Hermann Ney, Automatic Filtering of Bilingual Corpora for Statistical Machine Translation, Natural Language Processing and Information Systems, pp. 263-274 (2005), DOI: 10.1007/11428817_24
Haizhou Li, Min Zhang, Aiti Aw, Lianhau Lee, EM-based Hybrid Model for Bilingual Terminology Extraction from Comparable Corpora, International Conference on Computational Linguistics, pp. 639-646 (2010)
Andreas Stolcke, SRILM – An Extensible Language Modeling Toolkit, Conference of the International Speech Communication Association (2002)
Robert C. Moore, Fast and Accurate Sentence Alignment of Bilingual Corpora, Conference of the Association for Machine Translation in the Americas, pp. 135-144 (2002), DOI: 10.1007/3-540-45820-4_14
Sadaf Abdul-Rauf, Holger Schwenk, On the Use of Comparable Corpora to Improve SMT performance, Meeting of the Association for Computational Linguistics, pp. 16-23 (2009), DOI: 10.3115/1609067.1609068
Franz Josef Och, Hermann Ney, Improved statistical alignment models, Proceedings of the 38th Annual Meeting on Association for Computational Linguistics - ACL '00, pp. 440-447 (2000), DOI: 10.3115/1075218.1075274
Vincent J. Della Pietra, Stephen A. Della Pietra, Robert L. Mercer, Peter F. Brown, The mathematics of statistical machine translation: parameter estimation, Computational Linguistics, vol. 19, pp. 263-311 (1993)
Jiang Chen, Jian-Yun Nie, Automatic construction of parallel English-Chinese corpus for cross-language information retrieval, Conference on Applied Natural Language Processing, pp. 21-28 (2000), DOI: 10.3115/974147.974151