Authors: Kaveh Taghipour, Nasim Afhami, Shahram Khadivi, Saeed Shiry
DOI: 10.1109/ISTEL.2010.5734083
Keywords:
Abstract: Parallel corpora are essential for training statistical machine translation models. Since parallel sentence-aligned corpora are usually noisy due to inexact automatic methods when generated from parallel or comparable documents, we need to clean the corpora. In this paper, new features are introduced to assess the correctness of a sentence pair. Also, the impact of these features in combination with state-of-the-art features from the literature is systematically evaluated. Statistical machine translation models have been used for feature extraction, and therefore the approach is independent of language. In order to better understand the problem characteristics, four supervised classification algorithms are used to classify sentence pairs as noise or parallel. Evaluating the models by taking accuracy and F-measure into account shows that, using the system for cleaning a Farsi-English parallel corpus, the maximum entropy model performs better than the other models. The main filtering techniques of the paper show a significant improvement over two other systems.