Automatic Parallel Fragment Extraction from Noisy Data

Authors: Daniel Marcu, Jason Riesa

Abstract: We present a novel method to detect parallel fragments within noisy parallel corpora. Isolating these parallel fragments from the noisy data in which they are contained frees us from noisy alignments and stray links that can severely constrain translation-rule extraction. We do this with existing machinery, making use of an existing word alignment model for this task. We evaluate the quality and utility of the extracted data on large-scale Chinese-English and Arabic-English translation tasks and show significant improvements over a state-of-the-art baseline.

References (11)
Pascale Fung, Percy Cheung, Mining Very-Non-Parallel Corpora: Parallel Sentence and Lexicon Extraction via Bootstrapping and E. Empirical Methods in Natural Language Processing, pp. 57-63 (2004)
Arul Menezes, Chris Quirk, Raghavendra Udupa U, Generative Models of Noisy Translations with Applications to Parallel Fragment Extraction. European Association for Machine Translation (2007)
Daniel Marcu, Jason Riesa, Ann Irvine, Feature-Rich Language-Independent Syntax-Based Alignment for Statistical Machine Translation. Empirical Methods in Natural Language Processing, pp. 497-507 (2011)
Dekai Wu, Pascale Fung, Inversion Transduction Grammar Constraints for Mining Parallel Sentences from Quasi-Comparable Corpora. Lecture Notes in Computer Science, vol. 3651, pp. 257-268 (2005), DOI: 10.1007/11562214_23
Michel Galley, Mark Hopkins, Kevin Knight, Daniel Marcu, What’s in a translation rule? North American Chapter of the Association for Computational Linguistics, pp. 273-280 (2004), DOI: 10.21236/ADA460212
Moshe Dubiner, Ashok Popat, Jay Ponte, Jakob Uszkoreit, Large Scale Parallel Document Mining for Machine Translation. International Conference on Computational Linguistics, pp. 1101-1109 (2010)
Bing Zhao, S. Vogel, Adaptive parallel sentences mining from web bilingual news collection. International Conference on Data Mining, pp. 745-748 (2002), DOI: 10.1109/ICDM.2002.1184044
David Chiang, Yuval Marton, Philip Resnik, Online large-margin training of syntactic and structural translation features. Proceedings of the Conference on Empirical Methods in Natural Language Processing - EMNLP '08, pp. 224-233 (2008), DOI: 10.3115/1613715.1613747
Slav Petrov, Leon Barrett, Romain Thibaux, Dan Klein, Learning Accurate, Compact, and Interpretable Tree Annotation. Meeting of the Association for Computational Linguistics, pp. 433-440 (2006), DOI: 10.3115/1220175.1220230
Dragos Stefan Munteanu, Daniel Marcu, Extracting Parallel Sub-Sentential Fragments from Non-Parallel Corpora. Meeting of the Association for Computational Linguistics, pp. 81-88 (2006), DOI: 10.3115/1220175.1220186