Authors: Andy Way, Peyman Passban, Dimitar Sht. Shterionov, Alberto Poncelas, Gideon Maillette de Buy Wenniger
DOI:
Keywords:
Abstract: A prerequisite for training corpus-based machine translation (MT) systems, whether Statistical MT (SMT) or Neural MT (NMT), is the availability of high-quality parallel data. This is arguably more important today than ever before, as NMT has been shown in many studies to outperform SMT, but mostly when large-scale corpora are available; in cases where data is limited, SMT can still outperform NMT. Recently, researchers have shown that back-translating monolingual data can be used to create synthetic parallel corpora, which in turn, in combination with authentic parallel data, can be used to train a high-quality NMT system. Given that large collections of new parallel text become available only quite rarely, back-translation has become the norm when building state-of-the-art NMT systems, especially in resource-poor scenarios. However, we assert that there are still unknown factors regarding the actual effects of back-translated data on the translation capabilities of an NMT model. Accordingly, in this work we investigate how using back-translated data as a training corpus, both as a separate standalone dataset and in combination with human-generated parallel data, affects the performance of an NMT model. We use incrementally larger amounts of back-translated data to train a range of German-to-English NMT systems and analyse the resulting translation performance.
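To make the back-translation setup described above concrete, here is a minimal sketch (not from the paper itself) of how a synthetic parallel corpus is built from target-side monolingual text and mixed with authentic data in incrementally larger amounts. The function names and the `translate_en_to_de` stub are hypothetical stand-ins; in practice the reverse-direction translation would come from a trained English-to-German MT model, since the studied direction is German-to-English.

```python
from typing import Callable, List, Tuple

# A parallel corpus is a list of (source, target) sentence pairs.
ParallelCorpus = List[Tuple[str, str]]


def back_translate(
    monolingual_targets: List[str],
    reverse_translate: Callable[[str], str],
) -> ParallelCorpus:
    """Create synthetic (source, target) pairs by translating target-side
    monolingual sentences with a reverse-direction (target-to-source) model."""
    return [(reverse_translate(t), t) for t in monolingual_targets]


def build_training_corpus(
    authentic: ParallelCorpus,
    synthetic: ParallelCorpus,
    synthetic_fraction: float,
) -> ParallelCorpus:
    """Combine authentic human-generated pairs with a slice of the synthetic
    (back-translated) pairs; varying synthetic_fraction gives the
    incrementally larger training sets the paper describes."""
    n = int(len(synthetic) * synthetic_fraction)
    return authentic + synthetic[:n]


# Hypothetical stub: for German-to-English NMT, back-translation runs the
# opposite direction, turning English monolingual text into synthetic German.
def translate_en_to_de(sentence: str) -> str:
    raise NotImplementedError("plug in a trained English-to-German model")
```

Under this sketch, a standalone synthetic training set corresponds to calling `build_training_corpus([], synthetic, 1.0)`, while the combined condition mixes the authentic corpus with successively larger values of `synthetic_fraction`.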