Authors: Ciprian Chelba, Jeff Klingner, Junpei Zhou, Hideto Kazawa, Mengmeng Niu
DOI:
Keywords:
Abstract: The paper investigates the feasibility of confidence estimation for neural machine translation models operating at the high end of the performance spectrum. As a side product of the data annotation process necessary for building such models, we propose sentence-level accuracy $SACC$ as a simple, self-explanatory evaluation metric for translation quality. Experiments on two different annotator pools, one comprised of non-expert (crowd-sourced) and one of expert (professional) translators, show that $SACC$ can vary greatly depending on the translation proficiency of the annotators, despite the fact that both pools are about equally reliable according to Krippendorff's alpha metric; the relatively low values of inter-annotator agreement confirm the expectation that sentence-level binary labeling $good$ / $needs\ work$ out of context is very hard.

For an English-Spanish model at $SACC = 0.89$ according to the non-expert annotator pool, we can derive a confidence estimate that labels 0.5-0.6 of the $good$ translations in an "in-domain" test set with 0.95 Precision. Switching to the expert annotator pool decreases $SACC$ dramatically: $0.61$ for English-Spanish, measured on the exact same data as above. This forces us to lower the confidence-estimation (CE) operating point to 0.9 Precision, while correctly labeling only 0.20-0.25 of the $good$ translations in the data.

We find surprising the extent to which confidence estimation depends on the annotator pool used for labeling, and this leads to an important recommendation we wish to make when tackling CE modeling in practice: it is critical to match the end-user expectation for translation quality in the desired domain with the demands placed on the annotators assigning the binary quality labels to the CE training data.
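To make the two quantities in the abstract concrete, below is a minimal sketch (not the paper's implementation) of how $SACC$ and a precision-constrained CE operating point could be computed from binary annotations and model confidence scores. The function names `sacc` and `operating_point` are hypothetical, introduced here for illustration only.

```python
import numpy as np

def sacc(labels):
    """Sentence-level accuracy (SACC): the fraction of translations
    annotated as 'good' (1) vs. 'needs work' (0), one label per sentence,
    e.g. a majority vote over an annotator pool."""
    return float(np.mean(labels))

def operating_point(scores, labels, target_precision):
    """Pick the deepest confidence threshold whose precision on the
    'good' class still meets `target_precision`; report the fraction of
    good translations labeled correctly at that point (coverage).

    Returns (threshold, precision, coverage), or None if the target
    precision is unreachable on this data."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    order = np.argsort(-scores)                # most confident first
    sorted_labels = labels[order]
    tp = np.cumsum(sorted_labels)              # true positives at each cutoff
    k = np.arange(1, len(labels) + 1)          # sentences accepted so far
    precision = tp / k
    feasible = np.where(precision >= target_precision)[0]
    if len(feasible) == 0:
        return None
    cut = feasible[-1]                         # max coverage meeting the target
    coverage = tp[cut] / sorted_labels.sum()
    return scores[order][cut], precision[cut], coverage

# Toy usage with synthetic data (illustrative numbers only).
rng = np.random.default_rng(0)
labels = rng.binomial(1, 0.89, size=1000)      # pool with SACC ~ 0.89
scores = labels * rng.uniform(0.4, 1.0, 1000) + (1 - labels) * rng.uniform(0.0, 0.7, 1000)
print("SACC:", sacc(labels))
print("operating point @0.95 Precision:", operating_point(scores, labels, 0.95))
```

Under this reading, the abstract's headline numbers correspond to `coverage` of 0.5-0.6 at 0.95 Precision for the non-expert pool, dropping to 0.20-0.25 at a relaxed 0.9 Precision once the stricter expert labels lower $SACC$ to 0.61.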