Authors: Laurent Charlin, Joelle Pineau, Iulian V. Serban, Ryan Lowe, Michael Noseworthy
DOI:
Keywords:
Abstract: We investigate evaluation metrics for dialogue response generation systems where supervised labels, such as task completion, are not available. Recent works in response generation have adopted metrics from machine translation to compare a model's generated response to a single target response. We show that these metrics correlate very weakly with human judgements in the non-technical Twitter domain, and not at all in the technical Ubuntu domain. We provide quantitative and qualitative results highlighting specific weaknesses in existing metrics, and provide recommendations for future development of better automatic evaluation metrics for dialogue systems.