How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation

Authors: Laurent Charlin, Joelle Pineau, Iulian V. Serban, Ryan Lowe, Michael Noseworthy

Abstract: We investigate evaluation metrics for dialogue response generation systems where supervised labels, such as task completion, are not available. Recent works in response generation have adopted metrics from machine translation to compare a model's generated response to a single target response. We show that these metrics correlate very weakly with human judgements in the non-technical Twitter domain, and not at all in the technical Ubuntu domain. We provide quantitative and qualitative results highlighting specific weaknesses in existing metrics, and provide recommendations for future development of better automatic evaluation metrics for dialogue systems.
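The kind of analysis the abstract describes, scoring a generated response against a single target response with a word-overlap metric and then checking how well those scores track human ratings, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names are hypothetical, clipped unigram precision is a simplified stand-in for BLEU (no higher-order n-grams, no brevity penalty), and the example responses and human ratings are invented toy data, not taken from the paper.

```python
from collections import Counter
from math import sqrt


def unigram_precision(candidate: str, reference: str) -> float:
    """Clipped unigram precision: a simplified stand-in for BLEU-1."""
    cand = candidate.lower().split()
    ref_counts = Counter(reference.lower().split())
    if not cand:
        return 0.0
    # Each candidate token is credited at most as often as it occurs in the reference.
    overlap = sum(min(n, ref_counts[tok]) for tok, n in Counter(cand).items())
    return overlap / len(cand)


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys) -> float:
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy) if sx and sy else 0.0


if __name__ == "__main__":
    # Invented toy data: (generated response, single target response, human rating 1-5).
    examples = [
        ("try rebooting the machine", "have you tried restarting it", 4),
        ("i do not know", "run sudo apt-get update first", 1),
        ("run sudo apt-get update", "run sudo apt-get update first", 5),
        ("that is so funny", "haha that made my day", 4),
    ]
    metric_scores = [unigram_precision(gen, ref) for gen, ref, _ in examples]
    human_scores = [rating for _, _, rating in examples]
    print("metric scores:", metric_scores)
    print("Spearman correlation with human ratings:", spearman(metric_scores, human_scores))
```

Note that the first and last toy pairs are reasonable replies that share no words with their targets, so an overlap metric scores them at zero regardless of the human rating. This is the general failure mode of single-reference overlap metrics that the abstract points to.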
