The Gutenberg Dialogue Dataset.

Authors: Gábor Recski, Richard Csaky

DOI:

Keywords: Artificial intelligence, Process (engineering), Heuristics, Portuguese, Natural language processing, German, Computer science, Pipeline (software), Quality (business), Error analysis

Abstract: Large datasets are essential for neural modeling of many NLP tasks. Current publicly available open-domain dialogue datasets offer a trade-off between quality (e.g., DailyDialog) and size (e.g., Opensubtitles). We narrow this gap by building a high-quality dataset of 14.8M utterances in English, and smaller datasets in German, Dutch, Spanish, Portuguese, Italian, and Hungarian. We extract and process dialogues from public-domain books made available by Project Gutenberg. We describe our dialogue extraction pipeline, analyze the effects of the various heuristics used, and present an error analysis of extracted dialogues. Finally, we conduct experiments showing that better response quality can be achieved in zero-shot and finetuning settings by training on our data than on the larger but much noisier Opensubtitles dataset. Our open-source pipeline (https://github.com/ricsinaruto/gutenberg-dialog) can be extended to further languages with little additional effort. Researchers can also build their versions of existing datasets by adjusting various trade-off parameters.

References (37)
Jörg Tiedemann, Parallel Data, Tools and Interfaces in OPUS. Language Resources and Evaluation, pp. 2214-2218 (2012)
Diederik P. Kingma, Jimmy Ba, Adam: A Method for Stochastic Optimization. arXiv: Learning (2014)
Kishore Papineni, Salim Roukos, Todd Ward, Wei-Jing Zhu, BLEU. Proceedings of the 40th Annual Meeting on Association for Computational Linguistics - ACL '02, pp. 311-318 (2001), 10.3115/1073083.1073135
Cristian Danescu-Niculescu-Mizil, Lillian Lee, Chameleons in Imagined Conversations: A New Approach to Understanding Coordination of Linguistic Style in Dialogs. Meeting of the Association for Computational Linguistics, pp. 76-87 (2011)
J.J. Godfrey, E.C. Holliman, J. McDaniel, SWITCHBOARD: telephone speech corpus for research and development. International Conference on Acoustics, Speech, and Signal Processing, vol. 1, pp. 517-520 (1992), 10.1109/ICASSP.1992.225858
Laurent Charlin, Joelle Pineau, Iulian V. Serban, Ryan Lowe, Michael Noseworthy, Chia-Wei Liu, How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation. arXiv: Computation and Language (2016)
Jiwei Li, Will Monroe, Tianlin Shi, Sébastien Jean, Alan Ritter, Dan Jurafsky, Adversarial Learning for Neural Dialogue Generation. Empirical Methods in Natural Language Processing, pp. 2157-2169 (2017), 10.18653/V1/D17-1230
Joachim Fainberg, Emmanuel Kahembwe, Marco Damonte, Jianpeng Cheng, Daniel Duma, Bonnie L. Webber, Federico Fancellu, Ben Krause, Mihai Dobre, Edina: Building an Open Domain Socialbot with Self-dialogues. arXiv: Computation and Language (2017)
Ashwin Ram, Rohit Prasad, Chandra Khatri, Eric King, Ashish Nagar, Gene Hwang, Raefer Gabriel, Han Song, Anushree Venkatesh, Ming Cheng, Yi Pan, Art Pettigrue, Jeff Nunn, Sk Jayadevan, Amanda Wartick, Behnam Hedayatnia, Qing Liu, Kate Bland, Conversational AI: The Science Behind the Alexa Prize. arXiv: Artificial Intelligence (2017)
Yu Wu, Wei Wu, Can Xu, Towards Explainable and Controllable Open Domain Dialogue Generation with Dialogue Acts. arXiv: Computation and Language (2018)