Authors: Gábor Recski, Richard Csaky
DOI:
Keywords: Artificial intelligence, Process (engineering), Heuristics, Portuguese, Natural language processing, German, Computer science, Pipeline (software), Quality (business), Error analysis
Abstract: Large datasets are essential for neural modeling of many NLP tasks. Current publicly available open-domain dialogue datasets offer a trade-off between quality (e.g., DailyDialog) and size (e.g., Opensubtitles). We narrow this gap by building a high-quality dataset of 14.8M utterances in English, and smaller datasets in German, Dutch, Spanish, Portuguese, Italian, and Hungarian. We extract and process dialogues from public-domain books made available by Project Gutenberg. We describe our dialogue extraction pipeline, analyze the effects of the various heuristics used, and present an error analysis of extracted dialogues. Finally, we conduct experiments showing that better response quality can be achieved in both zero-shot and finetuning settings by training on our data than on the larger but much noisier Opensubtitles dataset. Our open-source pipeline (https://github.com/ricsinaruto/gutenberg-dialog) can be extended to further languages with little additional effort. Researchers can also build their own versions of existing datasets by adjusting various trade-off parameters.
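To make the extraction idea concrete, the sketch below shows one plausible quote-delimiter heuristic of the kind the abstract alludes to: quoted spans in book text are treated as utterances, and nearby quotes are grouped into a dialogue. This is a minimal illustration only; the function name `extract_dialogues`, the regex, and the `max_gap` threshold are assumptions for exposition, not the actual gutenberg-dialog implementation.

```python
import re

# Assumed heuristic: English books typically delimit speech with double quotes.
QUOTE_RE = re.compile(r'"([^"]+)"')

def extract_dialogues(text, max_gap=2):
    """Group quoted utterances into dialogues (illustrative sketch).

    Consecutive quotes separated by at most `max_gap` lines of narration
    are treated as turns of the same conversation; the threshold is an
    assumption, not a value from the paper.
    """
    dialogues, current, gap = [], [], 0
    for line in text.splitlines():
        quotes = QUOTE_RE.findall(line)
        if quotes:
            current.extend(q.strip() for q in quotes)
            gap = 0
        else:
            gap += 1
            if gap > max_gap:
                # Keep only multi-turn exchanges; single quotes are noise.
                if len(current) > 1:
                    dialogues.append(current)
                current = []
    if len(current) > 1:
        dialogues.append(current)
    return dialogues

sample = '''"Where are you going?" asked Tom.
"To the market," said Ann.

The sun was setting over the hills.'''
print(extract_dialogues(sample))
# [['Where are you going?', 'To the market,']]
```

The actual pipeline applies per-language delimiter and post-processing heuristics (the paper analyzes their effects), so quote conventions for languages such as German or Hungarian would need their own patterns; the repository linked above is the authoritative source for those details.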