Divide and Conquer: Crowdsourcing the Creation of Cross-Lingual Textual Entailment Corpora

Authors: Luisa Bentivogli, Yashar Mehdad, Danilo Giampiccolo, Alessandro Marchetti, Matteo Negri

Abstract: We address the creation of cross-lingual textual entailment corpora by means of crowdsourcing. Our goal is to define a cheap and replicable data collection methodology that minimizes the manual work done by expert annotators, without resorting to preprocessing tools or already annotated monolingual datasets. In line with recent works emphasizing the need for large-scale annotation efforts for entailment, our work aims to: i) tackle the scarcity of data available to train and evaluate systems, and ii) promote the recourse to crowdsourcing as an effective way to reduce costs without sacrificing quality. We show that a complex task, for which even experts usually feature low agreement scores, can be effectively decomposed into simple subtasks assigned to non-expert annotators. The resulting dataset, obtained from a pipeline of different jobs routed to Amazon Mechanical Turk, contains more than 1,600 aligned pairs for each combination of texts-hypotheses in English, Italian and German.
