Collecting Highly Parallel Data for Paraphrase Evaluation

Authors: David Chen, William B. Dolan

DOI:

Keywords: Data mining, Paraphrase, Artificial intelligence, Computer science, Measure (data warehouse), Simple (philosophy), Scale (map), Natural language processing, Field (computer science), Machine translation

Abstract: A lack of standard datasets and evaluation metrics has prevented the field of paraphrasing from making the kind of rapid progress enjoyed by the machine translation community over the last 15 years. We address both problems by presenting a novel data collection framework that produces highly parallel text relatively inexpensively and on a large scale. The nature of this data allows us to use simple n-gram comparisons to measure the semantic adequacy and lexical dissimilarity of paraphrase candidates. In addition to being efficient to compute, experiments show that these metrics correlate with human judgments.
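To make the idea of "simple n-gram comparisons" concrete, the following is a minimal sketch rather than the metrics defined in the paper. It assumes whitespace-tokenized sentences, scores lexical dissimilarity as one minus the average n-gram overlap between a candidate and its source sentence, and approximates semantic adequacy by n-gram overlap with a pool of parallel reference paraphrases. The function names (`ngram_set`, `lexical_dissimilarity`, `semantic_adequacy`) and the choices of `max_n` are hypothetical and chosen only for illustration.

```python
def ngram_set(tokens, n):
    """Return the set of distinct n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def lexical_dissimilarity(source, candidate, max_n=4):
    """1 minus the mean fraction of candidate n-grams that also occur in the source.

    Orders n for which the candidate is too short to have any n-gram are skipped.
    Assumption: this is an illustrative score, not the paper's exact definition.
    """
    overlaps = []
    for n in range(1, max_n + 1):
        cand = ngram_set(candidate, n)
        if not cand:
            continue
        overlaps.append(len(cand & ngram_set(source, n)) / len(cand))
    if not overlaps:
        return 0.0
    return 1.0 - sum(overlaps) / len(overlaps)


def semantic_adequacy(candidate, references, max_n=2):
    """Mean fraction of candidate n-grams found in any reference paraphrase.

    A crude BLEU-like proxy: it pools the n-grams of all references and
    applies neither clipping nor a brevity penalty.
    """
    scores = []
    for n in range(1, max_n + 1):
        cand = ngram_set(candidate, n)
        if not cand:
            continue
        pooled = set()
        for ref in references:
            pooled |= ngram_set(ref, n)
        scores.append(len(cand & pooled) / len(cand))
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    source = "a man is slicing a tomato".split()
    references = ["a man is cutting a tomato".split(),
                  "someone slices a tomato".split()]
    candidate = "someone is cutting a tomato".split()
    print(lexical_dissimilarity(source, candidate))
    print(semantic_adequacy(candidate, references))
```

A candidate that copies the source verbatim scores 0 on dissimilarity, so the two scores are meant to be read together: a good paraphrase keeps adequacy high while moving away from the source wording.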

References (32)
Chris Brockett, Stanley Kok. Hitting the Right Paraphrases in Good Time. North American Chapter of the Association for Computational Linguistics, pp. 145-153 (2010).
William B. Dolan, Chris Brockett, Chris Quirk. Monolingual Machine Translation for Paraphrase Generation. Empirical Methods in Natural Language Processing, pp. 142-149 (2004).
Michael Denkowski, Alon Lavie. Exploring Normalization Techniques for Human Judgments of Machine Translation Adequacy Collected Using Amazon Mechanical Turk. North American Chapter of the Association for Computational Linguistics, pp. 57-61 (2010).
Yusuke Shinyama, Satoshi Sekine, Kiyoshi Sudo, Ralph Grishman. Automatic paraphrase acquisition from news articles. International Conference on Human Language Technology Research, pp. 313-318 (2002).
Ann Irvine, Alexandre Klementiev. Using Mechanical Turk to Annotate Lexicons for Less Commonly Used Languages. North American Chapter of the Association for Computational Linguistics, pp. 108-113 (2010).
Philip Resnik, Ben Bederson, Olivia Buzek. Error Driven Paraphrase Annotation using Mechanical Turk. North American Chapter of the Association for Computational Linguistics, pp. 217-221 (2010).
Dekang Lin, Patrick Pantel. DIRT: discovery of inference rules from text. Knowledge Discovery and Data Mining, pp. 323-328 (2001), doi:10.1145/502512.502559.
Bill Dolan, Chris Quirk, Chris Brockett. Unsupervised construction of large paraphrase corpora. Proceedings of the 20th International Conference on Computational Linguistics - COLING '04, pp. 350-356 (2004), doi:10.3115/1220355.1220406.
Ali Ibrahim, Boris Katz, Jimmy Lin. Extracting Structural Paraphrases from Aligned Monolingual Corpora. Meeting of the Association for Computational Linguistics, pp. 57-64 (2003), doi:10.3115/1118984.1118992.