Crowdsourcing for book search evaluation: impact of HIT design on comparative system ranking

Authors: Gabriella Kazai, Jaap Kamps, Marijn Koolen, Natasa Milic-Frayling

DOI: 10.1145/2009916.2009947

Keywords:

Abstract: The evaluation of information retrieval (IR) systems over special collections, such as large book repositories, is out of reach of traditional methods that rely upon editorial relevance judgments. Increasingly, the use of crowdsourcing to collect relevance labels has been regarded as a viable alternative that scales with modest costs. However, crowdsourcing suffers from undesirable worker practices and low quality contributions. In this paper we investigate the design and implementation of effective crowdsourcing tasks in the context of book search evaluation. We observe the impact of aspects of the Human Intelligence Task (HIT) design on the quality of the relevance labels provided by the crowd. We assess the crowd's output in terms of label agreement with a gold standard data set and in terms of the effect of the crowdsourced judgments on the resulting system rankings. This enables us to examine the influence of crowdsourcing on the entire IR evaluation process. Using the test collection and experimental runs of the INEX 2010 Book Track, we find that varying the HIT design, as well as the pooling and document ordering strategies, leads to considerable differences in the quality of the collected labels. We then compare the label sets by the relative system rankings they produce using four performance metrics. System rankings based on MAP and Bpref remain less affected by the different label sets, while Precision@10 and nDCG@10 lead to dramatically different system rankings, especially for labels acquired from HITs with weaker quality controls. Overall, we find that crowdsourcing can be an effective tool for the evaluation of IR systems, provided that care is taken when designing the HITs.
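The ranking comparison described in the abstract can be made concrete with a small example. The following minimal sketch is not the authors' code and uses entirely made-up run and label data: it scores two hypothetical retrieval runs with Precision@10 under a gold-standard label set and under a crowdsourced label set, then compares the two resulting system rankings with Kendall's tau, one standard way to quantify how much a system ranking shifts when the relevance labels change.

```python
# Illustrative sketch only: hypothetical runs and labels, Precision@10 scoring,
# and a Kendall's tau comparison of the two induced system rankings.

def precision_at_k(ranked_doc_ids, relevant_ids, k=10):
    """Fraction of the top-k retrieved documents that are labelled relevant."""
    top_k = ranked_doc_ids[:k]
    return sum(1 for d in top_k if d in relevant_ids) / k

def kendall_tau(rank_a, rank_b):
    """Kendall's tau between two rankings of the same systems (no ties assumed)."""
    pos_a = {s: i for i, s in enumerate(rank_a)}
    pos_b = {s: i for i, s in enumerate(rank_b)}
    systems = list(rank_a)
    concordant = discordant = 0
    for i in range(len(systems)):
        for j in range(i + 1, len(systems)):
            a = pos_a[systems[i]] - pos_a[systems[j]]
            b = pos_b[systems[i]] - pos_b[systems[j]]
            if a * b > 0:
                concordant += 1
            else:
                discordant += 1
    n_pairs = concordant + discordant
    return (concordant - discordant) / n_pairs if n_pairs else 1.0

# Hypothetical retrieval runs: system name -> ranked list of document ids.
runs = {
    "sysA": ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10"],
    "sysB": ["d3", "d9", "d1", "d7", "d2", "d11", "d12", "d4", "d13", "d5"],
}

# Two alternative label sets for the same topic (illustrative ids only).
gold_labels = {"d1", "d2", "d4", "d6"}
crowd_labels = {"d1", "d3", "d7", "d9", "d11"}

def rank_systems(labels):
    """Order systems by Precision@10 computed against the given label set."""
    scores = {name: precision_at_k(docs, labels) for name, docs in runs.items()}
    return sorted(scores, key=scores.get, reverse=True)

rank_gold = rank_systems(gold_labels)
rank_crowd = rank_systems(crowd_labels)
print("Gold ranking: ", rank_gold)
print("Crowd ranking:", rank_crowd)
print("Kendall's tau:", kendall_tau(rank_gold, rank_crowd))
```

With the toy data above the two label sets reverse the order of the two systems, giving a tau of -1; the paper's point is that early-precision metrics such as Precision@10 and nDCG@10 are more prone to such shifts than MAP or Bpref when labels come from loosely controlled HITs.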

References (28)
Gabriella Kazai, Marijn Koolen, Jaap Kamps, Antoine Doucet, Monica Landoni. Overview of the INEX 2010 Book Track: scaling up the evaluation using crowdsourcing. INEX'10: Proceedings of the 9th International Conference on Initiative for the Evaluation of XML Retrieval: Comparative Evaluation of Focused Retrieval, pp. 98-117 (2010). DOI: 10.1007/978-3-642-23577-1_9
Omar Alonso, Ricardo Baeza-Yates. Design and implementation of relevance assessments using crowdsourcing. Lecture Notes in Computer Science, pp. 153-164 (2011). DOI: 10.1007/978-3-642-20161-5_16
Gabriella Kazai. In search of quality in crowdsourcing for search engine evaluation. Lecture Notes in Computer Science, pp. 165-176 (2011). DOI: 10.1007/978-3-642-20161-5_17
Cyril Cleverdon. The Cranfield tests on index language devices. Aslib Proceedings, vol. 19, pp. 47-59 (1997). DOI: 10.1108/EB050097
Rion Snow, Brendan O'Connor, Daniel Jurafsky, Andrew Y. Ng. Cheap and fast---but is it good? Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP '08), pp. 254-263 (2008). DOI: 10.3115/1613715.1613751
Ian Soboroff, Charles Nicholas, Patrick Cahan. Ranking retrieval systems without relevance judgments. Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 66-73 (2001). DOI: 10.1145/383952.383961
Gabriella Kazai, Natasa Milic-Frayling, Jamie Costello. Towards methods for the collective gathering and quality control of relevance assessments. Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 452-459 (2009). DOI: 10.1145/1571941.1572019