Crowdsourcing for book search evaluation: impact of HIT design on comparative system ranking

Authors: Gabriella Kazai, Jaap Kamps, Marijn Koolen, Natasa Milic-Frayling

DOI: 10.1145/2009916.2009947

Keywords:

Abstract: The evaluation of information retrieval (IR) systems over special collections, such as large book repositories, is out of reach of traditional methods that rely upon editorial relevance judgments. Increasingly, the use of crowdsourcing to collect relevance labels has been regarded as a viable alternative that scales with modest costs. However, crowdsourcing suffers from undesirable worker practices and low quality contributions. In this paper we investigate the design and implementation of effective crowdsourcing tasks in the context of book search evaluation. We observe the impact of aspects of the Human Intelligence Task (HIT) design on the quality of the relevance labels provided by the crowd. We assess the crowd's output in terms of label agreement with a gold standard data set and in terms of the effect of the crowdsourced judgments on the resulting system rankings. This enables us to examine the influence of crowdsourcing on the entire IR evaluation process. Using the test collection and experimental runs of the INEX 2010 Book Track, we find that varying the HIT design, as well as the pooling and document ordering strategies, leads to considerable differences in the quality of the collected labels. We then compare the label sets by the relative system rankings they produce using four performance metrics. System rankings based on MAP and Bpref remain less affected by the different label sets, while Precision@10 and nDCG@10 lead to dramatically different system rankings, especially for labels acquired from HITs with weaker quality controls. Overall, we find that crowdsourcing can be an effective tool for the evaluation of IR systems, provided that care is taken when designing the HITs.
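The ranking comparison described in the abstract can be made concrete with a small example. The following minimal sketch is not the authors' code and uses entirely made-up run and label data: it scores two hypothetical retrieval runs with Precision@10 under a gold-standard label set and under a crowdsourced label set, then compares the two resulting system rankings with Kendall's tau, one standard way to quantify how much a system ranking shifts when the relevance labels change.

```python
# Illustrative sketch only: hypothetical runs and labels, Precision@10 scoring,
# and a Kendall's tau comparison of the two induced system rankings.

def precision_at_k(ranked_doc_ids, relevant_ids, k=10):
    """Fraction of the top-k retrieved documents that are labelled relevant."""
    top_k = ranked_doc_ids[:k]
    return sum(1 for d in top_k if d in relevant_ids) / k

def kendall_tau(rank_a, rank_b):
    """Kendall's tau between two rankings of the same systems (no ties assumed)."""
    pos_a = {s: i for i, s in enumerate(rank_a)}
    pos_b = {s: i for i, s in enumerate(rank_b)}
    systems = list(rank_a)
    concordant = discordant = 0
    for i in range(len(systems)):
        for j in range(i + 1, len(systems)):
            a = pos_a[systems[i]] - pos_a[systems[j]]
            b = pos_b[systems[i]] - pos_b[systems[j]]
            if a * b > 0:
                concordant += 1
            else:
                discordant += 1
    n_pairs = concordant + discordant
    return (concordant - discordant) / n_pairs if n_pairs else 1.0

# Hypothetical retrieval runs: system name -> ranked list of document ids.
runs = {
    "sysA": ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10"],
    "sysB": ["d3", "d9", "d1", "d7", "d2", "d11", "d12", "d4", "d13", "d5"],
}

# Two alternative label sets for the same topic (illustrative ids only).
gold_labels = {"d1", "d2", "d4", "d6"}
crowd_labels = {"d1", "d3", "d7", "d9", "d11"}

def rank_systems(labels):
    """Order systems by Precision@10 computed against the given label set."""
    scores = {name: precision_at_k(docs, labels) for name, docs in runs.items()}
    return sorted(scores, key=scores.get, reverse=True)

rank_gold = rank_systems(gold_labels)
rank_crowd = rank_systems(crowd_labels)
print("Gold ranking: ", rank_gold)
print("Crowd ranking:", rank_crowd)
print("Kendall's tau:", kendall_tau(rank_gold, rank_crowd))
```

With the toy data above the two label sets reverse the order of the two systems, giving a tau of -1; the paper's point is that early-precision metrics such as Precision@10 and nDCG@10 are more prone to such shifts than MAP or Bpref when labels come from loosely controlled HITs.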

References (28)
Gabriella Kazai, Marijn Koolen, Jaap Kamps, Antoine Doucet, Monica Landoni. Overview of the INEX 2010 Book Track: scaling up the evaluation using crowdsourcing. INEX'10: Proceedings of the 9th International Conference on Initiative for the Evaluation of XML Retrieval: Comparative Evaluation of Focused Retrieval, pp. 98-117 (2010). DOI: 10.1007/978-3-642-23577-1_9
Omar Alonso, Ricardo Baeza-Yates. Design and implementation of relevance assessments using crowdsourcing. Lecture Notes in Computer Science, pp. 153-164 (2011). DOI: 10.1007/978-3-642-20161-5_16
Gabriella Kazai. In search of quality in crowdsourcing for search engine evaluation. Lecture Notes in Computer Science, pp. 165-176 (2011). DOI: 10.1007/978-3-642-20161-5_17
Cyril Cleverdon. The Cranfield tests on index language devices. Aslib Proceedings, vol. 19, pp. 47-59 (1997). DOI: 10.1108/EB050097
Rion Snow, Brendan O'Connor, Daniel Jurafsky, Andrew Y. Ng. Cheap and fast---but is it good? Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP '08), pp. 254-263 (2008). DOI: 10.3115/1613715.1613751
Ian Soboroff, Charles Nicholas, Patrick Cahan. Ranking retrieval systems without relevance judgments. Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 66-73 (2001). DOI: 10.1145/383952.383961
Gabriella Kazai, Natasa Milic-Frayling, Jamie Costello. Towards methods for the collective gathering and quality control of relevance assessments. Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 452-459 (2009). DOI: 10.1145/1571941.1572019