Authors: Gabriella Kazai, Jaap Kamps, Marijn Koolen, Natasa Milic-Frayling
Keywords:
Abstract: The evaluation of information retrieval (IR) systems over special collections, such as large book repositories, is out of reach of traditional methods that rely upon editorial relevance judgments. Increasingly, the use of crowdsourcing to collect relevance labels has been regarded as a viable alternative that scales with modest costs. However, crowdsourcing suffers from undesirable worker practices and low quality contributions. In this paper we investigate the design and implementation of effective crowdsourcing tasks in the context of book search evaluation. We observe the impact of aspects of the Human Intelligence Task (HIT) design on the quality of the labels provided by the crowd. We assess the crowd's output in terms of label agreement with a gold standard data set and in terms of the effect of the crowdsourced judgments on the resulting system rankings. This enables us to assess the impact of crowdsourcing on the entire IR evaluation process. Using the test collection and experimental runs of the INEX 2010 Book Track, we find that varying the HIT design, as well as the pooling and document ordering strategies, leads to considerable differences in the quality of the collected labels. We then compare the label sets by the relative system rankings they produce, using four performance metrics. System rankings based on MAP and Bpref remain less affected by different label sets, while Precision@10 and nDCG@10 lead to dramatically different rankings, especially for labels acquired from HITs with weaker quality controls. Overall, we find that crowdsourcing can be an effective tool for the evaluation of IR systems, provided that care is taken when designing the HITs.
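To make the top-heavy metrics mentioned in the abstract concrete, here is a minimal sketch (not from the paper) of how Precision@10 and nDCG@10 with binary gains could be computed for a single ranked list; the relevance labels in the usage example are hypothetical, only illustrating how scoring the same run against crowdsourced versus gold labels can shift the result.

```python
import math

def precision_at_k(rels, k=10):
    """Fraction of relevant documents in the top-k of a ranked list.
    `rels` is a list of 0/1 relevance labels in rank order."""
    return sum(rels[:k]) / k

def ndcg_at_k(rels, k=10):
    """nDCG@k with binary gains: DCG of the ranking divided by the DCG
    of an ideal (relevance-sorted) ordering of the same labels."""
    def dcg(labels):
        return sum(rel / math.log2(i + 2) for i, rel in enumerate(labels[:k]))
    ideal = dcg(sorted(rels, reverse=True))
    return dcg(rels) / ideal if ideal > 0 else 0.0

# Hypothetical labels for the same top-10 ranking, judged by two label sets.
crowd_labels = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
gold_labels  = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]

print("crowd:", precision_at_k(crowd_labels), round(ndcg_at_k(crowd_labels), 3))
print("gold: ", precision_at_k(gold_labels), round(ndcg_at_k(gold_labels), 3))
```

Because both metrics depend only on the top of the ranking, even a handful of disagreements between crowdsourced and gold labels can change a system's score noticeably, which is consistent with the abstract's observation that Precision@10 and nDCG@10 are more sensitive to label quality than MAP or Bpref.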