Reliable information retrieval evaluation with incomplete and biased judgements

Authors: Stefan Büttcher, Charles L. A. Clarke, Peter C. K. Yeung, Ian Soboroff

DOI: 10.1145/1277741.1277755

Keywords: Artificial intelligence, Computer science, Ranking, Machine learning, Pooling, Information retrieval, Set (psychology), Quality (business), Ranking (information retrieval)

Abstract: Information retrieval evaluation based on the pooling method is inherently biased against systems that did not contribute to the pool of judged documents. This may distort the results obtained about the relative quality of the systems evaluated and thus lead to incorrect conclusions about the performance of a particular ranking technique. We examine the magnitude of this effect and explore how it can be countered by automatically building an unbiased set of judgements from the original, biased judgements obtained through pooling. We compare the performance of this method with other approaches to the problem of incomplete judgements, such as bpref, and show that the proposed method leads to higher evaluation accuracy, especially if the set of manual judgements is rich in documents, but highly biased against some systems.
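The abstract names bpref as one existing approach to incomplete judgements. As a rough illustration only, and not taken from the paper, the sketch below computes one commonly used formulation of bpref, in which unjudged documents are ignored and each retrieved relevant document is penalised by the number of judged non-relevant documents ranked above it, with min(R, N) as the normaliser. The function name `bpref`, its argument names, and the document ids are hypothetical.

```python
def bpref(ranking, relevant, nonrelevant):
    """Sketch of bpref for a single topic (assumed min(R, N) variant).

    ranking     -- list of document ids in rank order (hypothetical input)
    relevant    -- set of ids judged relevant
    nonrelevant -- set of ids judged non-relevant
    Documents in neither set are unjudged and are skipped entirely.
    """
    R = len(relevant)
    N = len(nonrelevant)
    if R == 0:
        return 0.0

    denom = min(R, N)
    nonrel_seen = 0   # judged non-relevant documents seen so far
    total = 0.0

    for doc in ranking:
        if doc in nonrelevant:
            nonrel_seen += 1
        elif doc in relevant:
            if denom == 0:
                # No judged non-relevant documents: no penalty possible.
                total += 1.0
            else:
                # Only the first R non-relevant documents count.
                total += 1.0 - min(nonrel_seen, R) / denom
        # else: unjudged document, contributes nothing either way

    # Relevant documents that were never retrieved contribute 0.
    return total / R


if __name__ == "__main__":
    run = ["d1", "u1", "d2", "d3", "d4"]   # "u1" is unjudged
    rel = {"d1", "d3", "d5"}               # "d5" is never retrieved
    nonrel = {"d2", "d4"}
    print(bpref(run, rel, nonrel))
```

Because unjudged documents are skipped, a system that retrieves many documents outside the pool is not automatically treated as having retrieved non-relevant material. This is the property that makes measures of this kind a natural baseline when judgements are incomplete.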
