Alternatives to Bpref

Author: Tetsuya Sakai

DOI: 10.1145/1277741.1277756

Keywords:

Abstract: Recently, a number of TREC tracks have adopted a retrieval effectiveness metric called bpref, which was designed for evaluation environments with incomplete relevance data. A graded-relevance version of bpref, called rpref, has also been proposed. However, we show that applying Q-measure, normalised Discounted Cumulative Gain (nDCG) or Average Precision (AveP) to condensed lists, obtained by filtering out all unjudged documents from the original ranked lists, is actually a better solution to the incompleteness problem than bpref. Furthermore, the use of graded relevance boosts the robustness of IR evaluation, and therefore Q-measure and nDCG based on condensed lists are the best choices. To this end, we use four test collections from NTCIR to compare ten different metrics in terms of system ranking stability and pairwise discriminative power.
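
The condensed-list idea is simple to illustrate. Below is a minimal, hypothetical Python sketch (not code from the paper): unjudged documents are removed from a run's ranked list before a standard metric is computed, here binary Average Precision for simplicity; all names and the toy data are illustrative assumptions.

# Condensed-list evaluation sketch (illustrative; binary AveP only).

def condense(ranked_docs, judgments):
    # Keep only documents that have a relevance judgment (judged docs).
    return [d for d in ranked_docs if d in judgments]

def average_precision(ranked_docs, judgments):
    # Binary AveP: judgments maps doc id -> relevance grade; grade > 0 means relevant.
    num_relevant = sum(1 for grade in judgments.values() if grade > 0)
    if num_relevant == 0:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked_docs, start=1):
        if judgments.get(doc, 0) > 0:
            hits += 1
            total += hits / rank
    return total / num_relevant

# Toy data (hypothetical): d9 and d7 are unjudged, d2 is judged non-relevant.
judgments = {"d1": 2, "d2": 0, "d3": 1}
run = ["d9", "d1", "d7", "d2", "d3"]

avep = average_precision(run, judgments)                                  # unjudged treated as non-relevant
avep_condensed = average_precision(condense(run, judgments), judgments)   # condensed-list variant
print(round(avep, 4), round(avep_condensed, 4))                           # 0.45 vs. 0.8333

In the toy example, the condensed-list score ignores the unjudged documents d9 and d7 instead of counting them as non-relevant, which is the behaviour the paper argues handles incomplete relevance data better than bpref.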

References (16)
Emine Yilmaz, Javed A. Aslam. Estimating average precision with incomplete and imperfect judgments. Proceedings of the 15th ACM International Conference on Information and Knowledge Management (CIKM '06), pp. 102-111 (2006). DOI: 10.1145/1183614.1183633
Ian Soboroff, Charles Nicholas, Patrick Cahan. Ranking retrieval systems without relevance judgments. Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 66-73 (2001). DOI: 10.1145/383952.383961
Ellen M. Voorhees, Chris Buckley. The effect of topic set size on retrieval experiment error. Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '02), pp. 316-323 (2002). DOI: 10.1145/564376.564432
Javed A. Aslam, Robert Savell. On the effectiveness of evaluating retrieval systems in the absence of relevance judgments. Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 361-362 (2003). DOI: 10.1145/860435.860501
Javed A. Aslam, Virgil Pavlu, Emine Yilmaz. A statistical method for system evaluation using incomplete judgments. Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '06), pp. 541-548 (2006). DOI: 10.1145/1148170.1148263
Gordon V. Cormack, Christopher R. Palmer, Charles L. A. Clarke. Efficient construction of large test collections. Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 282-289 (1998). DOI: 10.1145/290941.291009
Tetsuya Sakai. Evaluating evaluation metrics based on the bootstrap. Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '06), pp. 525-532 (2006). DOI: 10.1145/1148170.1148261
Kalervo Järvelin, Jaana Kekäläinen. Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems, vol. 20, pp. 422-446 (2002). DOI: 10.1145/582415.582418
Justin Zobel. How reliable are the results of large-scale information retrieval experiments? Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 307-314 (1998). DOI: 10.1145/290941.291014
Ian Soboroff. Dynamic test collections. Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '06), pp. 276-283 (2006). DOI: 10.1145/1148170.1148220