How reliable are annotations via crowdsourcing: a study about inter-annotator agreement for multi-label image annotation

Authors: Stefanie Nowak, Stefan Rüger

DOI: 10.1145/1743384.1743478

Keywords:

Abstract: The creation of golden standard datasets is a costly business. Optimally, more than one judgment per document is obtained to ensure high quality of the annotations. In this context, we explore how much annotations from experts differ from each other, how different annotation sets influence the ranking of systems, and whether such annotations can be obtained with a crowdsourcing approach. The study is applied to images annotated with multiple concepts. A subset of the data employed in the latest ImageCLEF Photo Annotation competition was manually annotated by expert annotators and by non-experts on Mechanical Turk. The inter-annotator agreement is computed at an image-based and a concept-based level using majority vote, accuracy and kappa statistics. Further, the Kendall τ and the Kolmogorov-Smirnov correlation tests are used to compare the rankings of systems under different ground truths and evaluation measures in a benchmark scenario. Results show that while the agreement between experts and non-experts varies depending on the measure used, its influence on the ranked lists of systems is rather small. To sum up, a majority vote applied to generate one annotation set out of several opinions is able to filter noisy judgments of non-experts to some extent, and the resulting annotation set is of comparable quality to the annotations of experts.
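
To make the statistics named in the abstract concrete, the following minimal Python sketch (not the authors' code; all labels and scores below are invented for illustration) computes percent agreement, Cohen's kappa, a majority vote over several judgments, and the Kendall τ and Kolmogorov-Smirnov comparisons of system scores, assuming NumPy, SciPy and scikit-learn are available.

import numpy as np
from scipy.stats import kendalltau, ks_2samp
from sklearn.metrics import cohen_kappa_score

# Hypothetical binary labels from an expert and a Mechanical Turk worker
# for one concept over ten images.
expert = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 1])
turker = np.array([1, 0, 0, 1, 0, 1, 1, 0, 1, 1])

accuracy = np.mean(expert == turker)       # simple percent agreement
kappa = cohen_kappa_score(expert, turker)  # chance-corrected agreement (Cohen, 1960)

# Majority vote over several non-expert judgments
# (rows = annotators, columns = images; all values hypothetical).
votes = np.array([[1, 0, 1],
                  [1, 1, 1],
                  [0, 0, 1]])
majority = (votes.sum(axis=0) * 2 > votes.shape[0]).astype(int)  # 1 if more than half voted 1

# Hypothetical benchmark scores of five systems under two ground truths.
scores_expert_gt = np.array([0.61, 0.55, 0.48, 0.70, 0.52])
scores_turk_gt   = np.array([0.60, 0.57, 0.45, 0.69, 0.50])

tau, _ = kendalltau(scores_expert_gt, scores_turk_gt)    # rank correlation of system orderings
ks_stat, _ = ks_2samp(scores_expert_gt, scores_turk_gt)  # difference between score distributions

print(f"accuracy={accuracy:.2f} kappa={kappa:.2f} tau={tau:.2f} KS={ks_stat:.2f}")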
