Authors: Stefanie Nowak, Stefan Rüger
Keywords:
Abstract: The creation of golden standard datasets is a costly business. Optimally, more than one judgment per document is obtained to ensure high annotation quality. In this context, we explore how much annotations from experts differ from each other, how different sets of annotations influence the ranking of systems, and whether these annotations can be obtained with a crowdsourcing approach. This study is applied to images annotated with multiple concepts. A subset of the images employed in the latest ImageCLEF Photo Annotation competition was manually annotated by expert annotators and by non-experts on Mechanical Turk. The inter-annotator agreement is computed at an image-based and a concept-based level using majority vote, accuracy and kappa statistics. Further, the Kendall τ and Kolmogorov-Smirnov correlation tests are used to compare the rankings of systems with respect to different ground-truths and different evaluation measures in a benchmark scenario. Results show that while the agreement between experts and non-experts varies depending on the measure used, its influence on the ranked lists of systems is rather small. To sum up, the majority vote, applied to generate one annotation set out of several opinions, is able to filter noisy judgments of non-experts to some extent. The resulting annotation set is of comparable quality to the annotations of experts.
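A minimal illustrative sketch (not the authors' code) of the kinds of computations the abstract mentions: collapsing several judgments per image into one label by majority vote, measuring chance-corrected agreement with Cohen's kappa, and comparing two system rankings with Kendall τ. All data values, the `majority_vote` helper, and the tie-breaking rule are invented for illustration.

```python
# Illustrative sketch, assuming binary per-concept labels; not the authors' code.
import numpy as np
from scipy.stats import kendalltau
from sklearn.metrics import cohen_kappa_score

# Hypothetical binary labels for one concept: rows = annotators, columns = images.
expert_votes = np.array([[1, 0, 1, 1, 0],
                         [1, 0, 1, 0, 0],
                         [1, 1, 1, 1, 0]])
turk_votes   = np.array([[1, 0, 0, 1, 0],
                         [1, 0, 0, 1, 1],
                         [0, 0, 1, 1, 0]])

def majority_vote(votes: np.ndarray) -> np.ndarray:
    """Collapse several opinions per image into one label (ties counted as positive)."""
    return (votes.mean(axis=0) >= 0.5).astype(int)

expert_gt = majority_vote(expert_votes)  # aggregated expert ground-truth
turk_gt = majority_vote(turk_votes)      # aggregated crowdsourced ground-truth

# Chance-corrected agreement between the two aggregated annotation sets.
kappa = cohen_kappa_score(expert_gt, turk_gt)

# Rank correlation between two hypothetical system orderings, e.g. one obtained
# per ground-truth; values near 1 mean the benchmark ranking barely changes.
ranking_expert_gt = [1, 2, 3, 4, 5]
ranking_turk_gt = [1, 3, 2, 4, 5]
tau, p_value = kendalltau(ranking_expert_gt, ranking_turk_gt)

print(f"kappa={kappa:.2f}, kendall_tau={tau:.2f} (p={p_value:.2f})")
```

The same pattern extends to the image-based and concept-based evaluation described in the abstract by repeating the aggregation and agreement computation per image or per concept.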