Author: David Martin Ward Powers
DOI:
Keywords: Kappa, Context (language use), Fleiss' kappa, Statistics, Computer science, Set (abstract data type), Skew, Receiver operating characteristic, Cohen's kappa, Algorithm, Computational linguistics
Abstract: It is becoming clear that traditional evaluation measures used in Computational Linguistics (including Error Rates, Accuracy, Recall, Precision and F-measure) are of limited value for unbiased evaluation of systems, and are not meaningful for comparison of algorithms unless both the dataset and algorithm parameters are strictly controlled for skew (Prevalence and Bias). The use of techniques originally designed for other purposes, in particular Receiver Operating Characteristics Area Under Curve, plus variants of Kappa, has been proposed to fill the void. This paper aims to clear up some of the confusion relating to evaluation, by demonstrating that the usefulness of each evaluation method is highly dependent on the assumptions made about the distributions of the dataset and the underlying populations. The behaviour of a number of evaluation measures is compared under common assumptions. Deploying a system in a context which has the opposite skew from its validation set can be expected to approximately negate Fleiss Kappa and halve Cohen Kappa but leave Powers Kappa unchanged. For most performance evaluation purposes, the latter is thus most appropriate, whilst for comparison of behaviour, Matthews Correlation is recommended.
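The measures named in the abstract can be illustrated with a minimal Python sketch for the dichotomous (2x2) case. The formulas below are the commonly used textbook definitions, not text taken from this record: Fleiss Kappa is assumed in its two-rater form (expected agreement from the mean of the two marginals), and Powers Kappa is assumed to be Informedness (recall plus inverse recall minus one). The function name `chance_corrected_measures` and the example counts are hypothetical.

```python
from math import sqrt


def chance_corrected_measures(tp, fn, fp, tn):
    """Return Cohen Kappa, Fleiss Kappa, Informedness and Matthews
    Correlation for a 2x2 contingency table of counts.

    Assumes every row and column total is non-zero; a degenerate
    table would raise ZeroDivisionError.
    """
    n = tp + fn + fp + tn
    accuracy = (tp + tn) / n
    prevalence = (tp + fn) / n  # proportion of real positives
    bias = (tp + fp) / n        # proportion of predicted positives

    # Cohen: chance agreement from the product of the two marginals.
    pe_cohen = prevalence * bias + (1 - prevalence) * (1 - bias)
    cohen = (accuracy - pe_cohen) / (1 - pe_cohen)

    # Fleiss (two-rater form, assumed): chance agreement from the
    # mean of the real and predicted marginals.
    mean_pos = (prevalence + bias) / 2
    pe_fleiss = mean_pos ** 2 + (1 - mean_pos) ** 2
    fleiss = (accuracy - pe_fleiss) / (1 - pe_fleiss)

    # Informedness (assumed to be "Powers Kappa"): recall + inverse recall - 1.
    informedness = tp / (tp + fn) + tn / (tn + fp) - 1

    # Matthews Correlation: correlation between real and predicted labels.
    mcc = (tp * tn - fp * fn) / sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )

    return {
        "cohen_kappa": cohen,
        "fleiss_kappa": fleiss,
        "informedness": informedness,
        "matthews_correlation": mcc,
    }


if __name__ == "__main__":
    # Hypothetical counts, chosen only to exercise the function.
    for name, value in chance_corrected_measures(40, 10, 20, 30).items():
        print(f"{name}: {value:.4f}")
```

Informedness is computed from the two per-class recalls alone, whereas both Kappa variants also depend on the marginals, which is one way to see why the abstract treats them differently when skew changes.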