DOI: 10.1016/J.TRAC.2006.10.005
Keywords:
Abstract: This article discusses problems of validating classification models, especially in datasets where sample sizes are small and the number of variables is large. It describes the use of percentage correctly classified (%CC) as an indicator of the success of a model. For small datasets, %CC should not be used uncritically, and its interpretation depends on sample size. The article illustrates a common method, discriminant partial least squares (D-PLS), on a randomly generated dataset of 200 samples and 200 variables. An aim of the classifier is to determine whether the null hypothesis (that there is no distinction between the two classes) can be rejected. Autoprediction gives 84.5% CC. It is shown that, if there is variable selection, it must be performed independently on the training set to obtain a %CC close to 50% on the test set; otherwise, over-optimistic and false conclusions can be reached about the ability to classify samples into groups. Finally, two aims of determining the quality of a model are frequently confused, namely optimisation (often to determine the most appropriate number of components in the model) and independent validation; to overcome this, the data should be split into three groups. There are often difficulties with model building if validation and optimisation have been done on different groups of samples, especially when using iterative methods, with each group being modelled using different properties, such as the number of components or variables.
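The variable-selection pitfall described in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the article's procedure: a simple nearest-centroid classifier stands in for D-PLS, variables are ranked by the absolute difference in class means, and the number of retained variables (`k=10`), the 50/50 training/test split, the number of repeats, and all function names are assumptions made for illustration only. The 200 x 200 data size follows the abstract as reconstructed above.

```python
import random

def make_random_data(n_samples, n_vars, rng):
    """Pure-noise data with arbitrary balanced labels, so the null hypothesis is true."""
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_vars)] for _ in range(n_samples)]
    y = [i % 2 for i in range(n_samples)]
    return X, y

def class_means(X, y, rows, cols):
    """Mean of each variable in `cols` within class 0 and class 1, using only `rows`."""
    means = []
    for label in (0, 1):
        members = [i for i in rows if y[i] == label]
        means.append([sum(X[i][j] for i in members) / len(members) for j in cols])
    return means

def select_variables(X, y, rows, k):
    """Keep the k variables with the largest absolute difference in class means."""
    all_cols = list(range(len(X[0])))
    m0, m1 = class_means(X, y, rows, all_cols)
    ranked = sorted(all_cols, key=lambda j: abs(m0[j] - m1[j]), reverse=True)
    return ranked[:k]

def percent_cc(X, y, train, test, cols):
    """%CC on `test` for a nearest-centroid classifier fitted on `train` (not D-PLS)."""
    m0, m1 = class_means(X, y, train, cols)
    correct = 0
    for i in test:
        d0 = sum((X[i][j] - a) ** 2 for j, a in zip(cols, m0))
        d1 = sum((X[i][j] - b) ** 2 for j, b in zip(cols, m1))
        predicted = 0 if d0 <= d1 else 1
        correct += predicted == y[i]
    return 100.0 * correct / len(test)

def run_experiment(n_repeats=10, n_samples=200, n_vars=200, k=10, seed=0):
    """Average %CC over several random datasets for three evaluation schemes."""
    rng = random.Random(seed)
    totals = {"autoprediction": 0.0, "selection_on_all": 0.0, "selection_on_train": 0.0}
    for _ in range(n_repeats):
        X, y = make_random_data(n_samples, n_vars, rng)
        rows = list(range(n_samples))
        rng.shuffle(rows)
        train, test = rows[: n_samples // 2], rows[n_samples // 2 :]
        all_cols = list(range(n_vars))
        # Autoprediction: fit and evaluate on the same samples, all variables.
        totals["autoprediction"] += percent_cc(X, y, rows, rows, all_cols)
        # Biased: variables chosen using ALL samples, including the test set.
        biased_cols = select_variables(X, y, rows, k)
        totals["selection_on_all"] += percent_cc(X, y, train, test, biased_cols)
        # Independent: variables chosen using the training set only.
        clean_cols = select_variables(X, y, train, k)
        totals["selection_on_train"] += percent_cc(X, y, train, test, clean_cols)
    return {name: total / n_repeats for name, total in totals.items()}

if __name__ == "__main__":
    for name, value in run_experiment().items():
        print(f"{name}: {value:.1f} %CC")
```

Because the data are pure noise, the `selection_on_train` figure should sit near 50% CC, while autoprediction and selection on all samples are expected to look noticeably better than chance. A full treatment along the lines of the abstract would add a third group of samples, so that optimisation (for example, choosing the number of components) is kept separate from the final validation.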