Authors: Sathi T. Marath, Michael Shepherd, Evangelos Milios, Jack Duffy
Keywords:
Abstract: This research investigates the design of a unified framework for the content-based classification of highly imbalanced hierarchical datasets, such as web directories. In an imbalanced dataset, the prior probability distribution of a category indicates the presence or absence of class imbalance. This imbalance may include a lack of positive training instances (rarity) or an overabundance of instances. We partitioned the subcategories of the Yahoo! directory into five mutually exclusive groups based on their prior probability distribution. The best-performing methods for a particular distribution were identified and used to model the complete directory (as of 2007) of 639,671 categories and 4,140,629 pages. The methodology was validated using a DMOZ subset of 17,217 categories and 130,594 pages, and we demonstrated statistically that this methodology works equally well on large and small datasets.