Large-Scale Web Page Classification

Authors: Sathi T. Marath, Michael Shepherd, Evangelos Milios, Jack Duffy

DOI: 10.1109/HICSS.2014.229

Abstract: This research investigates the design of a unified framework for the content-based classification of highly imbalanced hierarchical datasets, such as web directories. In an imbalanced dataset, the prior probability distribution of a category indicates the presence or absence of class imbalance. This imbalance may include the lack of positive training instances (rarity) or an overabundance of positive instances. We partitioned the subcategories of the Yahoo! web directory into five mutually exclusive groups based on the prior probability distribution. The best performing classification methods for a particular prior probability distribution were identified and used to model the complete Yahoo! web directory (as of 2007) of 639,671 categories and 4,140,629 web pages. The methodology was validated using a DMOZ subset of 17,217 categories and 130,594 web pages, and we demonstrated statistically that this methodology works equally well on large and small datasets.
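The grouping step described in the abstract can be illustrated with a minimal sketch. It assumes that a category's prior is simply its share of all labeled pages in a flat list, and it uses made-up cut points (`HYPOTHETICAL_BOUNDS`). The abstract does not state the actual group boundaries or how priors are computed within the hierarchy, so the function names and thresholds below are illustrative only.

```python
from collections import Counter

# Hypothetical cut points on the prior P(category). The abstract does not
# give the real boundaries of the five groups; these are placeholders.
HYPOTHETICAL_BOUNDS = [0.001, 0.01, 0.1, 0.5]


def category_priors(labels):
    """Return P(category) = pages in the category / total pages."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cat: n / total for cat, n in counts.items()}


def assign_group(prior, bounds=HYPOTHETICAL_BOUNDS):
    """Map a prior to one of len(bounds) + 1 mutually exclusive groups.

    Group 0 holds the rarest categories; the last group holds the most
    abundant ones.
    """
    for group, upper in enumerate(bounds):
        if prior < upper:
            return group
    return len(bounds)


def partition_by_prior(labels, bounds=HYPOTHETICAL_BOUNDS):
    """Assign every category seen in `labels` to a prior-probability group."""
    priors = category_priors(labels)
    return {cat: assign_group(p, bounds) for cat, p in priors.items()}


if __name__ == "__main__":
    # Toy label list: one entry per page, value = category name.
    pages = ["a"] * 600 + ["b"] * 350 + ["c"] * 45 + ["d"] * 5
    print(partition_by_prior(pages))
```

With four cut points, every category falls into exactly one of five groups, so a different classifier could then be chosen per group, as the abstract describes.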
