Authors: Hieu Quang Le, Stefan Conrad
DOI:
Keywords:
Abstract: This paper studies the problem of classifying structured data sources on the Web. While prior works use all features once they are extracted from search interfaces, we further refine the feature set. In our research, each search interface is treated simply as a bag of words. We choose a subset of words suited to classifying web sources by applying feature selection methods with new metrics and a novel, simple ranking scheme. Using an aggressive feature selection approach together with a Gaussian process classifier, we obtained high classification performance in an evaluation over real data.
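
The general idea of treating each search interface as a bag of words and keeping only a small, highly ranked subset of words can be illustrated with a minimal sketch. The abstract does not specify the paper's metrics or ranking scheme, so the score below (the gap between a word's document frequency inside a class and outside it) is only a stand-in assumption, and the function names, sample interface texts, and class labels are hypothetical. The Gaussian process classifier is omitted; the selected words would serve as its input features.

```python
def bag_of_words(text):
    """Return the set of distinct lowercase words in an interface's text."""
    return set(text.lower().split())


def rank_words(docs, labels):
    """Rank words by a stand-in metric (NOT the paper's metrics):
    the largest gap, over all classes, between a word's document
    frequency inside the class and outside it."""
    bags = [bag_of_words(d) for d in docs]
    vocab = set().union(*bags)
    classes = sorted(set(labels))
    scores = {}
    for word in vocab:
        best = 0.0
        for c in classes:
            inside = [b for b, l in zip(bags, labels) if l == c]
            outside = [b for b, l in zip(bags, labels) if l != c]
            p_in = sum(word in b for b in inside) / len(inside)
            p_out = sum(word in b for b in outside) / len(outside) if outside else 0.0
            best = max(best, abs(p_in - p_out))
        scores[word] = best
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(vocab, key=lambda w: (-scores[w], w))


def select_features(docs, labels, k):
    """Aggressive selection: keep only the k top-ranked words."""
    return rank_words(docs, labels)[:k]


# Hypothetical search-interface texts for two domains.
docs = [
    "title author isbn search",
    "author title publisher search",
    "departure arrival date search",
    "departure date passengers search",
]
labels = ["books", "books", "flights", "flights"]

selected = select_features(docs, labels, k=4)
print(selected)
```

In this toy data the word "search" appears in every interface, so it carries no class information and ranks last, while words that occur in only one domain rank first.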