Authors: Bin He, Tao Tao, Kevin Chen-Chuan Chang
Keywords:
Abstract: In recent years, the Web has been rapidly "deepened" with the prevalence of databases online. On this deep Web, many sources are structured by providing structured query interfaces and results. Organizing such structured sources into a domain hierarchy is one of the critical steps toward the integration of heterogeneous Web sources. We observe that, for structured Web sources, query schemas (i.e., attributes in query interfaces) are discriminative representatives of the sources and thus can be exploited for source characterization. In particular, by viewing query schemas as a type of categorical data, we abstract the problem of source organization into the clustering of categorical data. Our approach hypothesizes that "homogeneous sources" are characterized by the same hidden generative models for their schemas. To find clusters governed by such statistical distributions, we propose a new objective function, model-differentiation, which employs principled hypothesis testing to maximize statistical heterogeneity among clusters. Our evaluation over hundreds of real sources indicates that (1) the schema-based clustering accurately organizes sources by object domains (e.g., Books, Movies), and (2) on clustering Web query schemas, the model-differentiation function outperforms existing ones, such as likelihood, entropy, and context linkages, with the hierarchical agglomerative clustering algorithm.
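
The idea of clustering query schemas as categorical data can be illustrated with a small sketch. It is not the paper's model-differentiation function: as a simplified stand-in, it scores two clusters by the Pearson chi-square statistic of their attribute counts divided by the total count (0 for identical attribute distributions, 1 for disjoint ones), and greedily merges the least heterogeneous pair in a hierarchical agglomerative loop. The function names `heterogeneity` and `cluster_schemas`, the target cluster count `k`, and the toy schemas are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def heterogeneity(counts_a, counts_b):
    """Pearson chi-square of the 2 x m attribute-count table, divided by N.

    Assumed stand-in for a hypothesis-test-based score: 0.0 means the two
    clusters have identical attribute distributions, 1.0 means no overlap.
    Both clusters must contain at least one attribute.
    """
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    total = total_a + total_b
    stat = 0.0
    for attr in set(counts_a) | set(counts_b):
        column = counts_a[attr] + counts_b[attr]
        for observed, row in ((counts_a[attr], total_a), (counts_b[attr], total_b)):
            expected = row * column / total
            stat += (observed - expected) ** 2 / expected
    return stat / total


def cluster_schemas(schemas, k):
    """Agglomerative clustering: merge the least heterogeneous pair until k remain."""
    counts = [Counter(schema) for schema in schemas]   # attribute counts per cluster
    members = [[i] for i in range(len(schemas))]       # schema indices per cluster
    while len(counts) > k:
        i, j = min(
            combinations(range(len(counts)), 2),
            key=lambda pair: heterogeneity(counts[pair[0]], counts[pair[1]]),
        )
        counts[i] += counts[j]      # i < j, so deleting j keeps index i valid
        members[i] += members[j]
        del counts[j], members[j]
    return members


# Hypothetical query schemas: two book sources and two movie sources.
schemas = [
    ["title", "author", "isbn"],
    ["title", "author", "publisher"],
    ["title", "director", "actor"],
    ["title", "actor", "genre"],
]
clusters = cluster_schemas(schemas, k=2)
```

In this toy input the two book schemas share more attributes with each other than with the movie schemas, so they end up in one cluster and the movie schemas in the other. A fixed `k` is used only to keep the sketch short; a test-based stopping rule would be closer in spirit to the hypothesis-testing approach the abstract describes.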