Organizing structured web sources by query schemas: a clustering approach

Authors: Bin He, Tao Tao, Kevin Chen-Chuan Chang

DOI: 10.1145/1031171.1031178

Abstract: In recent years, the Web has been rapidly "deepened" by the prevalence of online databases. On this deep Web, many sources are structured: they provide structured query interfaces and results. Organizing such structured sources into a domain hierarchy is one of the critical steps toward the integration of heterogeneous Web sources. We observe that, for structured Web sources, query schemas (i.e., attributes in query interfaces) are discriminative representatives of the sources and thus can be exploited for source characterization. In particular, by viewing query schemas as a type of categorical data, we abstract the problem of source organization into the clustering of categorical data. Our approach hypothesizes that "homogeneous sources" are characterized by the same hidden generative models for their schemas. To find clusters governed by such statistical distributions, we propose a new objective function, model-differentiation, which employs principled hypothesis testing to maximize statistical heterogeneity among clusters. Our evaluation over hundreds of real sources indicates that (1) the schema-based clustering accurately organizes sources by object domains (e.g., Books, Movies), and (2) on clustering Web query schemas, the model-differentiation function outperforms existing ones, such as likelihood, entropy, and context linkages, with the hierarchical agglomerative clustering algorithm.
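The clustering idea in the abstract can be illustrated with a small sketch. It is not the authors' model-differentiation function: as a stand-in, it assumes a Pearson chi-square statistic, divided by the total attribute count, as the measure of how strongly two clusters' attribute distributions differ, and it greedily merges the least-different pair. The names `chi_square` and `cluster_schemas`, the toy schemas, and the stopping rule (a fixed number of clusters `k`) are all hypothetical choices made for illustration. Every schema is assumed to be non-empty.

```python
from collections import Counter
from itertools import combinations


def chi_square(a: Counter, b: Counter) -> float:
    """Pearson chi-square statistic for the 2 x |attributes| table built from
    two attribute-count vectors (assumes both have at least one attribute)."""
    total_a, total_b = sum(a.values()), sum(b.values())
    total = total_a + total_b
    stat = 0.0
    for attr in set(a) | set(b):
        col = a[attr] + b[attr]  # Counter returns 0 for missing attributes
        for observed, row in ((a[attr], total_a), (b[attr], total_b)):
            expected = row * col / total
            stat += (observed - expected) ** 2 / expected
    return stat


def cluster_schemas(schemas, k):
    """Greedy agglomerative clustering of query schemas into k clusters.

    Each schema is a list of attribute names. At every step the two clusters
    with the smallest size-normalized chi-square statistic are merged.
    """
    # Each cluster is (member indices, pooled attribute counts).
    clusters = [([i], Counter(s)) for i, s in enumerate(schemas)]
    while len(clusters) > k:
        def score(pair):
            ca, cb = clusters[pair[0]][1], clusters[pair[1]][1]
            # Assumption: normalize by total count so cluster size alone
            # does not inflate the statistic.
            return chi_square(ca, cb) / (sum(ca.values()) + sum(cb.values()))

        i, j = min(combinations(range(len(clusters)), 2), key=score)
        merged = (clusters[i][0] + clusters[j][0],
                  clusters[i][1] + clusters[j][1])
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return [sorted(members) for members, _ in clusters]


# Toy query schemas: two book-like and two movie-like interfaces.
example = [
    ["title", "author", "isbn"],
    ["title", "author", "publisher"],
    ["title", "director", "actor"],
    ["title", "actor", "genre"],
]
print(cluster_schemas(example, 2))  # [[0, 1], [2, 3]]
```

On this toy input the shared attribute `title` contributes nothing to the statistic, so the domain-specific attributes (`author`, `actor`) drive the grouping, which matches the abstract's claim that query schemas are discriminative for source domains.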
