Statistical schema matching across web query interfaces

Authors: Bin He, Kevin Chen-Chuan Chang

DOI: 10.1145/872757.872784

Keywords: Matching (statistics), Data mining, Information retrieval, Schema (psychology), Computer science, Semi-structured model, Web search query, Schema matching, Star schema, Synonym (database), Information schema

Abstract: Schema matching is a critical problem for integrating heterogeneous information sources. Traditionally, the problem of matching multiple schemas has essentially relied on finding pairwise-attribute correspondence. This paper proposes a different approach, motivated by integrating large numbers of data sources on the Internet. On this "deep Web," we observe two distinguishing characteristics that offer a new view for considering schema matching: First, as the Web scales, there are ample sources that provide structured information in the same domains (e.g., books and automobiles). Second, while sources proliferate, their aggregate schema vocabulary tends to converge at a relatively small size. Motivated by these observations, we propose a new paradigm, statistical schema matching: Unlike traditional approaches using pairwise-attribute correspondence, we take a holistic approach to match all input schemas by finding an underlying generative schema model. We propose a general statistical framework MGS for such hidden model discovery, which consists of hypothesis modeling, generation, and selection. Further, we specialize the general framework to develop Algorithm MGSsd, targeting synonym discovery, a canonical problem of schema matching, by designing and discovering a model that specifically captures synonym attributes. We demonstrate our approach over hundreds of real Web sources in four domains, and the results show good accuracy.
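The second observation in the abstract, that the aggregate schema vocabulary converges as sources proliferate, can be illustrated with a small sketch. The code below is not the paper's MGS framework or the MGSsd algorithm; it only counts how many distinct attribute names have been seen after each additional query interface. The function name `vocabulary_growth` and the toy "books" schemas are hypothetical and chosen purely for illustration.

```python
def vocabulary_growth(schemas):
    """Return the cumulative count of distinct attribute names after each schema."""
    seen = set()
    growth = []
    for schema in schemas:
        # Normalize names lightly so "Title" and "title " count as one attribute.
        seen.update(attr.strip().lower() for attr in schema)
        growth.append(len(seen))
    return growth


# Hypothetical toy query-interface schemas for a "books" domain (not data from the paper).
toy_schemas = [
    ["title", "author", "isbn"],
    ["title", "author", "subject"],
    ["title", "writer", "isbn"],
    ["author", "subject", "isbn"],
    ["title", "author", "category"],
    ["title", "writer", "subject"],
]

growth = vocabulary_growth(toy_schemas)
print(growth)  # [3, 4, 5, 5, 6, 6]
```

In this toy input, 18 attribute occurrences across six interfaces yield only six distinct names, and the curve flattens as sources are added. A flattening curve of this kind is what the abstract means by a vocabulary that converges at a relatively small size.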

References (18)
Leonard J. Seligman, Arnon Rosenthal, Paul E. Lehner, Angela Smith. Data Integration: Where Does the Time Go? IEEE Data(base) Engineering Bulletin, vol. 25, pp. 3-10 (2002)
T. H. Cormen, R. L. Rivest, C. E. Leiserson, C. Stein. Introduction to Algorithms, 2nd edition (2001)
Zachary G. Ives, Oren Etzioni, Luke McDowell, Igor Tatarinov, Alon Halevy, AnHai Doan, Jayant Madhavan. Crossing the Structure Chasm. Conference on Innovative Data Systems Research (2003)
Shamkant B. Navathe, Suresh G. Gadgil. A Methodology for View Integration in Logical Database Design. Very Large Data Bases, pp. 142-164 (1982)
Bin He, Tao Tao, Kevin Chen-Chuan Chang. Clustering structured web sources: a schema-based, model-differentiation approach. Extending Database Technology, vol. 3268, pp. 536-546 (2004), DOI: 10.1007/978-3-540-30192-9_53
Kjell A. Doksum, Peter J. Bickel. Mathematical Statistics: Basic Ideas and Selected Topics (1977)
Kevin Chen-Chuan Chang, Bin He, Chengkai Li, Mitesh Patel, Zhen Zhang. Structured databases on the web: observations and implications. International Conference on Management of Data, vol. 33, pp. 61-70 (2004), DOI: 10.1145/1031570.1031584
Erhard Rahm, Philip A. Bernstein. A survey of approaches to automatic schema matching. Very Large Data Bases, vol. 10, pp. 334-350 (2001), DOI: 10.1007/S007780100057
William W. Cohen. Integration of heterogeneous databases without common domains using queries based on textual similarity. Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data (SIGMOD '98), vol. 27, pp. 201-212 (1998), DOI: 10.1145/276304.276323
A. P. Dempster, N. M. Laird, D. B. Rubin. Maximum Likelihood from Incomplete Data Via the EM Algorithm. Journal of the Royal Statistical Society: Series B (Methodological), vol. 39, pp. 1-22 (1977), DOI: 10.1111/J.2517-6161.1977.TB01600.X