Authors: Bin He, Kevin Chen-Chuan Chang
Keywords: Matching (statistics), Data mining, Information retrieval, Schema (psychology), Computer science, Semi-structured model, Web search query, Schema matching, Star schema, Synonym (database), Information schema
Abstract: Schema matching is a critical problem for integrating heterogeneous information sources. Traditionally, the problem of matching multiple schemas has essentially relied on finding pairwise-attribute correspondence. This paper proposes a different approach, motivated by integrating large numbers of data sources on the Internet. On this "deep Web," we observe two distinguishing characteristics that offer a new view for considering schema matching: First, as the Web scales, there are ample sources that provide structured information in the same domains (e.g., books and automobiles). Second, while sources proliferate, their aggregate schema vocabulary tends to converge at a relatively small size. Motivated by these observations, we propose a new paradigm, statistical schema matching: Unlike traditional approaches using pairwise-attribute correspondence, we take a holistic approach to match all input schemas by finding an underlying generative schema model. We propose a general statistical framework, MGS, for such hidden model discovery, which consists of hypothesis modeling, generation, and selection. Further, we specialize the general framework to develop Algorithm MGSsd, targeting synonym discovery, a canonical problem of schema matching, by designing and discovering a model that specifically captures synonym attributes. We demonstrate our approach over hundreds of real Web sources in four domains, and the results show good accuracy.
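
The second observation in the abstract, that the aggregate vocabulary converges as sources proliferate, can be illustrated with a minimal sketch. This is not the MGS framework or Algorithm MGSsd. The function name `vocabulary_growth` and the toy "books" schemas are hypothetical and are not data from the paper. The sketch only counts how many distinct attribute names have been seen after each additional schema.

```python
from typing import Iterable, List, Set


def vocabulary_growth(schemas: Iterable[Set[str]]) -> List[int]:
    """Return the cumulative count of distinct attribute names after each schema."""
    seen: Set[str] = set()
    growth: List[int] = []
    for attrs in schemas:
        seen |= attrs              # merge this source's attributes into the vocabulary
        growth.append(len(seen))   # record the vocabulary size so far
    return growth


# Hypothetical toy schemas for a "books" domain (illustrative only).
toy_schemas = [
    {"title", "author", "isbn"},
    {"title", "author", "publisher"},
    {"title", "writer", "isbn"},
    {"author", "isbn", "publisher"},
    {"title", "writer", "publisher"},
]

growth = vocabulary_growth(toy_schemas)
print(growth)
```

In this toy input the count stops growing after the third schema, which is the kind of flattening the abstract describes. Whether and how quickly real sources converge is an empirical question that the paper addresses with its own data.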