Aligning database columns using mutual information

Authors: Andrew Philpot, Eduard Hovy, Patrick Pantel

DOI: 10.5555/1065226.1065285

Abstract: As with many large organizations, the Government's data is split in many different ways and is collected at different times by different people. The resulting massive data heterogeneity means that government staff cannot effectively locate, share, or compare data across sources, let alone achieve computational data interoperability. A case in point is the California Air Resources Board (CARB), which is faced with the challenge of integrating the emissions inventory databases belonging to California's 35 air quality management districts to create a state inventory. This inventory must be submitted annually to the US EPA which, in turn, must perform quality assurance tests on these inventories and integrate them into a national emissions inventory for use in tracking the effects of national air quality policies. The premise of our research is that it is possible to significantly reduce the amount of manual labor required in database wrapping and integration by automatically learning mappings in the data. In this research, we applied statistical algorithms to discover correspondences across comparable datasets. We have seen particular success with an information theoretic model, called SIfT (Significance Information for Translation), that performs data-driven column alignments. We applied SIfT to the task of mapping the Santa Barbara County Air Pollution Control District's 2001 emissions inventory database to the CARB statewide inventory database. A fully customizable interface to the SIfT toolkit is available at http://sift.isi.edu/, allowing users to create new alignments and to navigate and inspect alignment decisions. On a broader scale, this work makes strides toward appeasing a central problem of legacy data.
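The abstract does not spell out how SIfT scores candidate alignments, so the following is only a minimal sketch of the general idea of data-driven column alignment with mutual information, not the authors' method. It assumes each table is a dictionary from column name to a list of cell values, weights every (column, value) pair by its pointwise mutual information within its own table, and pairs each source column with the target column whose weighted value vector is most similar by cosine. The function names `pmi_vectors`, `cosine`, and `align_columns`, and the sample data, are hypothetical.

```python
import math
from collections import Counter


def pmi_vectors(table):
    """Map each column name to a {value: PMI(column, value)} vector.

    PMI(c, v) = log( P(c, v) / (P(c) * P(v)) ), estimated from cell counts
    within a single table.
    """
    counts = {col: Counter(vals) for col, vals in table.items()}
    total = sum(sum(c.values()) for c in counts.values())
    value_totals = Counter()
    for c in counts.values():
        value_totals.update(c)
    vectors = {}
    for col, c in counts.items():
        col_total = sum(c.values())
        vectors[col] = {
            v: math.log((n * total) / (col_total * value_totals[v]))
            for v, n in c.items()
        }
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def align_columns(source, target):
    """Pair each source column with its most similar target column."""
    src = pmi_vectors(source)
    tgt = pmi_vectors(target)
    return {s: max(tgt, key=lambda t: cosine(sv, tgt[t])) for s, sv in src.items()}


# Hypothetical toy tables; the column names and values are invented.
district = {"fac_name": ["ACME", "BOLT", "ACME"], "pollutant": ["NOX", "CO", "NOX"]}
statewide = {"FNAME": ["BOLT", "ACME", "ACME"], "POL": ["CO", "NOX", "SOX"]}
print(align_columns(district, statewide))
```

A sketch like this only matches columns that share literal values. Real inventories would likely need value normalization and a threshold below which a column is left unaligned.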
