作者: Andrew Philpot , Eduard Hovy , Patrick Pantel
关键词:
摘要: As with many large organizations, the Government's data is split in different ways and collected at times by people. The resulting massive heterogeneity means government staff cannot effectively locate, share, or compare across sources, let alone achieve computational interoperability. A case point California Air Resources Board (CARB), which faced challenge of integrating emissions inventory databases belonging to California's 35 air quality management districts create a state inventory. This must be submitted annually US EPA which, turn, perform assurance tests on these inventories integrate them into national for use tracking effects policies. premise our research that it possible significantly reduce amount manual labor required database wrapping integration automatically learning mappings data. In this research, we applied statistical algorithms discover correspondences comparable datasets. We have seen particular success an information theoretic model, called SIfT (Significance Information Translation), performs data-driven column alignments. mapping Santa Barbara County Pollution Control District's 2001 statewide database. fully customizable interface toolkit available http://sift.isi.edu/, allowing users new alignments, navigate inspect alignment decisions. On broader scale, work makes strides toward appeasing central problem legacy