Authors: Ahmad Ahmadov, Maik Thiele, Wolfgang Lehner, Robert Wrembel
Keywords: Missing data, Data quality, Matching methods, Imputation (statistics), Data mining, External data, Web tables, Computer science, Inference
Abstract: Completeness, as one of the four major dimensions of data quality, is a pervasive issue in modern databases. Although data imputation has been studied extensively in the literature, most research has focused on the inference-based approach. We propose to harness Web tables as an external data source to effectively and efficiently retrieve missing data while taking into account the inherent uncertainty and lack of veracity that they contain. Existing approaches mostly rely on standard retrieval techniques and out-of-the-box matching methods, which result in very low precision, especially when dealing with numerical data. We, therefore, propose a novel approach by applying context similarity measures, which results in a significant increase in the precision of the imputation procedure by ensuring that the imputed values are of the same domain and magnitude as the local values, thus resulting in accurate imputation. We use the Dresden Web Table Corpus, which is comprised of more than 125 million web tables extracted from the Common Crawl, as our knowledge source. The comprehensive experimental results demonstrate that the proposed method well outperforms the default