Authors: Nurul A Emran, Suzanne Embury
DOI:
Keywords: Mathematics, Software, Data mining, Population, Data quality, Completeness (statistics), Data set, Population based data, Missing data, Functional dependency
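Abstract: Poor quality data, such as data with errors or missing values, cause negative consequences in many application domains. An important aspect of data quality is completeness. One problem in data completeness is the problem of missing individuals in data sets. Within a data set, the individuals refer to the real world entities whose information is recorded. So far, however, there has been little discussion in completeness studies about how missing individuals are assessed. In this thesis, we propose the notion of population-based completeness (PBC) that deals with the missing individuals problem, with the aim of investigating what is required to measure PBC and to identify what is needed to support PBC measurements in practice. To achieve these aims, we analyse the elements of PBC and the requirements for PBC measurement, resulting in a definition of the PBC elements and a PBC measurement formula. We propose an architecture for PBC measurement systems and determine the technical requirements of PBC systems in terms of software and hardware components. An analysis of the technical issues that arise in implementing PBC makes a contribution to an understanding of the feasibility of PBC measurements to provide accurate measurement results. Further exploration of a particular issue that was discovered in the analysis showed that when measuring PBC across multiple databases, data from those databases need to be integrated and materialised. Unfortunately, this requirement may lead to a large internal store for the PBC system that is impractical to maintain. We propose an approach to test the hypothesis that the available storage space can be optimised by materialising only partial information from the contributing databases, while retaining the accuracy of the PBC measurements. Our approach involves substituting some of the attributes from the contributing databases with smaller alternatives, by exploiting the approximate functional dependencies (AFDs) that can be discovered within each local database. An analysis of the space-accuracy trade-offs of the approach leads to the development of an algorithm to assess candidate alternative attributes in terms of space-saving and accuracy (of PBC measurement). The result of several case studies conducted for proxy assessment contributes to an understanding of the space-accuracy trade-offs offered by the proxies. A better understanding of dealing with the completeness problem has been achieved through the proposal and the investigation of PBC,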