Indeterministic Handling of Uncertain Decisions in Duplicate Detection

Authors: Maurice van Keulen, Fabian Panse, Norbert Ritter

DOI:

Keywords:

Abstract: In current research, duplicate detection is usually considered as a deterministic approach in which tuples are either declared as duplicates or not. However, most often it is not completely clear whether two tuples represent the same real-world entity or not. In deterministic approaches, however, this uncertainty is ignored, which in turn can lead to false decisions. In this paper, we present an indeterministic approach for handling uncertain decisions in a duplicate detection process by using a probabilistic target schema. Thus, instead of deciding between multiple possible worlds, all these worlds can be modeled in the resulting data. This approach minimizes the negative impacts of false decisions. Furthermore, the duplicate detection process becomes almost fully automatic and human effort can be reduced to a large extent. Unfortunately, a full-indeterministic approach is by definition too expensive (in time as well as in storage) and hence impractical. For that reason, we additionally introduce several semi-indeterministic methods for heuristically reducing the set of indeterministically handled decisions in a meaningful way.
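The difference between full-indeterministic and semi-indeterministic handling can be illustrated with a small sketch. It is not the authors' method: the function name `possible_worlds`, the thresholds `t_low` and `t_high`, the use of a pairwise score as a match probability, and the assumption that pair decisions are independent (transitivity is ignored) are all hypothetical choices made for illustration only.

```python
from itertools import product


def possible_worlds(pair_scores, t_low=0.3, t_high=0.8):
    """Enumerate possible worlds for pairwise duplicate decisions.

    pair_scores maps a pair of tuple ids to a match probability in [0, 1].
    Pairs scoring >= t_high are fixed as duplicates, pairs scoring <= t_low
    are fixed as non-duplicates, and all other pairs stay uncertain.
    Thresholds and the probability interpretation are illustrative
    assumptions, not taken from the paper.
    """
    certain = {}
    uncertain = []
    for pair, p in pair_scores.items():
        if p >= t_high:
            certain[pair] = True
        elif p <= t_low:
            certain[pair] = False
        else:
            uncertain.append((pair, p))

    worlds = []
    # Each uncertain pair doubles the number of worlds.
    for outcome in product([True, False], repeat=len(uncertain)):
        decisions = dict(certain)
        prob = 1.0
        for (pair, p), is_dup in zip(uncertain, outcome):
            decisions[pair] = is_dup
            prob *= p if is_dup else 1.0 - p
        worlds.append((decisions, prob))
    return worlds


scores = {("t1", "t2"): 0.95, ("t1", "t3"): 0.6, ("t2", "t4"): 0.1}

# Semi-indeterministic: only the pair scored 0.6 is left undecided.
semi = possible_worlds(scores)

# Full-indeterministic: thresholds outside [0, 1] leave every pair undecided.
full = possible_worlds(scores, t_low=-1.0, t_high=2.0)
```

In this toy setting the number of worlds grows as 2^k in the number k of undecided pairs, which gives an intuition for why the abstract calls the full-indeterministic approach too expensive and why reducing the set of indeterministically handled decisions matters.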
