Abstract: Given two input collections of sets, a set-similarity join (SSJoin) identifies all pairs of sets, one from each collection, that have high similarity. Recent work has identified SSJoin as a useful primitive operator in data cleaning. In this paper, we propose new algorithms for SSJoin. Our algorithms have two important features: they are exact, i.e., they always produce the correct answer, and they carry precise performance guarantees. We believe our algorithms are the first to have both features; previous algorithms with performance guarantees are only probabilistically approximate. We demonstrate the effectiveness of our algorithms using a thorough experimental evaluation over real-life and synthetic data sets.
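
To make the SSJoin definition concrete, the following is a minimal brute-force sketch, not one of the algorithms proposed in the paper. It assumes Jaccard similarity as the similarity measure and a user-chosen threshold; neither is fixed by the abstract, and the names `jaccard` and `naive_ssjoin` are hypothetical. The sketch is exact but compares every cross pair, which is the quadratic cost that dedicated SSJoin algorithms aim to avoid.

```python
from itertools import product


def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity |a & b| / |a | b| (assumed measure, not fixed by the abstract)."""
    union = a | b
    if not union:
        # Convention chosen here: two empty sets are treated as identical.
        return 1.0
    return len(a & b) / len(union)


def naive_ssjoin(left, right, threshold):
    """Return all index pairs (i, j) with jaccard(left[i], right[j]) >= threshold.

    Exact by construction, but it examines len(left) * len(right) pairs.
    """
    return [
        (i, j)
        for (i, r), (j, s) in product(enumerate(left), enumerate(right))
        if jaccard(r, s) >= threshold
    ]


# Illustrative toy data: sets of characters standing in for token sets.
left = [frozenset("abcd"), frozenset("xyz")]
right = [frozenset("abce"), frozenset("xyz"), frozenset("pq")]
pairs = naive_ssjoin(left, right, threshold=0.5)
print(pairs)  # [(0, 0), (1, 1)]
```

Here `{a,b,c,d}` and `{a,b,c,e}` share 3 of 5 distinct elements (similarity 0.6), and the two `{x,y,z}` sets are identical (similarity 1.0), so both pairs pass the 0.5 threshold while all other pairs are disjoint.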