Clustering of near-duplicate documents

Authors: Joy Thomas, Sauraj Goswami, Vamsi Salaka

Abstract: Documents likely to be near-duplicates are clustered based on document vectors that represent word-occurrence patterns in a relatively low-dimensional space. Edit distance between documents is defined by comparing their document vectors. In one process, initial clusters are formed by applying a first edit-distance constraint relative to a root document of each cluster. The initial clusters can be merged subject to a second edit-distance constraint that limits the maximum edit distance between any two documents in a cluster. The second constraint can be defined such that whether it is satisfied can be determined by comparing cluster structures rather than individual documents.
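
As a rough illustration of the two-stage process the abstract describes, the following Python sketch may help. It is not the patented method: the hashed word-count vector, the L1 distance used as a stand-in for edit distance, the per-cluster `radius` summary, the thresholds `d1` and `d2`, and all function names are assumptions chosen for illustration.

```python
import zlib
from dataclasses import dataclass, field


def doc_vector(text, dim=16):
    """Map a document to a low-dimensional word-occurrence vector (assumed scheme:
    each word is hashed into one of `dim` buckets and the bucket is incremented)."""
    vec = [0] * dim
    for word in text.lower().split():
        vec[zlib.crc32(word.encode("utf-8")) % dim] += 1
    return vec


def vector_distance(u, v):
    """L1 distance between vectors, used here as a proxy for edit distance (assumption)."""
    return sum(abs(a - b) for a, b in zip(u, v))


@dataclass
class Cluster:
    root: list                                   # vector of the root document
    members: list = field(default_factory=list)  # indices of member documents
    radius: int = 0                              # upper bound on root-to-member distance


def initial_clusters(vectors, d1):
    """Stage 1: a document joins the first cluster whose root is within d1,
    otherwise it becomes the root of a new cluster."""
    clusters = []
    for idx, vec in enumerate(vectors):
        for c in clusters:
            dist = vector_distance(c.root, vec)
            if dist <= d1:
                c.members.append(idx)
                c.radius = max(c.radius, dist)
                break
        else:
            clusters.append(Cluster(root=vec, members=[idx]))
    return clusters


def merge_clusters(clusters, d1, d2):
    """Stage 2: merge clusters when root gap plus both radii is at most d2.
    By the triangle inequality this keeps every pair in a merged cluster within d2,
    using only cluster-level data. Requires d2 >= 2 * d1 so that pairs inside an
    initial cluster already satisfy the bound."""
    if d2 < 2 * d1:
        raise ValueError("this sketch assumes d2 >= 2 * d1")
    merged = []
    for c in clusters:
        for m in merged:
            gap = vector_distance(m.root, c.root)
            if gap + m.radius + c.radius <= d2:
                m.members.extend(c.members)
                m.radius = max(m.radius, gap + c.radius)
                break
        else:
            merged.append(Cluster(c.root, list(c.members), c.radius))
    return merged


if __name__ == "__main__":
    docs = [
        "the quick brown fox jumps over the lazy dog",
        "the quick brown fox jumps over the lazy cat",
        "an unrelated and much longer report about quarterly revenue figures "
        "market trends staffing plans and other topics entirely unlike the rest",
    ]
    vectors = [doc_vector(d) for d in docs]
    stage1 = initial_clusters(vectors, d1=2)
    stage2 = merge_clusters(stage1, d1=2, d2=4)
    print([c.members for c in stage2])
```

The merge test in this sketch looks only at each cluster's root vector and radius, which mirrors the abstract's point that the second constraint can be checked on cluster structures rather than on individual documents. The radius bound is conservative, so some clusters that could legally merge may be left apart.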
