Clustering of near-duplicate documents

Authors: Joy Thomas, Sauraj Goswami, Vamsi Salaka


Abstract: Documents likely to be near-duplicates are clustered based on document vectors that represent word-occurrence patterns in a relatively low-dimensional space. Edit distance between documents is defined based on comparing their document vectors. In one process, initial clusters are formed by applying a first edit-distance constraint relative to a root document of each cluster. The initial clusters can be merged subject to a second edit-distance constraint that limits the maximum edit distance between any two documents in the cluster. The second edit-distance constraint can be defined such that whether it is satisfied can be determined by comparing cluster structures rather than individual documents.
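The two-stage process in the abstract can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the patent: document vectors are short tuples of 0/1 values, edit distance is approximated by Hamming distance between vectors, the "cluster structure" is a root plus a radius, and the names `edit_distance`, `Cluster`, `initial_clusters`, `can_merge`, `merge_clusters`, `d1`, and `d2` are hypothetical. The merge test relies on the triangle inequality, so it is a conservative check, and it assumes `d2 >= 2 * d1`.

```python
from dataclasses import dataclass, field


def edit_distance(a, b):
    """Assumed proxy for edit distance: Hamming distance between vectors."""
    return sum(x != y for x, y in zip(a, b))


@dataclass
class Cluster:
    root: tuple                       # document vector of the root document
    members: list = field(default_factory=list)
    radius: int = 0                   # upper bound on distance from root to any member


def initial_clusters(vectors, d1):
    """Stage 1: put each document in the first cluster whose root is within d1."""
    clusters = []
    for v in vectors:
        for c in clusters:
            d = edit_distance(v, c.root)
            if d <= d1:
                c.members.append(v)
                c.radius = max(c.radius, d)
                break
        else:
            # No root is close enough, so this document starts a new cluster.
            clusters.append(Cluster(root=v, members=[v]))
    return clusters


def can_merge(a, b, d2):
    """Stage 2 test using only cluster structure (roots and radii).

    By the triangle inequality, any document in a and any document in b are
    at most a.radius + d(roots) + b.radius apart.
    """
    return a.radius + edit_distance(a.root, b.root) + b.radius <= d2


def merge_clusters(clusters, d2):
    """Stage 2: greedily merge clusters whose bound on pairwise distance is <= d2."""
    merged = []
    for c in clusters:
        for m in merged:
            if can_merge(m, c, d2):
                root_gap = edit_distance(m.root, c.root)
                m.members.extend(c.members)
                m.radius = max(m.radius, root_gap + c.radius)
                break
        else:
            merged.append(Cluster(c.root, list(c.members), c.radius))
    return merged


docs = [
    (0, 0, 0, 0, 0, 0),
    (1, 0, 0, 0, 0, 0),
    (0, 0, 0, 1, 1, 0),
    (1, 1, 1, 1, 1, 1),
]
first_pass = initial_clusters(docs, d1=1)
final = merge_clusters(first_pass, d2=3)
```

In this sketch the merge decision never compares individual documents across clusters; it reads only the stored root and radius, which is one simple way to realize the idea of checking the second constraint from cluster structures.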


