Authors: Joy Thomas, Sauraj Goswami, Vamsi Salaka
DOI:
Keywords:
Abstract: Documents likely to be near-duplicates are clustered based on document vectors that represent word-occurrence patterns in a relatively low-dimensional space. Edit distance between documents is defined by comparing their document vectors. In one process, initial clusters are formed by applying a first edit-distance constraint relative to a root document of each cluster. The initial clusters can then be merged subject to a second edit-distance constraint that limits the maximum edit distance between any two documents in a cluster. Whether the second constraint is satisfied can be determined by comparing cluster structures rather than individual documents.
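
The two-stage process described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' method: the abstract does not specify how document vectors are built or how edit distance is computed from them. The sketch assumes hashed word-count vectors (`doc_vector`, with a hypothetical dimension `DIM`) and uses the L1 distance between vectors as a stand-in for edit distance (`vector_distance`). All names, including the thresholds `d1` and `d2`, are illustrative. Because L1 distance obeys the triangle inequality, the merge test can bound the largest pairwise distance from each cluster's root and radius alone, without comparing individual documents.

```python
import zlib
from dataclasses import dataclass, field

DIM = 64  # hypothetical size of the low-dimensional vector space


def doc_vector(text, dim=DIM):
    """Assumed representation: word counts hashed into `dim` buckets."""
    vec = [0] * dim
    for word in text.lower().split():
        vec[zlib.crc32(word.encode("utf-8")) % dim] += 1
    return vec


def vector_distance(u, v):
    """L1 distance between vectors, used here as a proxy for edit distance."""
    return sum(abs(a - b) for a, b in zip(u, v))


@dataclass
class Cluster:
    root: list                                   # vector of the root document
    members: list = field(default_factory=list)  # indices of member documents
    radius: int = 0    # upper bound on distance from root to any member
    diameter: int = 0  # upper bound on distance between any two members


def initial_clusters(vectors, d1):
    """First constraint: a document joins a cluster if it is within d1 of its root."""
    clusters = []
    for i, vec in enumerate(vectors):
        for c in clusters:
            dist = vector_distance(c.root, vec)
            if dist <= d1:
                c.members.append(i)
                c.radius = max(c.radius, dist)
                c.diameter = 2 * c.radius  # triangle inequality through the root
                break
        else:
            clusters.append(Cluster(root=vec, members=[i]))
    return clusters


def merged_diameter(a, b):
    """Bound the merged cluster's diameter from cluster structure only."""
    cross = a.radius + vector_distance(a.root, b.root) + b.radius
    return max(a.diameter, b.diameter, cross)


def merge_clusters(clusters, d2):
    """Second constraint: merge only if the diameter bound stays within d2."""
    merged = []
    for c in clusters:
        for m in merged:
            if merged_diameter(m, c) <= d2:
                m.diameter = merged_diameter(m, c)
                m.radius = max(m.radius, vector_distance(m.root, c.root) + c.radius)
                m.members.extend(c.members)
                break
        else:
            merged.append(Cluster(c.root, list(c.members), c.radius, c.diameter))
    return merged
```

A typical call would build `vectors = [doc_vector(t) for t in texts]`, then run `merge_clusters(initial_clusters(vectors, d1), d2)`. The bound in `merged_diameter` is conservative: it may refuse a merge that a full pairwise comparison would allow, but it never admits a cluster that violates `d2` under the assumed distance.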