Authors: Neil Cooke, Lee Gillam
DOI: 10.1007/978-1-4471-2236-4_16
Keywords:
Abstract: In this paper we elaborate a near-duplicate and plagiarism detection service that combines both Crowd and Cloud computing in searching for and evaluating matching documents. We believe our approach could be used across collaborating or competing Enterprises, or against the web, without any Enterprise needing to reveal the contents of its corporate (confidential) documents. The service involves a novel document fingerprinting approach which derives grammatical patterns but does not require grammatical knowledge or rely on hash-based approaches. Our approach generates a lossy, highly compressed document signature from which it is possible to generate fixed-length patterns as fingerprints or shingles. Fingerprint sizes are established by estimating the likely random hit rates resulting from the size of the pattern and the size of the target search. The service is geared towards enabling the detection of Clowns, those who may attempt to leak, or have leaked, confidential or sensitive information, or have otherwise plagiarized, and to provide a copy of the original information. Crowds validate the results emerging from the systematic evaluation of the service, ensuring that modifications continue to act effectively under continuous scaling-up. We discuss the formulation of the service and assess its efficacy with reference to an international benchmarking competition, where our system achieves top 5 performance (Precision = 0.96, Recall = 0.39).