Authors: Jeffrey Glass, Elizabeth Derr
DOI:
Keywords:
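Abstract: A document similarity detection and classification system is presented. The system employs a case-based method of classifying electronically distributed documents in which the content chunks of an unclassified document are compared to the sets of content chunks comprising each of a set of previously classified sample documents, in order to determine the highest level of resemblance between the unclassified document and any of the previously classified sample documents. The sample documents have been manually reviewed and annotated to distinguish document classifications and to distinguish significant content chunks from insignificant content chunks. These annotations are used in the similarity comparison process. If a significant resemblance level exceeding a predetermined threshold is detected, the classification of the most significantly resembling sample document is assigned to the unclassified document. Sample documents may be acquired to build and maintain a repository of sample documents by detecting unclassified documents that are similar to other unclassified documents and subjecting at least some of the similar documents to a manual review and classification process. In a preferred embodiment, the invention may be used to classify email messages in support of a message filtering or classification objective.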
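
The comparison step described in the abstract can be illustrated with a minimal sketch. This is not the method as specified by the authors: the chunking scheme (overlapping three-word windows), the resemblance score (the fraction of a sample's significant chunks found in the unclassified document), the threshold value of 0.5, and all names (`chunks`, `Sample`, `resemblance`, `classify`) are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


def chunks(text: str, size: int = 3) -> set[str]:
    """Split text into overlapping word n-grams (an assumed chunking scheme)."""
    words = text.lower().split()
    if len(words) < size:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}


@dataclass
class Sample:
    label: str             # classification assigned during manual review
    significant: set[str]  # chunks a reviewer annotated as significant


def resemblance(doc_chunks: set[str], sample: Sample) -> float:
    """Assumed score: share of the sample's significant chunks present in the document."""
    if not sample.significant:
        return 0.0
    return len(doc_chunks & sample.significant) / len(sample.significant)


def classify(text: str, samples: list[Sample], threshold: float = 0.5) -> Optional[str]:
    """Return the label of the most resembling sample, or None if no score exceeds the threshold."""
    doc_chunks = chunks(text)
    best = max(samples, key=lambda s: resemblance(doc_chunks, s), default=None)
    if best is None or resemblance(doc_chunks, best) <= threshold:
        return None
    return best.label


# Hypothetical repository of two annotated samples; here every chunk is treated as significant.
samples = [
    Sample("spam", chunks("win a free prize now")),
    Sample("newsletter", chunks("your weekly project update")),
]

spam_result = classify("Hello friend, win a free prize now by clicking here", samples)
other_result = classify("lunch at noon tomorrow?", samples)
```

In this sketch only the chunks annotated as significant contribute to the score, which mirrors the role the abstract gives to manual annotations. A document that matches no sample above the threshold is left unclassified (`None`); per the abstract, such documents could then be grouped by mutual similarity and routed to manual review to grow the sample repository.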