Authors: Hua Li, Dou Shen, Benyu Zhang, Zheng Chen, Qiang Yang
DOI: 10.1109/ICDM.2006.16
Keywords:
Abstract: This paper presents a novel algorithm to cluster emails according to their contents and the sentence styles of their subject lines. In our algorithm, natural language processing techniques and frequent itemset mining are utilized to automatically generate meaningful generalized sentence patterns (GSPs) from the subjects of emails. Then we put forward an unsupervised approach which treats GSPs as pseudo class labels and conducts email clustering in a supervised manner, although no human labeling is involved. Our proposed algorithm is not only expected to improve clustering performance, it can also provide descriptions of the resulting clusters by means of the GSPs. Experimental results on an open dataset (the Enron dataset) and a personal email dataset collected by ourselves demonstrate that the proposed algorithm outperforms K-means in terms of the popular measurement F1. Furthermore, cluster naming readability is improved by 68.5% on the personal email dataset.