Adding Semantics to Email Clustering

Authors: Hua Li, Dou Shen, Benyu Zhang, Zheng Chen, Qiang Yang

DOI: 10.1109/ICDM.2006.16

Abstract: This paper presents a novel algorithm to cluster emails according to their contents and the sentence styles of their subject lines. In our algorithm, natural language processing techniques and frequent itemset mining techniques are utilized to automatically generate meaningful generalized sentence patterns (GSPs) from the subjects of emails. Then we put forward a novel unsupervised approach which treats GSPs as pseudo class labels and conducts email clustering in a supervised manner, although no human labeling is involved. Our proposed algorithm is not only expected to improve the clustering performance, it can also provide meaningful descriptions of the resulting clusters by the GSPs. Experimental results on an open dataset (the Enron email dataset) and a personal email dataset collected by ourselves demonstrate that the proposed algorithm outperforms the K-means algorithm in terms of the popular measurement F1. Furthermore, the cluster naming readability is improved by 68.5% on the personal email dataset.
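The idea of using mined subject-line patterns as pseudo class labels can be illustrated with a small sketch. This is not the authors' implementation: it skips the natural language processing step that generalizes subject lines into GSPs and works on raw lowercase words instead, and it stops at assigning pseudo labels rather than training a classifier on them. The function names, the `min_support` and `max_size` parameters, and the sample subjects are all hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def mine_frequent_itemsets(subjects, min_support=2, max_size=2):
    """Count word sets (up to max_size words) and keep those that occur
    in at least min_support subject lines. Brute-force enumeration,
    only suitable for tiny illustrative inputs."""
    counts = Counter()
    for subject in subjects:
        tokens = sorted(set(subject.lower().split()))
        for size in range(1, max_size + 1):
            for combo in combinations(tokens, size):
                counts[frozenset(combo)] += 1
    return {items: n for items, n in counts.items() if n >= min_support}


def assign_pseudo_labels(subjects, patterns):
    """Give each subject the largest matching pattern, breaking ties by
    support and then alphabetically. Subjects with no match get None."""
    labels = []
    for subject in subjects:
        tokens = set(subject.lower().split())
        matches = [p for p in patterns if p <= tokens]
        best = max(
            matches,
            key=lambda p: (len(p), patterns[p], sorted(p)),
            default=None,
        )
        labels.append(best)
    return labels


# Hypothetical subject lines, not taken from the Enron dataset.
subjects = [
    "meeting schedule monday",
    "meeting schedule friday",
    "lunch invitation",
    "invitation lunch today",
    "quarterly report",
]

patterns = mine_frequent_itemsets(subjects)
labels = assign_pseudo_labels(subjects, patterns)

# Group emails by pseudo label; each group is a named starting cluster.
clusters = defaultdict(list)
for subject, label in zip(subjects, labels):
    clusters[label].append(subject)
```

In this sketch the pattern itself doubles as a readable cluster name, which mirrors the abstract's point that GSPs describe the resulting clusters. Emails left with a `None` label would, under the approach the abstract describes, still need to be placed by a model trained on the pseudo-labeled ones.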

References (5)
Nicolas Pasquier, Yves Bastide, Rafik Taouil, Lotfi Lakhal. Discovering Frequent Closed Itemsets for Association Rules. International Conference on Database Theory, vol. 1540, pp. 398-416 (1999). DOI: 10.1007/3-540-49257-7_25
Gilles Celeux, Gérard Govaert. Comparison of the mixture and the classification maximum likelihood in cluster analysis. Journal of Statistical Computation and Simulation, vol. 47, pp. 127-146 (1993). DOI: 10.1080/00949659308811525
Kamal Nigam, Andrew Kachites McCallum, Sebastian Thrun, Tom Mitchell. Text Classification from Labeled and Unlabeled Documents using EM. Machine Learning, vol. 39, pp. 103-134 (2000). DOI: 10.1023/A:1007692713085
Jianyong Wang, Jiawei Han, Jian Pei. CLOSET+: searching for the best strategies for mining frequent closed itemsets. Knowledge Discovery and Data Mining, pp. 236-245 (2003). DOI: 10.1145/956750.956779