Spectral label refinement for noisy and missing text labels

Authors: Yangqiu Song, Qiang Yang, Hailong Sun, Chenguang Wang, Ming Zhang

Keywords: Smoothness (probability theory), Pattern recognition, Artificial intelligence, Automatic label placement, Similarity (geometry), Computer science, Content (measure theory), Measure (data warehouse), Consistency (database systems)

Abstract: With the recent growth of online content on the Web, there has been more user-generated data with noisy and missing labels, e.g., social tags and voted labels from Amazon's Mechanical Turk. Most machine learning methods, which require accurate label sets, cannot be trusted when the label sets are unreliable. In this paper, we provide a text label refinement algorithm to adjust the labels for such noisy and missing labeled datasets. We assume that the labels can be refined based on the labels with certain confidence, and on the similarity between data being consistent with the labels. We propose a label smoothness ratio criterion to measure the smoothness of the labels and the consistency between labels and data. We demonstrate the effectiveness of the label refining algorithm on eight document datasets, and validate that the results are useful for generating better labels.
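
The abstract's idea that labels should be consistent with similarity between data can be illustrated with a small sketch. The record does not define the paper's smoothness ratio criterion, so the score below is a stand-in assumption rather than the authors' method: it reports the share of pairwise cosine-similarity weight that falls on document pairs with different labels. The function names (`cosine`, `similarity_graph`, `label_cut_ratio`) and the toy term-count vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length term-count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def similarity_graph(docs):
    """Dense symmetric weight matrix with a zero diagonal (no self-loops)."""
    n = len(docs)
    return [
        [cosine(docs[i], docs[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]


def label_cut_ratio(weights, labels):
    """Fraction of total edge weight on pairs whose labels disagree.

    Illustrative stand-in, NOT the paper's criterion:
    0.0 means similar documents always share a label; larger is less smooth.
    """
    cut = 0.0
    total = 0.0
    n = len(labels)
    for i in range(n):
        for j in range(i + 1, n):
            total += weights[i][j]
            if labels[i] != labels[j]:
                cut += weights[i][j]
    return cut / total if total else 0.0


# Toy data (hypothetical): two documents per topic, disjoint vocabularies.
docs = [[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 1, 2], [0, 0, 2, 1]]
weights = similarity_graph(docs)

clean_labels = [0, 0, 1, 1]
noisy_labels = [0, 1, 1, 1]  # the second document is mislabeled

print(label_cut_ratio(weights, clean_labels))
print(label_cut_ratio(weights, noisy_labels))
```

On this toy data the mislabeled assignment scores higher than the clean one, which is the kind of signal a refinement procedure could use to flag labels worth revisiting.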
