Using the Web 1T 5-Gram Database for Attribute Selection in Formal Concept Analysis to Correct Overstemmed Clusters

Authors: Guymon R. Hall, Kazem Taghva

DOI: 10.1109/ITNG.2015.109

Keywords: Formal concept analysis, Feature selection, Computer science, Closure (mathematics), Context (language use), Object (computer science), Database, Binary relation, Word (computer architecture), Cluster analysis, Information retrieval

Abstract: As part of information retrieval processes, words are often stemmed to a common root. The Porter Stemming Algorithm operates as a rule-based suffix-removal process. Stemming can be viewed as a way to cluster related words together according to one common stem. Sometimes Porter includes words in a cluster that are unrelated. This experiment attempts to correct this using Formal Concept Analysis (FCA). FCA is the process of formulating formal concepts from a given formal context. A formal context consists of objects and attributes, and a binary relation that indicates the attributes possessed by each object. A formal concept is formed by computing the closure of subsets of objects and attributes. Using the Cranfield document collection, this experiment crafted a comparison measure between each word in a stemmed cluster using the Google Web 1T 5-gram data set. Using FCA to correct the clusters, the results showed a varying level of success dependent upon the error threshold allowed.
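The FCA definitions in the abstract (a formal context as objects, attributes, and a binary relation, with concepts obtained by closure) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the four words stand in for one stemmed cluster, the attributes (`public`, `election`, `gift`, `donor`) are invented placeholders rather than attributes selected from the Web 1T 5-gram data, and the brute-force enumeration is not the authors' procedure.

```python
from itertools import combinations

# Hypothetical formal context: each object (a word assumed to belong to one
# stemmed cluster) maps to the set of attributes it possesses. The attributes
# are invented for illustration only.
context = {
    "general": {"public", "election"},
    "generally": {"public"},
    "generous": {"gift", "donor"},
    "generously": {"gift"},
}


def common_attributes(objects, context, all_attributes):
    """Return the attributes shared by every object in `objects`."""
    shared = set(all_attributes)  # an empty object set shares all attributes
    for obj in objects:
        shared &= context[obj]
    return shared


def objects_having(attributes, context):
    """Return the objects that possess every attribute in `attributes`."""
    return {obj for obj, attrs in context.items() if attributes <= attrs}


def formal_concepts(context):
    """Enumerate all (extent, intent) pairs by closing every object subset.

    Brute force: exponential in the number of objects, so only suitable
    for small contexts such as a single word cluster.
    """
    all_attributes = set().union(*context.values())
    objects = sorted(context)
    concepts = set()
    for size in range(len(objects) + 1):
        for subset in combinations(objects, size):
            intent = common_attributes(subset, context, all_attributes)
            extent = objects_having(intent, context)  # closure of `subset`
            concepts.add((frozenset(extent), frozenset(intent)))
    return concepts


if __name__ == "__main__":
    for extent, intent in sorted(
        formal_concepts(context), key=lambda c: (len(c[0]), sorted(c[0]))
    ):
        print(sorted(extent), sorted(intent))
```

In this toy context the only concept containing all four words has an empty intent, while the concepts with non-empty intents split the cluster into two groups. That is the kind of structure a cluster correction could exploit, although the paper's actual comparison measure and error threshold are not reproduced here.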
