Distributional term representations: an experimental comparison

Authors: Alberto Lavelli, Fabrizio Sebastiani, Roberto Zanoli

DOI: 10.1145/1031171.1031284

Keywords: Compound term processing, Natural language, Categorization, Artificial intelligence, Representation (mathematics), Cluster analysis, Information retrieval, Natural language processing, Computational linguistics, Computer science, Noun phrase, Thesaurus (information retrieval), Term (time), Index term

Abstract: A number of content management tasks, including term categorization, term clustering, and automated thesaurus generation, view natural language terms (e.g. words, noun phrases) as first-class objects, i.e. as objects endowed with an internal representation which makes them suitable for explicit manipulation by the corresponding algorithms. The information retrieval (IR) literature has traditionally used an extensional (aka distributional) representation for terms, according to which a term is represented by the "bag of documents" in which the term occurs. The computational linguistics (CL) literature has independently developed an alternative distributional representation for terms, according to which a term is represented by the "bag of terms" that co-occur with it in some document. This paper aims at discovering which of the two representations is most effective, i.e. brings about higher effectiveness once used in tasks that require terms to be explicitly manipulated. We carry out experiments on (i) a term categorization task, and (ii) a term clustering task; this allows us to compare the two different representations in closely controlled experimental conditions. We report the results of experiments in which we categorize/cluster under 42 different classes the terms extracted from a corpus of more than 65,000 documents. Our results show a substantial difference in effectiveness between the two representation styles; we give both an intuitive explanation and an information-theoretic justification for these different behaviours.
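The two representations contrasted in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy corpus, and the use of raw counts as weights are all assumptions made for illustration, and the paper's actual weighting scheme is not described in this record. The first function builds a "bag of documents" vector for each term, and the second builds a "bag of terms" vector from document-level co-occurrence.

```python
from collections import Counter, defaultdict


def document_occurrence_vectors(docs):
    """'Bag of documents': map each term to {doc_index: occurrence count}.

    Hypothetical helper; raw counts stand in for whatever weighting is used.
    """
    vectors = defaultdict(Counter)
    for doc_id, tokens in enumerate(docs):
        for term in tokens:
            vectors[term][doc_id] += 1
    return dict(vectors)


def term_cooccurrence_vectors(docs):
    """'Bag of terms': map each term to {other_term: number of shared documents}.

    Hypothetical helper; two terms co-occur if they appear in the same document.
    """
    vectors = defaultdict(Counter)
    for tokens in docs:
        vocab = set(tokens)  # count each document at most once per term pair
        for term in vocab:
            for other in vocab:
                if other != term:
                    vectors[term][other] += 1
    return dict(vectors)


# Toy corpus (invented for illustration): each document is a list of tokens.
docs = [
    ["bank", "loan", "interest"],
    ["bank", "river", "water"],
    ["loan", "interest", "rate"],
]

by_documents = document_occurrence_vectors(docs)
by_terms = term_cooccurrence_vectors(docs)
```

In this sketch a term's "bag of documents" vector has one dimension per document, while its "bag of terms" vector has one dimension per vocabulary term. Either kind of vector could then be passed to a classifier or clustering algorithm, which is the setting the abstract compares.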
