Authors: Alberto Lavelli, Fabrizio Sebastiani, Roberto Zanoli
Keywords: Compound term processing, Natural language, Categorization, Artificial intelligence, Representation (mathematics), Cluster analysis, Information retrieval, Natural language processing, Computational linguistics, Computer science, Noun phrase, Thesaurus (information retrieval), Term (time), Index term
Abstract: A number of content management tasks, including term categorization, term clustering, and automated thesaurus generation, view natural language terms (e.g. words, noun phrases) as first-class objects, i.e. as objects endowed with an internal representation which makes them suitable for explicit manipulation by the corresponding algorithms. The information retrieval (IR) literature has traditionally used an extensional (aka distributional) representation for terms, according to which a term is represented by the "bag of documents" in which it occurs. The computational linguistics (CL) literature has independently developed an alternative distributional representation for terms, according to which a term is represented by the "bag of terms" that co-occur with it in some document. This paper aims at discovering which of the two representations is most effective, i.e. brings about higher effectiveness once used in tasks that require terms to be explicitly represented and manipulated. We carry out experiments on (i) a term categorization task and (ii) a term clustering task; this allows us to compare the two different representations in closely controlled experimental conditions. We report the results of experiments in which we categorize/cluster under 42 classes the terms extracted from a corpus of more than 65,000 documents. Our results show a substantial difference in effectiveness between the two representation styles; we give both an intuitive explanation and an information-theoretic justification for these behaviours.
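
The two distributional representations contrasted in the abstract can be illustrated with a minimal Python sketch. It is an illustration only, not the authors' implementation: the function names `bag_of_documents` and `bag_of_terms`, the toy corpus, and the use of raw counts as weights are all assumptions, since the abstract does not say how the vectors are weighted.

```python
from collections import Counter, defaultdict

def bag_of_documents(corpus):
    """IR-style representation (hypothetical helper): each term maps to a
    Counter {doc_id: number of occurrences of the term in that document}."""
    rep = defaultdict(Counter)
    for doc_id, tokens in enumerate(corpus):
        for tok in tokens:
            rep[tok][doc_id] += 1
    return dict(rep)

def bag_of_terms(corpus):
    """CL-style representation (hypothetical helper): each term maps to a
    Counter {other_term: number of documents in which both terms occur}."""
    rep = defaultdict(Counter)
    for tokens in corpus:
        vocab = set(tokens)          # count each co-occurrence once per document
        for t in vocab:
            bag = rep[t]             # ensures every term gets an entry
            for u in vocab:
                if u != t:
                    bag[u] += 1
    return dict(rep)

# Toy corpus (assumed data): each document is a list of tokens.
corpus = [
    ["stock", "market", "price"],
    ["stock", "price", "index"],
    ["football", "match", "score"],
]

docs_rep = bag_of_documents(corpus)
terms_rep = bag_of_terms(corpus)

print(docs_rep["stock"])    # documents in which "stock" occurs
print(terms_rep["stock"])   # terms that co-occur with "stock"
```

In this sketch "stock" is described either by the documents that contain it or by the terms that share a document with it. Either dictionary of counts could then be turned into a vector and passed to a categorization or clustering algorithm of the kind the abstract mentions.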