Authors: Pu Wang, Carlotta Domeniconi
Keywords:
Abstract: Document classification presents difficult challenges due to the sparsity and high dimensionality of text data, and to the complex semantics of natural language. The traditional document representation is a word-based vector (Bag of Words, or BOW), where each dimension is associated with a term of the dictionary containing all the words that appear in the corpus. Although simple and commonly used, this representation has several limitations. It is essential to embed semantic information and conceptual patterns in order to enhance the prediction capabilities of classification algorithms. In this paper, we overcome the shortcomings of the BOW approach by embedding background knowledge derived from Wikipedia into a semantic kernel, which is then used to enrich the representation of documents. Our empirical evaluation with real data sets demonstrates that our approach successfully achieves improved classification accuracy with respect to the BOW technique and to other recently developed methods.
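To make the contrast between a plain BOW comparison and a kernel-enriched one concrete, here is a minimal sketch in Python. It is not the authors' implementation: the function names, the three-term vocabulary, and the proximity values are all hypothetical, and the kernel form k(d1, d2) = (d1 P) . (d2 P) is a generic semantic-kernel construction assumed for illustration. In the paper's setting the term-proximity matrix P would be derived from Wikipedia; here it is filled in by hand.

```python
from collections import Counter


def bow_vector(text, vocabulary):
    """Map a document to term counts, one dimension per vocabulary term."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]


def enrich(vector, proximity):
    """Multiply a BOW row vector by a term-by-term proximity matrix."""
    n = len(vector)
    return [sum(vector[i] * proximity[i][j] for i in range(n)) for j in range(n)]


def dot(u, v):
    """Plain inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def semantic_kernel(d1, d2, proximity):
    """Inner product of the two enriched vectors: (d1 P) . (d2 P)."""
    return dot(enrich(d1, proximity), enrich(d2, proximity))


# Hypothetical vocabulary and proximity values (assumptions, not from the paper).
vocabulary = ["puma", "cougar", "engine"]
proximity = [
    [1.0, 0.8, 0.0],  # "puma" is treated as close to "cougar"
    [0.8, 1.0, 0.0],
    [0.0, 0.0, 1.0],  # "engine" is unrelated to both
]

doc_a = bow_vector("puma puma", vocabulary)
doc_b = bow_vector("cougar", vocabulary)

plain_similarity = dot(doc_a, doc_b)
enriched_similarity = semantic_kernel(doc_a, doc_b, proximity)
```

The two toy documents share no words, so their plain BOW inner product is zero, while the enriched comparison is positive because the proximity matrix links the two terms. This is the kind of limitation of the word-based vector that the abstract refers to.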