Structure-based document model with discrete wavelet transforms and its application to document classification

Authors: Supphachai Thaicharoen, Krzysztof J. Cios, Tom Altman

DOI:

Keywords: Document clustering, Representation (mathematics), Vector space model, Computer science, Term (time), Data mining, TREC Genomics, Document classification, Binary classification, Support vector machine

Abstract: Term signal is an existing text representation that depicts a term as a vector of frequencies of occurrences in a number of user-defined partitions of a document. Although it augments the traditional vector space model with patterns of term occurrences, its document division is not coherent with the actual logical structure of the document. In this paper, we propose a novel text representation model, termed Structure-Based Document Model with Discrete Wavelet Transforms (SDMDWT), that exploits structural information of documents and mathematical transforms for document representation. The proposed SDMDWT model enhances the term signal concept by additionally taking into consideration the document's logical structure during document division. We evaluated SDMDWT on two different domains of standard data sets, WebKB 4-Universities and TREC Genomics 2005, using Support Vector Machines for binary classification. The experimental results show that our model demonstrates promising improvements in classification performance over existing models.
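The term signal idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `term_signal` and `haar_dwt` are hypothetical, the partitions are plain equal-width token ranges (the baseline term signal, not the structure-based division that SDMDWT proposes), and the Haar wavelet is an assumption, since the abstract does not say which wavelet is used.

```python
import math
from typing import List


def term_signal(tokens: List[str], term: str, n_parts: int = 8) -> List[int]:
    """Count occurrences of `term` in each of `n_parts` equal-width partitions.

    Assumption: partitions are equal-width token ranges, as in the baseline
    term signal; a structure-based model would use logical sections instead.
    """
    signal = [0] * n_parts
    n = len(tokens)
    for i, tok in enumerate(tokens):
        if tok == term:
            # Map token position i to its partition index in [0, n_parts).
            signal[i * n_parts // n] += 1
    return signal


def haar_dwt(signal: List[float]) -> List[float]:
    """Full orthonormal Haar decomposition (assumed wavelet, for illustration).

    Returns [overall approximation, coarsest detail, ..., finest details].
    The input length must be a power of two.
    """
    n = len(signal)
    if n == 0 or n & (n - 1):
        raise ValueError("signal length must be a power of two")
    out = [float(x) for x in signal]
    length = n
    root2 = math.sqrt(2.0)
    while length > 1:
        half = length // 2
        approx = [(out[2 * i] + out[2 * i + 1]) / root2 for i in range(half)]
        detail = [(out[2 * i] - out[2 * i + 1]) / root2 for i in range(half)]
        # Approximations go first so the next level can transform them again.
        out[:length] = approx + detail
        length = half
    return out


if __name__ == "__main__":
    tokens = "wavelet a b c d e f wavelet".split()
    sig = term_signal(tokens, "wavelet", n_parts=4)
    print(sig)
    print(haar_dwt(sig))
```

The transformed coefficients describe the same term at several resolutions: the first value summarizes how often it occurs overall, and the remaining values capture where in the document the occurrences are concentrated. Such coefficient vectors could then serve as features for a classifier such as a support vector machine.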
