Fast Kernels for Inexact String Matching

作者: Christina Leslie , Rui Kuang

DOI: 10.1007/978-3-540-45167-9_10

关键词:

摘要: We introduce several new families of string kernels designed in particular for use with support vector machines (SVMs) classification protein sequence data. These – restricted gappy kernels, substitution and wildcard are based on feature spaces indexed by k-length subsequences from the alphabet Σ (or augmented a character), hence they related to recently presented (k,m)-mismatch kernel used text classification. However, all we define here, value K(x,y) can be computed O(c K (|x| + |y|)) time, where constant c depends parameters but is independent size |Σ| alphabet. Thus computation these linear length sequences, like mismatch kernel, improve upon parameter-dependent \(c_K = k^{m+1} |\Sigma|^m\) kernel. compute efficiently using recursive function trie data structure relate our described transducer formalism. Finally, report experiments benchmark SCOP dataset, show that faster achieve SVM performance comparable Fisher derived profile hidden Markov models.

参考文章(13)
Text classification using string kernels Journal of Machine Learning Research. ,vol. 2, pp. 419- 444 ,(2002) , 10.1162/153244302760200687
D. Haussler, Convolution kernels on discrete structures Tech. Rep.. ,(1999)
Joel Cracraft, Michael M. Miyamoto, Phylogenetic Analysis of DNA Sequences ,(1991)
CHRISTINA LESLIE, ELEAZAR ESKIN, WILLIAM STAFFORD NOBLE, The spectrum kernel: a string kernel for SVM protein classification. pacific symposium on biocomputing. pp. 564- 575 ,(2001) , 10.1142/9789812799623_0053
S Altschula, Warren Gisha, Webb Millerb, E Meyersc, D Lipmana, None, Basic Local Alignment Search Tool Journal of Molecular Biology. ,vol. 215, pp. 403- 410 ,(1990) , 10.1016/S0022-2836(05)80360-2
Michael Gribskov, Nina L. Robinson, Use of receiver operating characteristic (ROC) analysis to evaluate sequence matching Computational Biology and Chemistry. ,vol. 20, pp. 25- 33 ,(1996) , 10.1016/S0097-8485(96)80004-0
Eleazar Eskin, Christina S. Leslie, Jason Weston, William S. Noble, Mismatch String Kernels for SVM Protein Classification neural information processing systems. ,vol. 15, pp. 1441- 1448 ,(2002)
Tommi Jaakkola, Mark Diekhans, David Haussler, Using the Fisher Kernel Method to Detect Remote Protein Homologies intelligent systems in molecular biology. pp. 149- 158 ,(1999)
Alex J. Smola, S.v.n. Vishwanathan, Fast Kernels for String and Tree Matching neural information processing systems. ,vol. 15, pp. 585- 592 ,(2002)
S. Henikoff, J. G. Henikoff, Amino acid substitution matrices from protein blocks Proceedings of the National Academy of Sciences of the United States of America. ,vol. 89, pp. 10915- 10919 ,(1992) , 10.1073/PNAS.89.22.10915