Data mining tools for biological sequences.

Authors: Huiqing Liu, Limsoon Wong

DOI: 10.1142/S0219720003000216

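Abstract: We describe a methodology, as well as some related data mining tools, for analyzing sequence data. The methodology comprises three steps: (a) generating candidate features from the sequences, (b) selecting relevant features from the candidates, and (c) integrating the selected features to build a system to recognize specific properties in sequence data. We also give relevant techniques for each of these three steps. For generating candidate features, we present various types of features based on the idea of k-grams. For selecting relevant features, we discuss signal-to-noise, t-statistics, and entropy measures, as well as a correlation-based feature selection method. For integrating selected features, we use machine learning methods, including C4.5, SVM, and Naive Bayes. We illustrate this methodology on the problem of recognizing translation initiation sites. We discuss how to generate and select features that are useful for understanding the distinction between ATG sites that are translation initiation sites and those that are not. We also discuss how to use such features to build reliable systems for recognizing translation initiation sites in DNA sequences.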
