Authors: HUIQING LIU, LIMSOON WONG
DOI: 10.1142/S0219720003000216
Keywords:
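Abstract: We describe a methodology, as well as some related data mining tools, for analyzing sequence data. The methodology comprises three steps: (a) generating candidate features from the sequences, (b) selecting relevant features from the candidates, and (c) integrating the selected features to build a system to recognize specific properties in sequence data. We also give relevant techniques for each of these three steps. For generating candidate features, we present various types of features based on the idea of k-grams. For selecting relevant features, we discuss signal-to-noise, t-statistics, and entropy measures, as well as a correlation-based feature selection method. For integrating the selected features, we use machine learning methods, including C4.5, SVM, and Naive Bayes. We illustrate this methodology on the problem of recognizing translation initiation sites. We discuss how to generate and select features that are useful for understanding the distinction between ATG sites that are translation initiation sites and those that are not. We also discuss how to use such features to build reliable systems for recognizing translation initiation sites in DNA sequences.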
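The first two steps named in the abstract (k-gram feature generation and signal-to-noise feature selection) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `kgram_counts`, `signal_to_noise`, and `rank_features` are hypothetical, the alphabet is assumed to be A, C, G, T, and the signal-to-noise score is assumed to take the common form of the absolute difference of class means divided by the sum of class standard deviations.

```python
from collections import Counter
from itertools import product
from statistics import mean, pstdev


def kgram_counts(seq, k):
    """Count every overlapping k-gram of seq over the assumed alphabet ACGT."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Report all 4**k possible k-grams so every sequence has the same features.
    return {"".join(g): counts.get("".join(g), 0)
            for g in product("ACGT", repeat=k)}


def signal_to_noise(pos_values, neg_values):
    """Assumed form: |mean_pos - mean_neg| / (std_pos + std_neg)."""
    gap = abs(mean(pos_values) - mean(neg_values))
    spread = pstdev(pos_values) + pstdev(neg_values)
    if spread == 0:
        # No within-class variation: perfectly separating or uninformative.
        return float("inf") if gap > 0 else 0.0
    return gap / spread


def rank_features(pos_seqs, neg_seqs, k):
    """Score each k-gram feature and return (k-gram, score), best first."""
    pos = [kgram_counts(s, k) for s in pos_seqs]
    neg = [kgram_counts(s, k) for s in neg_seqs]
    scores = {g: signal_to_noise([p[g] for p in pos], [n[g] for n in neg])
              for g in pos[0]}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

In use, `pos_seqs` would hold sequence windows around ATG sites that are true translation initiation sites and `neg_seqs` windows around ATG sites that are not. The top-ranked k-grams could then be passed to a classifier such as those the abstract lists for the third step.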