Effective Automated Feature Construction and Selection for Classification of Biological Sequences

Authors: Uday Kamath, Kenneth De Jong, Amarda Shehu

DOI: 10.1371/JOURNAL.PONE.0099982

Keywords: Artificial intelligence; Sequence; Biology; Pattern recognition (psychology); Kernel method; Feature (machine learning); Sequence analysis; Machine learning; Construct (python library); Evolutionary algorithm; Bioinformatics; Set (abstract data type)

Abstract: Background: Many open problems in bioinformatics involve elucidating underlying functional signals in biological sequences. DNA sequences, in particular, are characterized by rich architectures in which functional signals are increasingly found to combine local and distal interactions at the nucleotide level. Problems of interest include detection of regulatory regions, splice sites, exons, hypersensitive sites, and more. These problems naturally lend themselves to formulation as classification problems in machine learning. When classification is based on features extracted from the sequences under investigation, success is critically dependent on the chosen set of features. Methodology: We present an algorithmic framework (EFFECT) for automated detection of functional signals in biological sequences. We focus here on classification problems involving DNA sequences, which state-of-the-art work in machine learning shows to be challenging and to involve complex combinations of local and distal features. EFFECT uses a two-stage process to first construct a set of candidate sequence-based features and then select a most effective subset for the classification task at hand. Both stages make heavy use of evolutionary algorithms to efficiently guide the search towards informative features capable of discriminating between sequences that contain a particular functional signal and those that do not. Results: To demonstrate its generality, EFFECT is applied to three separate problems of importance in DNA research: the recognition of hypersensitive sites, splice sites, and ALU sites. Comparisons with state-of-the-art algorithms show that the framework is both general and powerful. In addition, a detailed analysis of the constructed features shows that they contain valuable biological information about DNA architecture, allowing biologists and other researchers to directly inspect the features and potentially use the insights obtained to assist wet-laboratory studies on retainment or modification of a specific signal. Code, documentation, and all data for the applications presented here are provided for the community at http://www.cs.gmu.edu/~ashehu/?q=OurTools.
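
The two-stage idea in the abstract (construct candidate sequence features, then select a discriminating subset with an evolutionary search) can be illustrated with a toy sketch. This is not the EFFECT implementation: stage 1 here simply enumerates k-mer presence features instead of evolving them, and stage 2 is a small bit-mask genetic algorithm. All function names, parameters, the one-rule classifier, and the sequences are hypothetical and chosen only for illustration.

```python
import random
from itertools import product

def construct_features(k=2, alphabet="ACGT"):
    """Stage 1 (simplified stand-in): every k-mer becomes a 'motif present' feature."""
    return ["".join(p) for p in product(alphabet, repeat=k)]

def featurize(seq, motifs):
    """Binary vector: 1 if the motif occurs anywhere in the sequence."""
    return [1 if m in seq else 0 for m in motifs]

def fitness(mask, X, y):
    """Accuracy of a one-rule classifier (positive if any selected motif
    is present), minus a small penalty per selected feature."""
    selected = [i for i, bit in enumerate(mask) if bit]
    if not selected:
        return 0.0
    correct = sum(any(row[i] for i in selected) == bool(label)
                  for row, label in zip(X, y))
    return correct / len(y) - 0.01 * len(selected)

def select_features(X, y, n_features, pop_size=40, generations=30, seed=0):
    """Stage 2 (toy): elitist genetic algorithm over feature bit-masks."""
    rng = random.Random(seed)
    # Seed the population with every single-feature mask, then random masks.
    pop = [[int(i == j) for j in range(n_features)] for i in range(n_features)]
    while len(pop) < pop_size:
        pop.append([rng.randint(0, 1) for _ in range(n_features)])
    for _ in range(generations):
        pop.sort(key=lambda m: fitness(m, X, y), reverse=True)
        parents = pop[: pop_size // 2]          # truncation selection, keeps the best
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_features)  # one-point crossover
            child = a[:cut] + b[cut:]
            child[rng.randrange(n_features)] ^= 1  # single bit-flip mutation
            children.append(child)
        pop = parents + children
    return max(pop, key=lambda m: fitness(m, X, y))

# Made-up data: positives all contain "GT", negatives do not.
positives = ["AAGTCC", "CCGTAA", "TAGTAC", "ACGTCA"]
negatives = ["AACCAA", "CCAATC", "TACCAT", "ATCACA"]
motifs = construct_features(k=2)
X = [featurize(s, motifs) for s in positives + negatives]
y = [1] * len(positives) + [0] * len(negatives)

best = select_features(X, y, len(motifs))
chosen = [m for m, bit in zip(motifs, best) if bit]
print(chosen)
```

On this toy data, "GT" is the only 2-mer present in every positive and absent from every negative, so the selected subset reduces to that single motif. Real problems such as those named in the abstract would need richer feature types and a proper classifier in the fitness function.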
