Biological Sequence Data Mining

Author: Yuh-Jyh Hu

DOI: 10.1007/3-540-44794-6_19

Keywords:

Abstract: Biologists have determined that the control and regulation of gene expression is primarily determined by relatively short sequences in the region surrounding a gene. These sequences vary in length, position, redundancy, orientation, and bases. Finding these short sequences is a fundamental problem in molecular biology with important applications. Though there exist many different approaches to signal/motif (i.e. short sequence) finding, in 2000 Pevzner and Sze reported that most current motif-finding algorithms are incapable of detecting the target signals in their so-called Challenge Problem. In this paper, we show that, using an iterative-restart design, our new algorithm can correctly find the targets. Furthermore, taking into account the fact that some transcription factors form a dimer or even more complex structures, and that the process sometimes involves multiple factors, we extend the original problem to a more challenging one. We address the issue of combinatorial signals with gaps of variable lengths. To demonstrate the efficacy of our algorithm, we tested it on a series of challenge problems and compared it with representative motif-finding algorithms. In addition, to verify its feasibility in real-world applications, we also tested it on several regulatory families of yeast genes with known motifs. The purpose of this paper is two-fold. One is to introduce an improved biological data mining algorithm capable of dealing with DNA sequences. The other is to propose a research direction for the general KDD community.
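For readers unfamiliar with the Challenge Problem mentioned in the abstract, the following is a minimal sketch of how one such test instance might be generated. It is not the paper's algorithm. The default parameters (20 sequences of length 600, a motif of length 15 planted with exactly 4 substitutions per copy) follow the formulation in Pevzner and Sze (2000) rather than anything stated in this abstract, and the function names `mutate`, `planted_instance`, and `hamming` are hypothetical.

```python
import random

ALPHABET = "ACGT"

def mutate(motif, d, rng):
    """Return a copy of motif with exactly d positions changed to a different base."""
    chars = list(motif)
    for i in rng.sample(range(len(motif)), d):
        chars[i] = rng.choice([b for b in ALPHABET if b != chars[i]])
    return "".join(chars)

def planted_instance(n_seqs=20, seq_len=600, motif_len=15, d=4, seed=0):
    """Build random DNA sequences, each hiding one mutated copy of a random motif.

    Defaults are assumed from Pevzner and Sze (2000), not from this abstract.
    """
    rng = random.Random(seed)
    motif = "".join(rng.choice(ALPHABET) for _ in range(motif_len))
    seqs, positions = [], []
    for _ in range(n_seqs):
        background = [rng.choice(ALPHABET) for _ in range(seq_len)]
        pos = rng.randrange(seq_len - motif_len + 1)
        # Overwrite a window of the background with a mutated motif copy.
        background[pos:pos + motif_len] = mutate(motif, d, rng)
        seqs.append("".join(background))
        positions.append(pos)
    return motif, seqs, positions

def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))
```

A motif finder would receive only `seqs` and would be scored on how well it recovers `motif` or `positions`. Because every planted copy differs from the motif in four positions, any two copies may differ in up to eight, which is why the signal is hard to detect by pairwise comparison.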

References (17)
Pavel A. Pevzner, Sing-Hoi Sze. Combinatorial Approaches to Finding Subtle Signals in DNA Sequences. Intelligent Systems in Molecular Biology, vol. 8, pp. 269-278 (2000).
Yuh-Jyh Hu, Dennis F. Kibler, Suzanne B. Sandmeyer. Detecting Motifs from Sequences. International Conference on Machine Learning, pp. 181-190 (1999).
Martin Tompa, Saurabh Sinha. A Statistical Method for Finding Transcription Factor Binding Sites. Intelligent Systems in Molecular Biology, vol. 8, pp. 344-354 (2000).
C. Lawrence, S. Altschul, M. Boguski, J. Liu, A. Neuwald, J. Wootton. Detecting Subtle Sequence Signals: A Gibbs Sampling Strategy for Multiple Alignment. Science, vol. 262, pp. 208-214 (1993). DOI: 10.1126/SCIENCE.8211139
Emily Rocke, Martin Tompa. An Algorithm for Finding Novel Gapped Motifs in DNA Sequences. Research in Computational Molecular Biology, pp. 228-233 (1998). DOI: 10.1145/279069.279119
J. van Helden, B. André, J. Collado-Vides. Extracting Regulatory Sites from the Upstream Region of Yeast Genes by Computational Analysis of Oligonucleotide Frequencies. Journal of Molecular Biology, vol. 281, pp. 827-842 (1998). DOI: 10.1006/JMBI.1998.1947
Timothy L. Bailey, Charles Elkan. Unsupervised Learning of Multiple Motifs in Biopolymers Using Expectation Maximization. Machine Learning, vol. 21, pp. 51-80 (1995). DOI: 10.1007/BF00993379
Lisa Wodicka, Helin Dong, Michael Mittmann, Ming-Hsiu Ho, David J. Lockhart. Genome-wide Expression Monitoring in Saccharomyces cerevisiae. Nature Biotechnology, vol. 15, pp. 1359-1367 (1997). DOI: 10.1038/NBT1297-1359
Ming Li, Bin Ma, Lusheng Wang. Finding Similar Regions in Many Strings. Proceedings of the Thirty-First Annual ACM Symposium on Theory of Computing (STOC '99), pp. 473-482 (1999). DOI: 10.1145/301250.301376