On the K-Mer Frequency Spectra of Organism Genome and Proteome Sequences with a Preliminary Machine Learning Assessment of Prime Predictability

Author: Nathan O. Schmidt


Keywords: Proteome; Machine learning; Biology; Coding (social sciences); Genome; Predictability; Artificial intelligence; k-mer; Histogram; Regular expression; Artificial neural network

Abstract: A regular expression and region-specific filtering system for biological records at the National Center for Biotechnology Information (NCBI) database is integrated into an object-oriented sequence counting application, and a statistical software suite is designed and deployed to interpret the resulting k-mer frequencies, with a priority focus on nullomers. The proteome k-mer frequency spectra of ten model organisms and the genome k-mer frequency spectra of two bacteria and virus strains, for both coding and non-coding regions, are comparatively scrutinized. We observe that the naturally-evolved (NCBI/organism) and the artificially-biased (randomly-generated) sequences exhibit a clear deviation from the artificially-unbiased histogram distributions. Furthermore, a preliminary assessment of prime predictability is conducted on chronologically ordered NCBI snapshots over an 18-month period using an artificial neural network; three distinct supervised machine learning algorithms are used to train and test on customized data sets to forecast future states, revealing that, to a modest degree, it is feasible to make such predictions.
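
As an illustration of the quantities named in the abstract, the following minimal Python sketch counts overlapping k-mers, lists nullomers (k-mers over the alphabet that never occur in the sequence), and builds a frequency spectrum (a histogram of how many distinct k-mers occur a given number of times). It is not the author's application: the function names, the default DNA alphabet "ACGT", and the toy input are assumptions made for illustration only.

```python
from collections import Counter
from itertools import product


def kmer_counts(sequence, k):
    """Count every overlapping k-mer of length k in the sequence."""
    sequence = sequence.upper()
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def nullomers(sequence, k, alphabet="ACGT"):
    """Return the sorted k-mers over the alphabet that are absent from the sequence.

    The alphabet default is an assumption; pass the 20 amino-acid letters
    for a proteome sequence.
    """
    counts = kmer_counts(sequence, k)
    all_kmers = ("".join(p) for p in product(alphabet, repeat=k))
    return sorted(kmer for kmer in all_kmers if kmer not in counts)


def frequency_spectrum(sequence, k, alphabet="ACGT"):
    """Map each occurrence count to the number of distinct k-mers with that count.

    Nullomers appear in the spectrum under the key 0.
    """
    counts = kmer_counts(sequence, k)
    spectrum = Counter(
        counts.get("".join(p), 0) for p in product(alphabet, repeat=k)
    )
    return dict(sorted(spectrum.items()))


if __name__ == "__main__":
    toy = "ACGTACGT"  # hypothetical toy sequence, not NCBI data
    print(kmer_counts(toy, 2))
    print(nullomers(toy, 2))
    print(frequency_spectrum(toy, 2))
```

Enumerating all k-mers with `product` costs time and memory proportional to the alphabet size raised to the power k, so this sketch is only practical for small k; a real counting application would need a more compact representation.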
