Author: Nathan O. Schmidt
DOI:
Keywords: Proteome, Machine learning, Biology, Coding (social sciences), Genome, Predictability, Artificial intelligence, k-mer, Histogram, Regular expression, Artificial neural network
Abstract: A regular expression and region-specific filtering system for biological records at the National Center for Biotechnology Information (NCBI) database is integrated into an object-oriented sequence counting application, and a statistical software suite is designed and deployed to interpret the resulting k-mer frequencies, with a priority focus on nullomers. The proteome frequency spectra of ten model organisms and the genome frequency spectra of two bacteria and virus strains, for both coding and non-coding regions, are comparatively scrutinized. We observe that the naturally-evolved (NCBI/organism) and the artificially-biased (randomly-generated) sequences exhibit a clear deviation from the artificially-unbiased histogram distributions. Furthermore, a preliminary assessment of prime predictability is conducted on chronologically ordered NCBI snapshots over an 18-month period using an artificial neural network; three distinct supervised machine learning algorithms are used to train and test on customized data sets in order to forecast future states, revealing that, to a modest degree, it is feasible to make such predictions.
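For readers unfamiliar with the terms, the following is a minimal sketch of what counting k-mer frequencies and identifying nullomers (k-mers that never occur in a sequence) can look like. It is not the author's application: the function names `count_kmers` and `find_nullomers`, the DNA alphabet, and the toy sequence are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product


def count_kmers(sequence, k):
    """Count every overlapping k-mer (substring of length k) in the sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def find_nullomers(sequence, k, alphabet="ACGT"):
    """Return, in sorted order, the k-mers over the alphabet absent from the sequence."""
    observed = count_kmers(sequence, k)
    candidates = ("".join(letters) for letters in product(alphabet, repeat=k))
    return sorted(kmer for kmer in candidates if kmer not in observed)


# Hypothetical toy sequence, not taken from any NCBI record.
toy = "ACGTACGGTA"
counts = count_kmers(toy, 2)        # frequency of each observed 2-mer
nullomers = find_nullomers(toy, 2)  # 2-mers that never appear in toy
```

The frequency table in `counts` is the kind of data a histogram of k-mer frequencies would be built from. Enumerating all candidate k-mers grows exponentially in k, so this brute-force approach only suits small k.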