Authors: David R. Kelley, Jasper Snoek, John L. Rinn
Keywords: Genome Biology, Computational biology, Convolutional neural network, Biology, Sequence analysis, ENCODE, DNA sequencing, Genome, Genetics, Epigenomics, Artificial neural network
Abstract: The process of identifying genomic sites that show statistical relationships to phenotypes holds great promise for human health and disease (Hindorff et al. 2009). However, our current inability to efficiently interpret noncoding variants impedes progress toward using personal genomes in medicine. Coordinated efforts to survey the noncoding genome have shown that sequences marked by DNA accessibility and certain histone modifications are enriched for variants that are statistically related to phenotypes (The ENCODE Project Consortium 2012; Roadmap Epigenomics Consortium et al. 2015). The first stages of a mechanistic hypothesis can now be assigned to variants that directly overlap these annotations (Fu et al. 2014; Kircher et al. 2014; Ritchie et al. 2014). However, simply considering the overlap of a variant with annotations underutilizes these data; more can be extracted by understanding the DNA–protein interactions as a function of the underlying sequence. Proteins that recognize specific signals in the DNA influence its accessibility and histone modifications (Voss and Hager 2014). Given training data, models parameterized by machine learning can effectively predict protein binding, DNA accessibility, histone modifications, and DNA methylation from the sequence (Das et al. 2006; Arnold et al. 2013; Benveniste et al. 2014; Pinello et al. 2014; Lee et al. 2015; Setty and Leslie 2015; Whitaker et al. 2015). A trained model can then annotate the influence of every nucleotide (and variant) on these regulatory attributes. This upgrades previous approaches in two ways. First, variants can be studied at a finer resolution; researchers can prioritize variants predicted to drive regulatory activity and devalue those predicted to be irrelevant bystanders. Second, rare variants that introduce a gain of function will often not lie in regions annotated in publicly available data. An accurate model of regulatory activity can predict the gain of function, allowing follow-up consideration of the site. In recent years, artificial neural networks with many stacked layers have achieved breakthrough advances on benchmark data sets in image analysis (Krizhevsky et al. 2012) and natural language processing (Collobert et al. 2011). Rather than choose features manually or in a preprocessing step, convolutional neural networks (CNNs) adaptively learn them from the data during training. They apply nonlinear transformations to map input data to informative high-dimensional representations that trivialize classification or regression (Bengio et al. 2013).
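To make the idea of a convolutional filter over DNA concrete, here is a minimal sketch in plain Python. It is not taken from the paper: the helper names (`one_hot`, `conv1d_relu`) and the filter weights are hypothetical, and the weights are set by hand, whereas a real CNN would learn them from the data during training.

```python
# Toy illustration only: one-hot encoding plus a single convolutional filter.
# The filter weights are hand-set here; a real CNN learns them during training.

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a list of 4-element rows (A, C, G, T)."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]

def conv1d_relu(encoded, filt):
    """Slide a (width x 4) filter along the sequence and apply ReLU."""
    width = len(filt)
    out = []
    for start in range(len(encoded) - width + 1):
        score = sum(
            encoded[start + i][j] * filt[i][j]
            for i in range(width)
            for j in range(4)
        )
        out.append(max(0.0, score))  # ReLU nonlinearity
    return out

# Hypothetical filter that responds to the 3-mer "TGA" (weights are made up).
tga_filter = [
    [-1.0, -1.0, -1.0, 1.0],  # position 1 prefers T
    [-1.0, -1.0, 1.0, -1.0],  # position 2 prefers G
    [1.0, -1.0, -1.0, -1.0],  # position 3 prefers A
]

activations = conv1d_relu(one_hot("ACTGACC"), tga_filter)
pooled = max(activations)  # max pooling over the whole sequence
```

The filter fires only at the window that matches its preferred 3-mer, and pooling summarizes whether the motif occurs anywhere in the sequence. Stacking many such filters and layers is what lets a network build the higher-level representations described above.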
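Early applications of CNNs to DNA sequence analysis surpass more established algorithms, such as support vector machines or random forests, at predicting protein binding and accessibility from DNA sequence (Alipanahi et al. 2015; Zhou and Troyanskaya 2015). More accurate models can more precisely dissect regulatory sequences, thus improving noncoding variant interpretation. However, to fully exploit the value of these models, it is essential that they are technically and conceptually accessible to the researchers who can take advantage of their potential. Here, we introduce Basset, an open source package to apply deep CNNs to learn functional activities of DNA sequences. We used Basset to simultaneously predict the accessibility of DNA sequences in 164 cell types mapped by DNase-seq. From these data sets, CNNs simultaneously learn the relevant sequence motifs and the regulatory logic with which they are combined to determine cell-specific DNA accessibility. We show that a model achieving this level of accuracy provides meaningful, nucleotide-precision measurements. Subsequently, we assign Genome-wide association study (GWAS) variants cell-type–specific scores that reflect the accessibility difference predicted by the model between the two alleles. These scores are highly predictive of the causal SNP among sets of linked variants. Importantly, Basset puts CNNs in the hands of the biology community, providing tools and strategies for researchers to train and analyze models on new data sets. In conjunction with genomic big data, Basset offers a promising future for understanding how the genome crafts phenotypes.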
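The allele-difference scoring described in the abstract can be sketched as follows. This is an illustration under stated assumptions, not Basset's API: `toy_predict` is a hypothetical stand-in for a trained model, its per-cell-type scores are invented, and the sign convention (alternate minus reference) is an assumption rather than something the abstract specifies.

```python
# Sketch only: `toy_predict` is a hypothetical stand-in for a trained model.
# It is NOT Basset's API; it just returns one made-up score per "cell type".

def toy_predict(seq):
    """Return fake per-cell-type accessibility scores for a DNA string."""
    seq = seq.upper()
    return {
        "cell_A": seq.count("TGA") / len(seq),  # pretend cell A responds to TGA
        "cell_B": seq.count("GC") / len(seq),   # pretend cell B responds to GC
    }

def allele_difference(seq, pos, ref, alt, predict):
    """Score a SNP per cell type as predict(alt sequence) - predict(ref sequence)."""
    if seq[pos].upper() != ref.upper():
        raise ValueError("reference allele does not match the sequence")
    alt_seq = seq[:pos] + alt + seq[pos + 1:]  # substitute the alternate allele
    ref_scores = predict(seq)
    alt_scores = predict(alt_seq)
    return {cell: alt_scores[cell] - ref_scores[cell] for cell in ref_scores}

# Assumed sign convention: negative means the alternate allele lowers the score.
scores = allele_difference("ACTGACGT", pos=3, ref="G", alt="C", predict=toy_predict)
```

In this toy case the substitution destroys the only TGA site, so the score drops for the hypothetical `cell_A` and is unchanged for `cell_B`. With a real model, the same comparison run over each variant in a linked set would give the cell-type-specific scores used to rank candidate causal SNPs.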