Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks

Authors: David R. Kelley, Jasper Snoek, John L. Rinn

DOI: 10.1101/GR.200535.115

Keywords: Genome Biology; Computational biology; Convolutional neural network; Biology; Sequence analysis; ENCODE; DNA sequencing; Genome; Genetics; Epigenomics; Artificial neural network

Abstract: The process of identifying genomic sites that show statistical relationships to phenotypes holds great promise for human health and disease (Hindorff et al. 2009). However, our current inability to efficiently interpret noncoding variants impedes progress toward using personal genomes in medicine. Coordinated efforts to survey the noncoding genome have shown that sequences marked by DNA accessibility and certain histone modifications are enriched for variants that are statistically related to phenotypes (The ENCODE Project Consortium 2012; Roadmap Epigenomics Consortium et al. 2015). The first stages of a mechanistic hypothesis can now be assigned to variants that directly overlap these annotations (Fu et al. 2014; Kircher et al. 2014; Ritchie et al. 2014). However, simply considering the overlap of a variant with annotations underutilizes these data; more can be extracted by understanding the DNA–protein interactions as a function of the underlying sequence. Proteins that recognize specific signals in the DNA influence its accessibility and histone modifications (Voss and Hager 2014). Given training data, models parameterized by machine learning can effectively predict protein binding, DNA accessibility, histone modifications, and DNA methylation from the sequence (Das et al. 2006; Arnold et al. 2013; Benveniste et al. 2014; Pinello et al. 2014; Lee et al. 2015; Setty and Leslie 2015; Whitaker et al. 2015). A trained model can then annotate the influence of every nucleotide (and variant) on these regulatory attributes. This upgrades previous approaches in two ways. First, variants can be studied at a finer resolution; researchers can prioritize variants predicted to drive the regulatory activity and devalue those predicted to be irrelevant bystanders. Second, rare variants that introduce a gain of function will often not overlap regulatory annotations in publicly available data. An accurate model for regulatory activity can predict the gain of function, allowing follow-up consideration of the site. In recent years, artificial neural networks with many stacked layers have achieved breakthrough advances on benchmark data sets in image analysis (Krizhevsky et al. 2012) and natural language processing (Collobert et al. 2011). Rather than choose features manually or in a preprocessing step, convolutional neural networks (CNNs) adaptively learn them from the data during training. They apply nonlinear transformations to map input data to informative high-dimensional representations that trivialize classification or regression (Bengio et al. 2013).
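Early applications of CNNs to DNA sequence analysis surpass more established algorithms, such as support vector machines or random forests, at predicting protein binding and accessibility from DNA sequence (Alipanahi et al. 2015; Zhou and Troyanskaya 2015). More accurate models can more precisely dissect regulatory sequences, thus improving noncoding variant interpretation. However, to fully exploit the value of these models, it is essential that they are technically and conceptually accessible to the researchers who can take advantage of their potential. Here, we introduce Basset, an open source package to apply deep CNNs to learn functional activities of DNA sequences. We used Basset to simultaneously predict the accessibility of DNA sequences in 164 cell types mapped by DNase-seq by the ENCODE Project Consortium and the Roadmap Epigenomics Consortium. From these data sets, CNNs simultaneously learn the relevant sequence motifs and the regulatory logic with which they are combined to determine cell-specific DNA accessibility. We show that a model achieving this level of accuracy provides meaningful, nucleotide-precision measurements. Subsequently, we assign genome-wide association study (GWAS) variants cell-type–specific scores that reflect the accessibility difference predicted by the model between the two alleles. These scores are highly predictive of the causal SNP among sets of linked variants. Importantly, Basset puts CNNs in the hands of the biology community, providing tools and strategies for researchers to train and analyze models on new data sets. In conjunction with genomic big data, Basset offers a promising future for understanding how the genome crafts phenotypes.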
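
The allele-difference scoring described in the abstract can be illustrated with a minimal sketch. It is not Basset's implementation: the single hand-set filter, the weight, the bias, and every function name below are hypothetical, chosen only to show the idea of one-hot encoding a sequence, scanning it with a convolutional filter followed by a ReLU and max pooling, and scoring a SNP as the predicted accessibility of the alternate allele minus that of the reference allele.

```python
import math

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as rows of 4 values (A, C, G, T); other symbols become all zeros."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]

def conv_relu_maxpool(x, filt):
    """Slide one filter (width x 4 weights) along the sequence, apply a ReLU,
    and keep the strongest activation (global max pooling)."""
    width = len(filt)
    best = 0.0  # starting at 0.0 makes the running max act as the ReLU
    for start in range(len(x) - width + 1):
        act = sum(x[start + i][j] * filt[i][j]
                  for i in range(width) for j in range(4))
        best = max(best, act)
    return best

def predict_accessibility(seq, filters, weights, bias):
    """Toy one-layer model: pooled filter activations passed through a sigmoid."""
    x = one_hot(seq)
    z = bias + sum(w * conv_relu_maxpool(x, f) for f, w in zip(filters, weights))
    return 1.0 / (1.0 + math.exp(-z))

def variant_score(seq, pos, alt, filters, weights, bias):
    """Predicted accessibility of the alternate allele minus the reference allele."""
    alt_seq = seq[:pos] + alt + seq[pos + 1:]
    return (predict_accessibility(alt_seq, filters, weights, bias)
            - predict_accessibility(seq, filters, weights, bias))

# Hypothetical hand-set filter rewarding the motif "TGAC" (an assumption, not a learned filter).
filters, weights, bias = [one_hot("TGAC")], [2.0], -4.0

ref = "AAATGACAAA"
score = variant_score(ref, 4, "T", filters, weights, bias)  # G -> T disrupts the motif
print(f"alt - ref accessibility score: {score:.3f}")
```

In this toy setting a negative score means the alternate allele weakens the motif match and lowers the predicted accessibility. A trained model would learn many filters across several layers and produce one such score per cell type.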

References (61)
Manu Setty, Christina S. Leslie. SeqGL Identifies Context-Dependent Binding Signals in Genome-Wide Regulatory Element Maps. PLOS Computational Biology, vol. 11, e1004271 (2015). doi:10.1371/JOURNAL.PCBI.1004271
Babak Alipanahi, Andrew Delong, Matthew T Weirauch, Brendan J Frey. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nature Biotechnology, vol. 33, pp. 831-838 (2015). doi:10.1038/NBT.3300
Vinod Nair, Geoffrey E. Hinton. Rectified Linear Units Improve Restricted Boltzmann Machines. International Conference on Machine Learning, pp. 807-814 (2010).
Xiang Zhang, Yann LeCun. Text Understanding from Scratch. arXiv: Learning (2015).
Darío G. Lupiáñez, Katerina Kraft, Verena Heinrich, Peter Krawitz, Francesco Brancati, Eva Klopocki, Denise Horn, Hülya Kayserili, John M. Opitz, Renata Laxova, Fernando Santos-Simarro, Brigitte Gilbert-Dussardier, Lars Wittler, Marina Borschiwer, Stefan A. Haas, Marco Osterwalder, Martin Franke, Bernd Timmermann, Jochen Hecht, Malte Spielmann, Axel Visel, Stefan Mundlos. Disruptions of Topological Chromatin Domains Cause Pathogenic Rewiring of Gene-Enhancer Interactions. Cell, vol. 161, pp. 1012-1025 (2015). doi:10.1016/J.CELL.2015.04.004
Bing Ren, Adrian R. Krainer, Tom Maniatis, Qiang Wu, Ya Guo, Quan Xu, Daniele Canzio, Jia Shou, Jinhuan Li, David U. Gorkin, Inkyung Jung, Haiyang Wu, Yanan Zhai, Yuanxiao Tang, Yichao Lu, Yonghu Wu, Zhilian Jia, Wei Li, Michael Q. Zhang. CRISPR Inversion of CTCF Sites Alters Genome Topology and Enhancer/Promoter Function. Cell, vol. 162, pp. 900-910 (2015). doi:10.1016/J.CELL.2015.07.038
Robert E Thurman, Eric Rynes, Richard Humbert, Jeff Vierstra, Matthew T Maurano, Eric Haugen, Nathan C Sheffield, Andrew B Stergachis, Hao Wang, Benjamin Vernot, Kavita Garg, Sam John, Richard Sandstrom, Daniel Bates, Lisa Boatman, Theresa K Canfield, Morgan Diegel, Douglas Dunn, Abigail K Ebersol, Tristan Frum, Erika Giste, Audra K Johnson, Ericka M Johnson, Tanya Kutyavin, Bryan Lajoie, Bum-Kyu Lee, Kristen Lee, Darin London, Dimitra Lotakis, Shane Neph, Fidencio Neri, Eric D Nguyen, Hongzhu Qu, Alex P Reynolds, Vaughn Roach, Alexias Safi, Minerva E Sanchez, Amartya Sanyal, Anthony Shafer, Jeremy M Simon, Lingyun Song, Shinny Vong, Molly Weaver, Yongqi Yan, Zhancheng Zhang, Zhuzhu Zhang, Boris Lenhard, Muneesh Tewari, Michael O Dorschner, R Scott Hansen, Patrick A Navas, George Stamatoyannopoulos, Vishwanath R Iyer, Jason D Lieb, Shamil R Sunyaev, Joshua M Akey, Peter J Sabo, Rajinder Kaul, Terrence S Furey, Job Dekker, Gregory E Crawford, John A Stamatoyannopoulos. The accessible chromatin landscape of the human genome. Nature, vol. 489, pp. 75-82 (2012). doi:10.1038/NATURE11232
Dan Benveniste, Hans-Joachim Sonntag, Guido Sanguinetti, Duncan Sproul. Transcription factor binding predicts histone modifications in human cell lines. Proceedings of the National Academy of Sciences of the United States of America, vol. 111, pp. 13367-13372 (2014). doi:10.1073/PNAS.1412081111
Mahmoud Ghandi, Dongwon Lee, Morteza Mohammad-Noori, Michael A. Beer. Enhanced Regulatory Sequence Prediction Using Gapped k-mer Features. PLOS Computational Biology, vol. 10, e1003711 (2014). doi:10.1371/JOURNAL.PCBI.1003711
Rupali P Patwardhan, Joseph B Hiatt, Daniela M Witten, Mee J Kim, Robin P Smith, Dalit May, Choli Lee, Jennifer M Andrie, Su-In Lee, Gregory M Cooper, Nadav Ahituv, Len A Pennacchio, Jay Shendure. Massively parallel functional dissection of mammalian enhancers in vivo. Nature Biotechnology, vol. 30, pp. 265-270 (2012). doi:10.1038/NBT.2136