TALE: Transformer-based protein function Annotation with joint sequence–Label Embedding

Authors: Yue Cao, Yang Shen

DOI: 10.1101/2020.09.27.315937

Keywords: Embedding, Machine learning, Annotation, Gene, Protein function, Source code, Transformer (machine learning model), Computer science, Directed graph, Artificial intelligence, Deep learning

Abstract: Motivation: Facing the increasing gap between high-throughput sequence data and limited functional insights, computational protein function annotation provides an alternative to experimental approaches. However, current methods can have limited applicability while relying on data besides sequences, or lack generalizability to novel species and functions. Results: To overcome the aforementioned barriers in generalizability, we propose a deep learning model, named Transformer-based Annotation through joint sequence–Label Embedding (TALE). For generalizability to sequences, we use self-attention-based transformers to capture global patterns in sequences. For unseen or rarely seen functions, we also embed labels (hierarchical GO terms on directed graphs) together with inputs/features (sequences) in a joint latent space. Combining TALE and a similarity-based method, TALE+ outperformed competing methods when only sequence input is available. It even outperformed a state-of-the-art method using network information besides sequence, in two of the three gene ontologies. Furthermore, TALE showed superior generalizability to proteins of low homology and never/rarely annotated functions compared to the training data, revealing insights into the sequence–function relationship. Ablation studies elucidated the contributions of algorithmic components toward accuracy and generalizability. Availability: The source codes and models are available at https://github.com/Shen-Lab/TALE. Contact: yshen@tamu.edu. Supplementary information: Supplementary data are available at Bioinformatics online.
