DOI: 10.1101/2020.09.27.315937
Keywords: Embedding, Machine learning, Annotation, Gene, Protein function, Source code, Transformer (machine learning model), Computer science, Directed graph, Artificial intelligence, Deep learning
Abstract:
Motivation: Facing the increasing gap between high-throughput sequence data and limited functional insights, computational protein function annotation provides a high-throughput alternative to experimental approaches. However, current methods can have limited applicability while relying on data besides sequences, or lack generalizability to novel sequences, species and functions.
Results: To overcome the aforementioned barriers in applicability and generalizability, we propose a deep learning model, named Transformer-based protein function Annotation through joint sequence–Label Embedding (TALE). For generalizability to novel sequences, we use self-attention-based transformers to capture global patterns in sequences. For generalizability to unseen or rarely seen functions, we also embed function labels (hierarchical GO terms on directed graphs) together with inputs/features (sequences) in a joint latent space. Combining TALE with a sequence similarity-based method, TALE+ outperformed competing methods when only sequence input is available. It even outperformed a state-of-the-art method using network information besides sequences in two of the three gene ontologies. Furthermore, TALE and TALE+ showed superior generalizability to proteins of low homology and to never/rarely annotated functions compared to the training data, revealing deep insights into the sequence–function relationship. Ablation studies elucidated the contributions of algorithmic components toward accuracy and generalizability.
Availability: The source codes and models are available at https://github.com/Shen-Lab/TALE
Contact: yshen@tamu.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
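The abstract's two key ideas (a self-attention transformer that captures global sequence patterns, and label embeddings that share a latent space with sequence embeddings so that rare or unseen GO terms can still be scored) can be illustrated with a minimal toy sketch. The PyTorch code below is an illustrative assumption, not the authors' TALE architecture: the class name JointSeqLabelEmbedding, all dimensions, the mean-pooling step, and the dot-product scoring rule are hypothetical choices for demonstration only.

```python
# Minimal, hypothetical sketch of a joint sequence-label embedding model,
# loosely inspired by the abstract. NOT the authors' TALE implementation;
# every name, dimension, and scoring choice here is an assumption.
import torch
import torch.nn as nn

class JointSeqLabelEmbedding(nn.Module):
    def __init__(self, vocab_size=26, num_labels=1000, d_model=128,
                 nhead=4, num_layers=2, max_len=512):
        super().__init__()
        # Token and positional embeddings for integer-encoded amino acids.
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        # Self-attention encoder: captures global patterns in the sequence.
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        # Learned embeddings for GO-term labels, sharing the latent space
        # with the sequence representation.
        self.label_emb = nn.Embedding(num_labels, d_model)

    def forward(self, seq_tokens):
        # seq_tokens: (batch, length) integer-encoded amino acids.
        pos = torch.arange(seq_tokens.size(1), device=seq_tokens.device)
        h = self.tok_emb(seq_tokens) + self.pos_emb(pos)
        h = self.encoder(h)                # (batch, length, d_model)
        seq_vec = h.mean(dim=1)            # pool to one vector per protein
        # Score every label by similarity in the joint latent space; labels
        # rarely seen in training still receive scores via their embeddings.
        logits = seq_vec @ self.label_emb.weight.T   # (batch, num_labels)
        return logits

model = JointSeqLabelEmbedding()
toy_batch = torch.randint(0, 26, (2, 100))  # two random toy "sequences"
print(model(toy_batch).shape)               # torch.Size([2, 1000])
```

The design point this toy illustrates is why label embedding helps generalizability: because each GO term is a vector in the same space as the sequence representation rather than a dedicated output unit, a rarely (or never) annotated term can still be scored through its learned vector. The actual model additionally exploits the hierarchical GO structure on directed graphs, which this sketch omits.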