Authors: Brendan R E Ansell, Bernard J Pope, Peter Georgeson, Samantha J Emery-Corbin, Aaron R Jex
DOI: 10.1093/GIGASCIENCE/GIY150
Keywords:
Abstract: Background: Large-scale computational prediction of protein structures represents a cost-effective alternative to empirical structure determination, with particular promise for non-model organisms and neglected pathogens. Conventional sequence-based tools are insufficient to annotate the genomes of such divergent biological systems. Conversely, protein structure tolerates substantial variation in primary amino acid sequence and is thus a robust indicator of biochemical function. Structural proteomics is poised to become a standard part of pathogen genomics research; however, informatic methods are now required to assign confidence in large volumes of predicted structures. Aims: Our aim was to predict the structural proteome of the human pathogen Giardia duodenalis, and to stratify the predicted structures into high- and lower-confidence categories using a variety of metrics in isolation and in combination. Methods: We used the I-TASSER suite to predict structural models for ∼5,000 proteins encoded in G. duodenalis and to identify their closest empirically-determined structural homologues in the Protein Data Bank. Models were assigned to high- or lower-confidence categories depending on the presence of matching protein family (Pfam) domains in the query and reference peptides. Metrics output from the suite, and derived metrics, were assessed for their ability to predict the high-confidence category individually, and in combination through development of a random forest classifier. Results: We identified 1,095 high-confidence models, including 212 hypothetical proteins. Amino acid identity between the query and reference peptides was the greatest individual predictor of high-confidence status; however, the random forest classifier outperformed any metric in isolation (area under the receiver operating characteristic curve = 0.976) and identified a subset of 305 high-confidence-like models, corresponding to false-positive predictions. High-confidence models exhibited greater transcriptional abundance, and the classifier generalized across species, indicating the broad utility of this approach for automatically stratifying predicted structures. Additional structure-based clustering was used to cross-check predictions in an expanded family of Nek kinases. Several models yielded new insight into mechanisms of redox balance in G. duodenalis, a system central to the efficacy of the limited anti-giardial drugs.
Conclusion: Structural proteomics combined with machine learning can aid genome annotation for genetically divergent organisms, including pathogens, and promote efficient allocation of resources for experimental investigation.
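
Illustrative sketch (not the authors' code): the labelling rule and the single-metric evaluation described in the Methods can be approximated in a few lines of Python. The sketch assumes that "matching Pfam domains" means the query and reference share at least one Pfam accession, which the abstract does not specify. The function names, the toy records and the identity values are all hypothetical. The random forest step is omitted; only the area under the ROC curve for one metric (amino acid identity) is computed, using the pairwise-ranking definition.

```python
from typing import Iterable, Sequence


def confidence_label(query_pfam: Iterable[str], reference_pfam: Iterable[str]) -> bool:
    """Return True (high-confidence) when query and reference share a Pfam domain.

    Assumption: one shared accession is enough; the paper's exact rule may differ.
    """
    return bool(set(query_pfam) & set(reference_pfam))


def roc_auc(scores: Sequence[float], labels: Sequence[bool]) -> float:
    """Area under the ROC curve for one metric.

    Computed as the fraction of (positive, negative) pairs in which the
    positive example has the higher score; ties count as half a win.
    """
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical toy records:
# (query Pfam domains, reference Pfam domains, % amino acid identity)
models = [
    ({"PF00069"}, {"PF00069"}, 42.0),
    ({"PF00085"}, {"PF00085", "PF13098"}, 35.0),
    (set(), {"PF00071"}, 12.0),
    ({"PF00004"}, {"PF00153"}, 38.0),
    ({"PF00012"}, {"PF00012"}, 55.0),
]

labels = [confidence_label(q, r) for q, r, _ in models]
identity = [ident for _, _, ident in models]
auc = roc_auc(identity, labels)
print(labels, round(auc, 3))
```

In this toy set, five of the six positive-negative pairs are ranked correctly by identity, so the AUC is 5/6. A combined classifier such as the random forest reported in the abstract would take several such metrics as input features rather than one.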