Beyond the E-Value: Stratified Statistics for Protein Domain Prediction

Authors: Alejandro Ochoa, John D. Storey, Manuel Llinás, Mona Singh

DOI: 10.1371/JOURNAL.PCBI.1004509

Keywords: Statistic; Protein domain; Multiple comparisons problem; Random sequence; Markov model; Computer science; False discovery rate; Hidden Markov model; Data mining; Statistical hypothesis testing; Statistics

Abstract: E-values have been the dominant statistic for protein sequence analysis for the past two decades: from identifying statistically significant local sequence alignments to evaluating matches to hidden Markov models describing protein domain families. Here we formally show that for "stratified" multiple hypothesis testing problems (that is, those in which statistical tests can be partitioned naturally), controlling the local False Discovery Rate (lFDR) per stratum, or partition, yields the most predictions across the data at any given threshold on the FDR or E-value over all strata combined. For the important problem of protein domain prediction, a key step in characterizing protein structure, function and evolution, we show that stratifying statistical tests by domain family yields excellent results. We develop the first FDR-estimating algorithms for domain prediction, and evaluate how well thresholds based on q-values, E-values and lFDRs perform in domain prediction using five complementary approaches for estimating empirical FDRs in this context. We show that stratified q-value thresholds substantially outperform E-values. Contradicting our theoretical results, q-values also outperform lFDRs; however, our tests reveal a small but coherent subset of domain families, biased towards models for specific repetitive patterns, for which weaknesses in random sequence models yield notably inaccurate statistical significance measures. Usage of lFDR thresholds outperforms q-values for the remaining families, which have as-expected noise, suggesting that further improvements in domain prediction can be achieved with improved modeling of random sequences. Overall, our theoretical and empirical findings suggest that the use of stratified q-values and lFDRs could result in improvements in a host of structured multiple hypothesis testing problems arising in bioinformatics, including genome-wide association studies, orthology prediction, and motif scanning.
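
To make the idea of stratification concrete, the following minimal sketch computes q-values separately within each stratum (for example, each domain family) instead of over all tests pooled together. It is not the authors' algorithm: it assumes each hit already has a p-value, uses the Benjamini-Hochberg step-up form of the q-value with the proportion of true nulls fixed at 1 (a conservative simplification), and the names `bh_qvalues`, `stratified_qvalues`, and the stratum labels are hypothetical.

```python
from collections import defaultdict


def bh_qvalues(pvalues):
    """Benjamini-Hochberg style q-values for one set of p-values.

    Assumption: the proportion of true nulls (pi0) is taken to be 1,
    which is conservative; other estimators of pi0 could be swapped in.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that q-values
    # are non-decreasing in the p-value.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues


def stratified_qvalues(hits):
    """Compute q-values within each stratum.

    hits: list of (stratum_label, p_value) pairs (hypothetical input format).
    Returns a list of q-values in the same order as the input.
    """
    by_stratum = defaultdict(list)
    for idx, (stratum, _) in enumerate(hits):
        by_stratum[stratum].append(idx)

    result = [0.0] * len(hits)
    for indices in by_stratum.values():
        qs = bh_qvalues([hits[i][1] for i in indices])
        for i, q in zip(indices, qs):
            result[i] = q
    return result


# Toy example with two hypothetical strata, "A" and "B".
hits = [("A", 0.01), ("B", 0.04), ("A", 0.03), ("B", 0.5)]
pooled = bh_qvalues([p for _, p in hits])
stratified = stratified_qvalues(hits)
```

In this toy input the two hits in stratum "A" receive smaller q-values when computed within their own stratum than when pooled with stratum "B", which illustrates why a single pooled threshold and per-stratum thresholds can select different sets of predictions.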
