Beyond the E-Value: Stratified Statistics for Protein Domain Prediction

Authors: Alejandro Ochoa, John D. Storey, Manuel Llinás, Mona Singh

Keywords: Statistic, Protein domain, Multiple comparisons problem, Random sequence, Markov model, Computer science, False discovery rate, Hidden Markov model, Data mining, Statistical hypothesis testing, Statistics

Abstract: E-values have been the dominant statistic for protein sequence analysis for the past two decades: from identifying statistically significant local sequence alignments to evaluating matches to hidden Markov models describing protein domain families. Here we formally show that for “stratified” multiple hypothesis testing problems (that is, those in which statistical tests can be partitioned naturally), controlling the local False Discovery Rate (lFDR) per stratum, or partition, yields the most predictions across the data at any given threshold on the FDR or E-value over all strata combined. For the important problem of protein domain prediction, a key step in characterizing protein structure, function and evolution, we show that stratifying statistical tests by domain family yields excellent results. We develop the first FDR-estimating algorithms for domain prediction, and evaluate how well thresholds based on q-values, E-values and lFDRs perform in domain prediction using five complementary approaches for estimating empirical FDRs in this context. We show that stratified q-value thresholds substantially outperform E-values. Contradicting our theoretical results, q-values also outperform lFDRs; however, our tests reveal a small but coherent subset of domain families, biased towards models for specific repetitive patterns, for which weaknesses in random sequence models yield notably inaccurate statistical significance measures. Using lFDR thresholds outperforms q-values for the remaining families, which have as-expected noise, suggesting that further improvements in domain predictions can be achieved with improved modeling of random sequences. Overall, our theoretical and empirical findings suggest that the use of stratified q-values and lFDRs could result in improvements in a host of structured multiple hypothesis testing problems arising in bioinformatics, including genome-wide association studies, orthology prediction, and motif scanning.
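To illustrate what "stratified" q-values mean in practice, the following is a minimal Python sketch, not the authors' algorithm. It assumes each test already has a p-value and a stratum label (for example, a domain family), and it uses the standard Benjamini-Hochberg q-value as a stand-in estimator. The function names `bh_qvalues` and `stratified_qvalues` are hypothetical and chosen only for this example.

```python
from collections import defaultdict


def bh_qvalues(pvalues):
    """Benjamini-Hochberg q-values (assumption: proportion of true nulls taken as 1)."""
    m = len(pvalues)
    # Indices of the p-values in ascending order.
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so q-values are monotone in rank.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues


def stratified_qvalues(pvalues, strata):
    """Compute q-values separately within each stratum (e.g. per domain family)."""
    groups = defaultdict(list)
    for idx, label in enumerate(strata):
        groups[label].append(idx)
    out = [None] * len(pvalues)
    for idxs in groups.values():
        qs = bh_qvalues([pvalues[i] for i in idxs])
        for i, q in zip(idxs, qs):
            out[i] = q
    return out


if __name__ == "__main__":
    # Hypothetical p-values for four matches across two families "A" and "B".
    p = [0.01, 0.04, 0.03, 0.5]
    print(bh_qvalues(p))                                # pooled over all tests
    print(stratified_qvalues(p, ["A", "A", "B", "B"]))  # one correction per family
```

Thresholding the per-stratum q-values, rather than a single pooled statistic, is the kind of stratified control the abstract describes; the lFDR-based variant would need a density estimate per stratum and is not shown here.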

