Authors: Alejandro Ochoa, John D. Storey, Manuel Llinás, Mona Singh
DOI: 10.1371/JOURNAL.PCBI.1004509
Keywords: Statistic, Protein domain, Multiple comparisons problem, Random sequence, Markov model, Computer science, False discovery rate, Hidden Markov model, Data mining, Statistical hypothesis testing, Statistics
Abstract: E-values have been the dominant statistic for protein sequence analysis for the past two decades: from identifying statistically significant local sequence alignments to evaluating matches to hidden Markov models describing protein domain families. Here we formally show that for “stratified” multiple hypothesis testing problems (that is, those in which statistical tests can be partitioned naturally), controlling the local False Discovery Rate (lFDR) per stratum, or partition, yields the most predictions across the data at any given threshold on the FDR or E-value over all strata combined. For the important problem of protein domain prediction, a key step in characterizing protein structure, function and evolution, we show that stratifying statistical tests by domain family yields excellent results. We develop the first FDR-estimating algorithms for domain prediction, and evaluate how well thresholds based on q-values, E-values and lFDRs perform in domain prediction using five complementary approaches for estimating empirical FDRs in this context. We show that stratified q-value thresholds substantially outperform E-values. Contradicting our theoretical results, q-values also outperform lFDRs; however, our tests reveal a small but coherent subset of domain families, biased towards models for specific repetitive patterns, for which weaknesses in random sequence models yield notably inaccurate statistical significance measures. Usage of lFDR thresholds outperforms q-values for the remaining families, which have as-expected noise, suggesting that further improvements in domain predictions can be achieved with improved modeling of random sequences. Overall, our theoretical and empirical findings suggest that the use of stratified q-values and lFDRs could result in improvements in a host of structured multiple hypothesis testing problems arising in bioinformatics, including genome-wide association studies, orthology prediction, and motif scanning.
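
Illustration: the abstract's idea of stratifying tests, for example by domain family, and computing q-values within each stratum can be sketched as below. This is a minimal sketch, not the authors' algorithm: it assumes the standard Benjamini-Hochberg estimator with the null proportion fixed at 1, and the function names `bh_qvalues` and `stratified_qvalues` and the toy p-values are hypothetical.

```python
from collections import defaultdict


def bh_qvalues(pvalues):
    """Benjamini-Hochberg q-values for one set of p-values.

    Assumption: the proportion of true nulls is taken as 1 (plain BH),
    which is a conservative simplification.
    """
    m = len(pvalues)
    # Indices of the p-values in ascending order.
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so q-values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues


def stratified_qvalues(pvalues, strata):
    """Compute q-values separately within each stratum.

    `strata[i]` is the (hypothetical) stratum label of test i,
    e.g. the domain family that produced the match.
    """
    groups = defaultdict(list)
    for idx, label in enumerate(strata):
        groups[label].append(idx)
    out = [None] * len(pvalues)
    for idxs in groups.values():
        qs = bh_qvalues([pvalues[i] for i in idxs])
        for i, q in zip(idxs, qs):
            out[i] = q
    return out


if __name__ == "__main__":
    # Toy data: two families with very different noise levels.
    p = [0.01, 0.5, 0.02, 0.9]
    fam = ["A", "A", "B", "B"]
    print(stratified_qvalues(p, fam))
```

Applying one q-value threshold to the per-stratum values then sets a different effective p-value cutoff in each stratum, which is the behaviour the abstract contrasts with a single E-value threshold over all strata combined.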