Authors: Kumar P Mainali, Sharon Bewick, Peter Thielen, Thomas Mehoke, Florian P Breitwieser
DOI: 10.1371/JOURNAL.PONE.0187132
Keywords: Jaccard index, Spurious relationship, Statistics, Co-occurrence, Correlation coefficient, Null model, Rare species, Macroecology, Similarity (network science), Biology
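Abstract: Drawing on a long history in macroecology, correlation analysis of microbiome datasets is becoming a common practice for identifying relationships or shared ecological niches among bacterial taxa. However, many of the statistical issues that plague such analyses in macroscale communities remain unresolved for microbial communities. Here, we discuss problems in the analysis of microbial species correlations based on presence-absence data. We focus on presence-absence data because this information is more readily obtainable from sequencing studies, especially whole-genome sequencing, where abundance estimation is still in its infancy. First, we show how Pearson's correlation coefficient (r) and Jaccard's index (J), the two most common metrics for correlation analysis of presence-absence data, can contradict each other when applied to a typical microbiome dataset. In our dataset, for example, 14% of species-pairs predicted to be significantly correlated by r were not predicted to be significantly correlated using J, while 37.4% of species-pairs predicted to be significantly correlated by J were not predicted to be significantly correlated using r. Mismatch was particularly common among species-pairs with at least one rare species (<10% prevalence), explaining why r and J might differ more strongly in microbiome datasets, where there are large numbers of rare taxa. Indeed, 74% of all species-pairs in our study had at least one rare species. Next, we show how Pearson's correlation coefficient can result in artificial inflation of positive taxon relationships, and how this is a particular problem for microbiome studies. We then illustrate how Jaccard's index of similarity (J) can yield improvements over Pearson's correlation coefficient. However, the standard null model for Jaccard's index is flawed, and thus introduces its own set of spurious conclusions. We therefore identify a better null model based on a hypergeometric distribution, which appropriately corrects for species prevalence. This model is available from recent statistics literature, and can be used for evaluating the significance of any value of an empirically observed Jaccard's index. The resulting simple, yet effective method for handling correlation analysis of microbial presence-absence datasets provides a robust means of testing and finding relationships and/or shared environmental responses among microbial taxa.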
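The hypergeometric null model mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation; it assumes that each species' prevalence is held fixed, that sites are exchangeable, and that a one-sided (upper-tail) test for positive association is wanted. Under those assumptions the number of shared sites K follows a hypergeometric distribution, and because J increases with K when the two prevalences are fixed, P(J >= J_obs) equals P(K >= k_obs). The function names `jaccard`, `cooccurrence_pmf`, and `jaccard_upper_p` are hypothetical and chosen only for this example.

```python
from math import comb


def jaccard(x, y):
    """Jaccard index of two equal-length 0/1 presence-absence vectors."""
    both = sum(1 for a, b in zip(x, y) if a and b)
    either = sum(1 for a, b in zip(x, y) if a or b)
    return both / either if either else 0.0


def cooccurrence_pmf(k, m, n, total):
    """P(K = k) when species with prevalences m and n are placed
    independently across `total` sites (hypergeometric probability)."""
    if k < max(0, m + n - total) or k > min(m, n):
        return 0.0
    return comb(m, k) * comb(total - m, n - k) / comb(total, n)


def jaccard_upper_p(x, y):
    """One-sided p-value: probability of a Jaccard index at least as
    large as the observed one, given both prevalences (assumed null)."""
    total = len(x)
    m, n = sum(x), sum(y)
    k_obs = sum(1 for a, b in zip(x, y) if a and b)
    return sum(cooccurrence_pmf(k, m, n, total)
               for k in range(k_obs, min(m, n) + 1))


# Toy data (invented for illustration): 10 sites, two rare species.
species_a = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
species_b = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

print(jaccard(species_a, species_b))          # 0.5
print(jaccard_upper_p(species_a, species_b))  # 22/120, about 0.183
```

In this toy case J = 0.5 looks substantial, yet two species that each occupy 3 of 10 sites would share two or more sites about 18% of the time by chance alone, which shows why the significance of an observed J has to be judged against prevalence.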