Authors: Gustavo Glusman, Alissa Severson, Varsha Dhankani, Max Robinson, Terry Farrah
Keywords: Segmentation, Identification (information), Normalization (statistics), Biology, Copy-number variation, Whole genome sequencing, Structural variation, Data mining, Hidden Markov model, Genome
Abstract: The identification of DNA copy numbers from short-read sequencing data remains a challenge for both technical and algorithmic reasons. The raw data for these analyses are measured in tens to hundreds of gigabytes per genome; transmitting, storing, and analyzing such large files is cumbersome, particularly for methods that analyze several samples simultaneously. We developed a very efficient representation of depth of coverage (150–1000× compression) that enables such analyses. Current methods for analyzing variants in whole-genome sequencing (WGS) data frequently miss copy number variants (CNVs), particularly hemizygous deletions in the 1–100 kb range. To fill this gap, we developed a method to identify CNVs in individual genomes, based on comparison to joint profiles pre-computed from a large set of genomes. We analyzed depth of coverage in over 6000 high quality (>40×) genomes. The depth of coverage has strong sequence-specific fluctuations only partially explained by global parameters like %GC. To account for these fluctuations, we constructed multi-genome profiles representing the observed or inferred diploid depth of coverage at each position along the genome. These Reference Coverage Profiles (RCPs) take into account the diverse technologies and pipeline versions used. Normalization of the scaled coverage to the RCP, followed by hidden Markov model (HMM) segmentation, enables efficient detection of CNVs in individual genomes. Use of pre-computed multi-genome coverage profiles improves our ability to analyze each individual genome. We make available RCPs and tools for performing these analyses on personal genomes. We expect the increased sensitivity and specificity for individual genome analysis to be critical for achieving clinical-grade genome interpretation.
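The abstract names two steps, normalization of coverage to the RCP and HMM segmentation, without giving implementation details. The following is a minimal Python sketch of that general idea, not the authors' tool. Everything specific in it is an assumption made for illustration: the function names (`normalize`, `viterbi`, `segment`), the three copy-number states, the Gaussian emission model, the noise level `sigma`, and the self-transition probability `stay`. It also assumes that coverage has already been binned and that the RCP holds the expected diploid coverage per bin.

```python
import math
from statistics import median

# Hypothetical states: expected coverage ratio (relative to diploid) and copy number.
STATE_MEANS = [0.5, 1.0, 1.5]
STATE_COPIES = [1, 2, 3]


def normalize(coverage, rcp, eps=1e-9):
    """Divide per-bin coverage by the reference profile, then scale so the
    median ratio is 1.0 (i.e. a typical bin is assumed diploid).
    Bins where the reference is ~0 carry no information and become None."""
    raw = [c / r if r > eps else None for c, r in zip(coverage, rcp)]
    valid = [x for x in raw if x is not None]
    if not valid or median(valid) <= 0:
        raise ValueError("no usable bins to scale against")
    med = median(valid)
    return [x / med if x is not None else None for x in raw]


def _log_emit(ratio, state, sigma):
    """Gaussian log-likelihood of a ratio under a state; 0.0 for missing bins."""
    if ratio is None:
        return 0.0
    z = (ratio - STATE_MEANS[state]) / sigma
    return -0.5 * z * z - math.log(sigma * math.sqrt(2 * math.pi))


def viterbi(ratios, sigma=0.15, stay=0.99):
    """Most likely copy number per bin under a simple 3-state HMM."""
    if not ratios:
        return []
    n = len(STATE_MEANS)
    log_stay = math.log(stay)
    log_switch = math.log((1 - stay) / (n - 1))

    def trans(p, s):
        return log_stay if p == s else log_switch

    # Uniform prior over states for the first bin.
    score = [_log_emit(ratios[0], s, sigma) for s in range(n)]
    back = []
    for r in ratios[1:]:
        new_score, ptr = [], []
        for s in range(n):
            best = max(range(n), key=lambda p: score[p] + trans(p, s))
            new_score.append(score[best] + trans(best, s) + _log_emit(r, s, sigma))
            ptr.append(best)
        score = new_score
        back.append(ptr)

    # Trace back from the best final state.
    state = max(range(n), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return [STATE_COPIES[s] for s in path]


def segment(copies):
    """Collapse per-bin calls into (start_bin, end_bin_exclusive, copy_number)."""
    segments, start = [], 0
    for i in range(1, len(copies) + 1):
        if i == len(copies) or copies[i] != copies[start]:
            segments.append((start, i, copies[start]))
            start = i
    return segments
```

Calling `segment(viterbi(normalize(coverage, rcp)))` on a sample's binned coverage yields a list of segments; those with a copy number other than 2 are candidate CNVs. Dividing by the reference profile first is what removes the sequence-specific fluctuations shared across genomes, so that the HMM only has to model deviations from the diploid expectation.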