Authors: Aleksandr Kovaltsuk, Konrad Krawczyk, Sebastian Kelm, James Snowden, Charlotte M. Deane
Keywords: Word error rate, Line (text file), Computer science, Repertoire, Drug discovery, Nucleic acid sequence, Computational biology, Sequence (medicine), DNA sequencing, Variable (computer science)
Abstract: Next-generation sequencing of the Ig gene repertoire (Ig-seq) produces large volumes of information at the nucleotide sequence level. Such data have improved our understanding of immune systems across numerous species and have already been successfully applied in vaccine development and drug discovery. However, the high-throughput nature of Ig-seq means that it is afflicted by high error rates. This has led to the development of error-correction approaches. Computational error-correction methods use sequence information alone, primarily designating sequences as likely to be correct if they are observed frequently. In this work, we describe an orthogonal method for filtering Ig-seq data, which considers the structural viability of each sequence. A typical natural Ab structure requires the presence of a disulfide bridge within each of its variable chains to maintain the fold. Our Ab Sequence Selector (ABOSS) uses the presence/absence of this bridge as a way of both identifying structurally viable sequences and estimating the sequencing error rate. On simulated Ig-seq datasets, ABOSS is able to identify more than 99% of structurally viable sequences. Applying our method to six independent Ig-seq datasets (one mouse and five human), we show that our error calculations are in line with previous experimental and computational error estimates. We also show how ABOSS is able to identify structurally impossible sequences missed by other error-correction methods.