Strobemers: an alternative to k-mers for sequence comparison

Author: Kristoffer Sahlin

DOI: 10.1101/2021.01.28.428549

Keywords: Selection (genetic algorithm), Indel, Algorithm, Mutation rate, Sensitivity (control systems), Sequence (medicine), Variable (computer science), Permutation, Computer science

Abstract: K-mer-based methods are widely used in bioinformatics for various types of sequence comparison. However, a single mutation will mutate k consecutive k-mers, which makes most k-mer-based applications for sequence comparison sensitive to variable mutation rates. Many techniques have been studied to overcome this sensitivity, e.g., spaced k-mers and k-mer permutation techniques, but these do not handle indels well. For indels, pairs or groups of small k-mers are commonly used, but these methods first produce k-mer matches, and only in a second step is a pairing or grouping of k-mers performed. Such methods produce many redundant k-mer matches due to the small size of k. Here, we propose strobemers as an alternative to k-mers for sequence comparison. Intuitively, a strobemer consists of linked minimizers. We show that, under a certain minimizer selection technique, strobemers provide more evenly distributed sequence matches than k-mers and are less sensitive to different mutation rates and distributions. Strobemers also give higher total coverage across sequences. Strobemers are useful for performing sequence comparisons such as read alignment, clustering, classification, and error-correction.
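
The abstract makes two points that a small example may help illustrate: a single substitution changes k consecutive k-mers, and a strobemer links minimizers rather than using one contiguous word. The following is a minimal Python sketch, not the author's implementation. The abstract gives no construction details, so the order-2 "minstrobe"-style linking, the window bounds `w_min` and `w_max`, the helper names, and the choice of hash are all assumptions made for illustration.

```python
import hashlib


def kmers(seq, k):
    """Return all overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def stable_hash(s):
    """Deterministic hash (Python's built-in hash() is salted per process)."""
    digest = hashlib.blake2b(s.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def positions_differing(a, b, k):
    """Start positions where the k-mers of two equal-length sequences differ."""
    pairs = zip(kmers(a, k), kmers(b, k))
    return [i for i, (x, y) in enumerate(pairs) if x != y]


def minstrobes(seq, k, w_min, w_max):
    """Hypothetical order-2 sketch: link the k-mer at i to a downstream minimizer.

    The second strobe is the k-mer with the smallest hash among start
    positions i + w_min .. i + w_max (an assumed window, clipped at the end).
    """
    n = len(seq) - k + 1  # number of k-mer start positions
    out = []
    for i in range(n):
        lo, hi = i + w_min, min(i + w_max, n - 1)
        if lo > hi:
            break  # no room left for a second strobe
        j = min(range(lo, hi + 1), key=lambda p: stable_hash(seq[p:p + k]))
        out.append((i, j, seq[i:i + k] + seq[j:j + k]))
    return out


ref = "ACGTTGCATGCCGATAGCTAGGCTAACGTTAGC"
mut = ref[:15] + "C" + ref[16:]  # single substitution (A -> C) at position 15

# With k = 5, the five k-mers starting at positions 11..15 cover position 15.
print(positions_differing(ref, mut, 5))  # -> [11, 12, 13, 14, 15]

# Each entry is (first strobe start, second strobe start, concatenated strobes).
for i, j, s in minstrobes(ref, k=5, w_min=6, w_max=10)[:3]:
    print(i, j, s)
```

Because the second strobe's position is chosen by hash value inside a window rather than fixed, the gap between the two strobes varies from one position to the next. Whether that reproduces the match distribution reported in the abstract depends on the actual selection technique used in the paper, which this sketch does not claim to replicate.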
