VecScreen_plus_taxonomy: imposing a tax(onomy) increase on vector contamination screening.

Authors: Alejandro A Schäffer, Eric P Nawrocki, Yoon Choi, Paul A Kitts, Ilene Karsch-Mizrachi

DOI: 10.1093/BIOINFORMATICS/BTX669

Keywords:

Abstract: Motivation: Nucleic acid sequences in public databases should not contain vector contamination, but many sequences in GenBank do (or did) contain vectors. The National Center for Biotechnology Information uses the program VecScreen to screen submitted sequences for contamination. Additional tools are needed to distinguish true-positive (contamination) from false-positive (not contamination) VecScreen matches. Results: A principal reason for false-positive VecScreen matches is that the sequence and the matching vector subsequence originate from closely related or identical organisms (for example, both originate in Escherichia coli). We collected information on the taxonomy of sources of vector segments in the UniVec database used by VecScreen. We used that information in two overlapping software pipelines for retrospective analysis of contamination in GenBank and for prospective analysis of contamination in new sequence submissions. Using the retrospective pipeline, we identified and corrected over 8000 contaminated sequences in the nonredundant nucleotide database. The prospective analysis pipeline has been in production use since April 2017 to evaluate some new GenBank submissions. Availability and implementation: Data on the sources of UniVec entries were included in release 10.0 of UniVec (ftp://ftp.ncbi.nih.gov/pub/UniVec/). The main software is freely available at https://github.com/aaschaffer/vecscreen_plus_taxonomy. Contact: aschaffe@helix.nih.gov. Supplementary information: Supplementary data are available at Bioinformatics online.
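
The taxonomy-based idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the `Match` record, the `shared_depth` and `likely_false_positive` helpers, the `min_shared_depth` threshold, and the simplified lineages are all hypothetical names and values chosen for illustration. The sketch only shows the general principle that a match is more likely to be a false positive when the submitted sequence and the vector segment come from closely related or identical organisms.

```python
from dataclasses import dataclass
from typing import Tuple

Lineage = Tuple[str, ...]  # root-to-leaf taxon names (simplified, illustrative)


@dataclass(frozen=True)
class Match:
    """Hypothetical record for one match between a query and a vector segment."""
    query_id: str
    query_lineage: Lineage   # organism of the submitted sequence
    vector_lineage: Lineage  # source organism of the matched vector segment


def shared_depth(a: Lineage, b: Lineage) -> int:
    """Count how many leading taxa the two lineages have in common."""
    depth = 0
    for taxon_a, taxon_b in zip(a, b):
        if taxon_a != taxon_b:
            break
        depth += 1
    return depth


def likely_false_positive(match: Match, min_shared_depth: int = 4) -> bool:
    """Flag a match as a probable false positive when both lineages agree
    on at least `min_shared_depth` leading taxa (an assumed threshold)."""
    return shared_depth(match.query_lineage, match.vector_lineage) >= min_shared_depth


# Simplified lineages for illustration only; real lineages have more ranks.
ECOLI: Lineage = (
    "Bacteria", "Proteobacteria", "Gammaproteobacteria",
    "Enterobacterales", "Enterobacteriaceae", "Escherichia", "Escherichia coli",
)
HUMAN: Lineage = (
    "Eukaryota", "Metazoa", "Chordata", "Mammalia",
    "Primates", "Hominidae", "Homo", "Homo sapiens",
)

if __name__ == "__main__":
    same_source = Match("query_1", ECOLI, ECOLI)
    distant_source = Match("query_2", HUMAN, ECOLI)
    print(likely_false_positive(same_source))
    print(likely_false_positive(distant_source))
```

Under these assumptions, an E. coli sequence that matches a vector segment derived from E. coli is flagged as a probable false positive, while a human sequence that matches the same segment is not, so it remains a contamination candidate.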

References (20)
Jeffrey Scott Coker, Eric Davies. Identifying adaptor contamination when mining DNA sequence data. BioTechniques, vol. 37, pp. 194-198 (2004), 10.2144/04372BM03
Yun-Lung Li, Jui-Cheng Weng, Chiung-Chih Hsiao, Min-Te Chou, Chin-Wen Tseng, Jui-Hung Hung. PEAT: an intelligent and efficient paired-end sequencing adapter trimming algorithm. BMC Bioinformatics, vol. 16, pp. 1-11 (2015), 10.1186/1471-2105-16-S1-S2
C Savakis, R Doelz. Contamination of cDNA sequences in databases. Science, vol. 259, pp. 1677-1678 (1993), 10.1126/SCIENCE.8456288
Robert Schmieder, Yan Wei Lim, Forest Rohwer, Robert Edwards. TagCleaner: Identification and removal of tag sequences from genomic and metagenomic datasets. BMC Bioinformatics, vol. 11, p. 341 (2010), 10.1186/1471-2105-11-341
Juan Falgueras, Antonio J Lara, Noe Fernandez-Pozo, Francisco R. Canton, Guillermo Perez-Trabado, M. Gonzalo Claros. SeqTrim: a high-throughput pipeline for pre-processing any type of sequence read. BMC Bioinformatics, vol. 11, p. 38 (2010), 10.1186/1471-2105-11-38
G. Seluja, A Farmer, M McLeod, C Harger, P. Schad. Establishing a method of vector contamination identification in database sequences. Bioinformatics, vol. 15, pp. 106-110 (1999), 10.1093/BIOINFORMATICS/15.2.106
Matthew Binns. Contamination of DNA database sequence entries with Escherichia coli insertion sequences. Nucleic Acids Research, vol. 21, p. 779 (1993), 10.1093/NAR/21.3.779
Owen White, Ted Dunning, Granger Sutton, Mark Adams, J. Craig Venter, Chris Fields. A quality control algorithm for DNA sequencing projects. Nucleic Acids Research, vol. 21, pp. 3829-3838 (1993), 10.1093/NAR/21.16.3829