Authors: Alejandro A Schäffer, Eric P Nawrocki, Yoon Choi, Paul A Kitts, Ilene Karsch-Mizrachi
DOI: 10.1093/BIOINFORMATICS/BTX669
Keywords:
Abstract: Motivation: Nucleic acid sequences in public databases should not contain vector contamination, but many sequences in GenBank do (or did) contain vectors. The National Center for Biotechnology Information uses the program VecScreen to screen submitted sequences for contamination. Additional tools are needed to distinguish true-positive (contamination) from false-positive (not contamination) VecScreen matches. Results: A principal reason for false-positive VecScreen matches is that the sequence and the matching vector subsequence originate from closely related or identical organisms (for example, both originate in Escherichia coli). We collected information on the taxonomy of sources of vector segments in the UniVec database used by VecScreen. We used that information in two overlapping software pipelines: one for retrospective analysis of contamination in GenBank and one for prospective analysis of contamination in new sequence submissions. Using the retrospective pipeline, we identified and corrected over 8000 contaminated sequences in the nonredundant nucleotide database. The prospective analysis pipeline has been in production use since April 2017 to evaluate some new GenBank submissions. Availability and implementation: Data on the sources of UniVec entries were included in release 10.0 of UniVec (ftp://ftp.ncbi.nih.gov/pub/UniVec/). The main software is freely available at https://github.com/aaschaffer/vecscreen_plus_taxonomy. Contact: aschaffe@helix.nih.gov. Supplementary information: Supplementary data are available at Bioinformatics online.