Authors: René Warren, Yaron Butterfield, Steven Jones, Marco Marra
DOI:
Keywords:
Abstract: We have designed and implemented a sequence data management system, integrated with our genome annotation pipeline, to handle the flow of genome sequence assembly data. The Sequence Assembly Manager (SAM) consists primarily of a Perl CGI web application designed to manipulate and coordinate the analysis of genomic information and to view and report genome assembly progress. The user interface sits on top of a relational database created to store trace archive and assembly information. SAM manages the execution of the sub-applications required for data storage, sequence assembly, file parsing, custom analysis and visualization of whole genome shotgun assembly data. The software includes a tool to compare sequence assemblies to fingerprint maps, which has already proved useful in identifying sequence misassemblies. A main advantage of this system is the ease and flexibility with which genome assemblies can be performed using all of the available sequence data, or a subset of it. We use Phrap and Arachne concurrently for all our genome assemblies, and the system has been designed to incorporate new assemblers easily as they become available. Modular programs have been written to parse the output files that each assembler generates, analyze their content and populate the database with assembly-specific information. This includes general and specific information about contigs, supercontigs, sequence composition, coverage and build statistics, along with custom calculations such as the identification of gap-spanning and overlapping clones, clone insert size …