Toward effective software solutions for big biology

Authors: Pjotr Prins, Joep de Ligt, Artem Tarasov, Ritsert C Jansen, Edwin Cuppen

DOI: 10.1038/NBT.3240

Abstract: Leading scientists tell us that the problem of large data and data integration, referred to as 'big data', is acute and hurting research. Recently, Snijder et al. [1] suggested a culture change in which scientists would aim to share high-dimensional data among laboratories. It is important to realize that sharing data is only part of the solution. The elephant in the room is bioinformatics, and software development in particular, which, despite being crucially important, mostly fails to address the requirements of 'big data'. Whereas Internet companies such as Google, Facebook and Skype have built infrastructure and developed innovative solutions to cope with vast amounts of data, the bioscience community seems to be struggling with big data projects. This has led to problems with the sharing, annotation, computation and reproducibility of data [2-4]. Before we can devise solutions for big data, there are more basic and pressing concerns that need to be resolved. Biologists are not formally trained in software engineering, so much of the software available today has been developed by PhD biologists in relative isolation, on the back of funded experimental research programs. This model, tied to wet-lab work, can work well but has resulted in 'one-offs'. The aim of most projects is to obtain results in the shortest possible time, and this is often achieved by writing prototype software rather than developing well-engineered and scalable solutions. Even when funding is obtained to develop software, usually no long-term resources are allocated to maintenance, bug fixing, continuity and reproducibility. Instead of working alone, researchers can join or start collaborative free and open-source software (FOSS) projects, thereby improving their coding skills through the scrutiny of peers. True FOSS licenses allow the continuation of projects that were abandoned by their original developers, enabling modular software development. We have published a manifesto and practical guide for FOSS-style bioinformatics software development (https://github.com/pjotrp/bioinformatics/blob/master/README.md) that aims to provide process and architecture guidelines for early-career bioinformaticians and their supervisors. Bioinformatics already has vibrant FOSS projects such as Galaxy, Cytoscape, BioPerl and Biopython, but these are worked on part-time owing to a lack of funding; this inadequate support will not service big biology without major additional investment.
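For example, after initial funding from the US National Institutes of Health (NIH) and the National Science Foundation (NSF), the Galaxy project is now seeking new funding to continue its work, and no funds at all are granted by scientific funding agencies to Biopython. The amount of funding dedicated to bioinformatics software development remains small. Of the NIH budget of $30 billion, an estimated 2–4% goes to bioinformatics grants. We estimate that only a small fraction of this is used for software development. By comparison, the nonprofit Mozilla turns over $300 million annually for software development and promotion, and Google invests $6.7 billion in R&D. We recommend the following: emphasize FOSS approaches; build on existing grassroots initiatives [5]; create split funding streams for software and hardware; support the maintenance of software projects; encourage collaboration with experts in high-performance computing and software engineering; fund larger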

References (5)
1. Berend Snijder, Richard Kumaran Kandasamy, Giulio Superti-Furga. Toward effective sharing of high-dimensional immunology data. Nature Biotechnology, vol. 32, pp. 755-759 (2014). DOI: 10.1038/NBT.2974
2. Oswaldo Trelles, Pjotr Prins, Marc Snir, Ritsert C. Jansen. Big data, but are we ready? Nature Reviews Genetics, vol. 12, p. 224 (2011). DOI: 10.1038/NRG2857-C1
3. Francis S. Collins, Lawrence A. Tabak. Policy: NIH plans to enhance reproducibility. Nature, vol. 505, pp. 612-613 (2014). DOI: 10.1038/505612A
4. Vivien Marx. Biology: The big challenges of big data. Nature, vol. 498, pp. 255-260 (2013). DOI: 10.1038/498255A
5. Steffen Möller, Enis Afgan, Michael Banck, Raoul JP Bonnal, Timothy Booth, John Chilton, Peter JA Cock, Markus Gumbel, Nomi Harris, Richard Holland, Matúš Kalaš, László Kaján, Eri Kibukawa, David R Powel, Pjotr Prins, Jacqueline Quinn, Olivier Sallou, Francesco Strozzi, Torsten Seemann, Clare Sloggett, Stian Soiland-Reyes, William Spooner, Sascha Steinbiss, Andreas Tille, Anthony J Travis, Roman Valls Guimera, Toshiaki Katayama, Brad A Chapman. Community-driven development for computational biology at Sprints, Hackathons and Codefests. BMC Bioinformatics, vol. 15, pp. 1-7 (2014). DOI: 10.1186/1471-2105-15-S14-S7