NGSdb: A NGS Data Management and Analysis Platform for Comparative Genomics
Cobb, Marea Jean
As researchers continue to expand the volume of Next Generation Sequencing data, the ability to store and query the data becomes increasingly important. The current spreadsheet-based approach has become too complex, and the data too vast, to store, view, cross-query, analyze, and share efficiently among collaborators. We have created and implemented a relational database schema, NGSdb (PostgreSQL), coupled with a user-friendly web interface (Django/Python), to address this growing need. NGSdb currently has five core components: a sample core, which tracks the sample information (e.g., organism, growth phase); a library core, which tracks the libraries constructed from samples (e.g., library type, sequencing method, raw data files); a genome core, which stores information about reference genomes; an analysis core, where the meta-information of bioinformatics analyses is stored; and a result core, where the results of the bioinformatics analyses are stored. I have expanded NGSdb by developing two analysis modules: a somy/CNV module and a SNP module. In addition to storing and retrieving the data, the web interface also serves as an analytical platform. The database is designed to be modular, allowing for future additions as new technology or data becomes available. The modularity enables us to query across our different data types, such as SNP data and RNA-Seq data (e.g., how does the expression level change when a gene is mutated?). We demonstrate the capabilities of our system through two separate case studies. The first recapitulates a recently published genomic analysis of two Sri Lankan strains of Leishmania donovani, one causing visceral disease (VL) and one causing cutaneous disease (CL). The second case study compares the genome of a laboratory-adapted strain of L. donovani with genetically modified clones derived from it: single (sKO) and double (dKO) deletions of the dpkAR1 gene, and a derivative of the dKO line that had recovered the wild-type growth phenotype.
We identified single nucleotide polymorphisms (SNPs), copy number variations (CNVs), and somy differences between these lines to determine which genomic differences may contribute to the growth phenotype recovery of the double knockouts. NGSdb successfully reproduced the previously published analysis results and identified a potential artifact in the second study. Through these analyses we have also identified additions to NGSdb that we believe will further increase the usability of the system.
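The five-core layout and the cross-data-type query described above can be illustrated with a minimal sketch. It is not the actual NGSdb schema: every table name, column name, gene name, and value below is a hypothetical placeholder, the result core is reduced to two toy tables, and Python's built-in sqlite3 module stands in for the PostgreSQL/Django stack so that the example is self-contained.

```python
import sqlite3

# Hypothetical, heavily simplified schema. Table and column names are
# illustrative assumptions; SQLite stands in for PostgreSQL.
SCHEMA = """
CREATE TABLE sample   (id INTEGER PRIMARY KEY, name TEXT, organism TEXT, growth_phase TEXT);
CREATE TABLE library  (id INTEGER PRIMARY KEY, sample_id INTEGER REFERENCES sample(id),
                       library_type TEXT, sequencing_method TEXT, raw_file TEXT);
CREATE TABLE genome   (id INTEGER PRIMARY KEY, organism TEXT, version TEXT);
CREATE TABLE analysis (id INTEGER PRIMARY KEY, library_id INTEGER REFERENCES library(id),
                       genome_id INTEGER REFERENCES genome(id), analysis_type TEXT);
-- Result core, split by data type (SNP calls and expression levels).
CREATE TABLE snp_result        (analysis_id INTEGER REFERENCES analysis(id),
                                gene TEXT, position INTEGER, ref TEXT, alt TEXT);
CREATE TABLE expression_result (analysis_id INTEGER REFERENCES analysis(id),
                                gene TEXT, level REAL);
"""

# For one gene: expression level per sample, plus a flag that is 1 when the
# same sample has at least one SNP recorded in that gene.
CROSS_QUERY = """
SELECT e.gene, s.name, e.level,
       EXISTS (SELECT 1
               FROM snp_result r
               JOIN analysis a2 ON a2.id = r.analysis_id
               JOIN library  l2 ON l2.id = a2.library_id
               WHERE l2.sample_id = s.id AND r.gene = e.gene) AS mutated
FROM expression_result e
JOIN analysis a ON a.id = e.analysis_id
JOIN library  l ON l.id = a.library_id
JOIN sample   s ON s.id = l.sample_id
WHERE e.gene = ?
ORDER BY s.id
"""


def build_demo_db():
    """Create an in-memory database filled with made-up toy rows."""
    con = sqlite3.connect(":memory:")
    con.executescript(SCHEMA)
    con.executemany("INSERT INTO sample VALUES (?, ?, ?, ?)", [
        (1, "wild type", "L. donovani", "log"),
        (2, "mutant", "L. donovani", "log"),
    ])
    con.executemany("INSERT INTO library VALUES (?, ?, ?, ?, ?)", [
        (1, 1, "DNA-Seq", "paired-end", "wt_dna.fastq"),
        (2, 1, "RNA-Seq", "paired-end", "wt_rna.fastq"),
        (3, 2, "DNA-Seq", "paired-end", "mut_dna.fastq"),
        (4, 2, "RNA-Seq", "paired-end", "mut_rna.fastq"),
    ])
    con.execute("INSERT INTO genome VALUES (1, 'L. donovani', 'toy-v1')")
    con.executemany("INSERT INTO analysis VALUES (?, ?, ?, ?)", [
        (1, 1, 1, "SNP"), (2, 2, 1, "expression"),
        (3, 3, 1, "SNP"), (4, 4, 1, "expression"),
    ])
    # Only the mutant sample carries a SNP, in the made-up gene "geneA".
    con.execute("INSERT INTO snp_result VALUES (3, 'geneA', 100, 'A', 'G')")
    con.executemany("INSERT INTO expression_result VALUES (?, ?, ?)", [
        (2, "geneA", 10.0), (2, "geneB", 5.0),
        (4, "geneA", 2.5), (4, "geneB", 5.1),
    ])
    return con


def expression_vs_mutation(con, gene):
    """Return (gene, sample name, expression level, mutated flag) rows."""
    return con.execute(CROSS_QUERY, (gene,)).fetchall()


if __name__ == "__main__":
    for row in expression_vs_mutation(build_demo_db(), "geneA"):
        print(row)
```

Because every result row links back through an analysis and a library to a sample, a single join path is enough to connect SNP calls to expression values. A new data type would, under the same assumptions, only need its own result table with a reference to the analysis table.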