This repository contains examples for the genomics-data-index project, a system that indexes large amounts of genomics data and enables rapid querying of that data.
Indexing breaks genomes up into individual features (nucleotide mutations, k-mers, or genes/MLST) and stores the index in a directory that can easily be shared with other people. Indexes can be generated directly from sequence data or loaded from existing intermediate files (e.g., VCF files).
# Index features in VCF files listed in vcf-files.txt
gdi load vcf vcf-files.txt
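As a rough picture of what "breaking genomes up into features" means, the toy sketch below inverts a sample-to-features mapping into a feature-to-samples lookup. It is not the project's storage format or API. The sample names, the second mutation, and the `build_index` helper are all invented for illustration.

```python
from collections import defaultdict

# Hypothetical input: sample name -> nucleotide mutations found in that sample.
# The sample names and the second mutation are invented for illustration.
sample_features = {
    "SampleA": {"MN996528.1:26568:C:A"},
    "SampleB": {"MN996528.1:26568:C:A", "MN996528.1:100:G:T"},
    "SampleC": set(),
}


def build_index(sample_features):
    """Invert sample -> features into feature -> set of samples."""
    index = defaultdict(set)
    for sample, features in sample_features.items():
        for feature in features:
            index[feature].add(sample)
    return dict(index)


index = build_index(sample_features)

# Looking up a feature returns every sample that carries it.
print(sorted(index["MN996528.1:26568:C:A"]))
```

Once the lookup exists, finding the samples that share a feature is a single dictionary access, with no need to rescan every genome.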
Querying provides both a Python API and a command-line interface for selecting sets of samples using this index or attached external data (e.g., phylogenetic trees or DataFrames of metadata).
# Select samples with a 26568 C > A mutation
r = s.hasa('MN996528.1:26568:C:A')
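The feature string in the query above appears to follow a `sequence:position:reference:alternate` layout. That layout is inferred from this one example and its comment, not from a specification. The sketch below splits such a string into its parts. The `Mutation` type and `parse_mutation` helper are hypothetical and not part of the genomics-data-index API.

```python
from typing import NamedTuple


class Mutation(NamedTuple):
    """Hypothetical container for the parts of a mutation identifier."""
    sequence: str
    position: int
    ref: str
    alt: str


def parse_mutation(spec: str) -> Mutation:
    """Split 'sequence:position:ref:alt' into its four parts.

    Splitting from the right keeps any ':' in the sequence name intact.
    Raises ValueError if the string has fewer than four parts.
    """
    sequence, position, ref, alt = spec.rsplit(":", 3)
    return Mutation(sequence, int(position), ref, alt)


m = parse_mutation("MN996528.1:26568:C:A")
print(m)
```

Read this way, the identifier describes a change from `C` to `A` at position 26568 of sequence `MN996528.1`, which matches the comment on the query.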
Tutorials and a demonstration of the genomics-data-index software are available below. Select the launch binder badge for a tutorial to open it in an interactive Jupyter environment in the cloud using Binder.
- Tutorial 1: Querying (Salmonella)
- If the GitHub link does not render, try here
- Tutorial 2: Indexing assemblies (SARS-CoV-2)
- If the GitHub link does not render, try here
- Tutorial 3: Querying overview
- If the GitHub link does not render, try here
Alternatively, you can run these tutorials on your local machine. To do so, first install the genomics-data-index
software (see the Installation section for details), then install Jupyter Lab. If you have already installed the genomics-data-index
software with conda, you can install Jupyter Lab as follows:
conda activate gdi
conda install jupyterlab
To start Jupyter Lab, run:
jupyter lab
Please see the instructions for Jupyter Lab for details on using Jupyter.