DeepVariant-on-Spark is a germline short variant calling pipeline that runs Google DeepVariant on Apache Spark at scale.
- DeepVariant is highly accurate. In 2016, DeepVariant won the PrecisionFDA Truth Challenge in the Best SNP Performance category.
- Apache Spark is a lightning-fast unified analytics engine for large-scale data processing. Apache Spark achieves high performance for both batch and streaming data, using a state-of-the-art DAG scheduler, a query optimizer, and a physical execution engine.
- DeepVariant (v0.7) does not support multiple GPUs. With DeepVariant-on-Spark, all GPU resources can be fully utilized across multiple nodes. For example, an NVIDIA DGX-1 has eight Tesla V100 GPUs, all of which can be put to work.
- DeepVariant-on-Spark leverages Atgenomix SeqPiper, a wrapper technology built on Spark PipeRDD, to parallelize the DeepVariant pipeline on Spark, and uses YARN to optimize resource allocation in multi-node environments.
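
To make the PipeRDD idea above more concrete: Spark's `RDD.pipe` feeds the elements of each partition to an external command on stdin, one per line, and turns the command's stdout lines into the output records. The sketch below emulates that behaviour in plain Python with `subprocess`, so it runs without a Spark cluster. It is not SeqPiper or DeepVariant-on-Spark code. The function names, the round-robin partitioning, the one-GPU-per-partition mapping through `CUDA_VISIBLE_DEVICES`, and the demo command are all assumptions made for illustration, and the partitions run sequentially here, whereas Spark would run them concurrently across executors.

```python
import os
import subprocess
import sys


def split_into_partitions(records, num_partitions):
    """Distribute records round-robin into num_partitions lists."""
    partitions = [[] for _ in range(num_partitions)]
    for i, record in enumerate(records):
        partitions[i % num_partitions].append(record)
    return partitions


def pipe_partition(records, command, gpu_id):
    """Pipe one partition through an external command, one record per line.

    Assumption for illustration: the command is restricted to a single GPU
    by setting CUDA_VISIBLE_DEVICES in its environment.
    """
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu_id))
    result = subprocess.run(
        command,
        input="".join(record + "\n" for record in records),
        capture_output=True,
        text=True,
        env=env,
        check=True,  # raise if the external command fails
    )
    return result.stdout.splitlines()


def pipe_all(records, command, num_gpus):
    """Create one partition per GPU and collect the output lines in partition order."""
    output = []
    for gpu_id, part in enumerate(split_into_partitions(records, num_gpus)):
        output.extend(pipe_partition(part, command, gpu_id))
    return output


# Hypothetical stand-in for a DeepVariant step: it only tags each input
# line with the GPU id it was given.
DEMO_COMMAND = [
    sys.executable,
    "-c",
    "import os, sys\n"
    "gpu = os.environ['CUDA_VISIBLE_DEVICES']\n"
    "for line in sys.stdin:\n"
    "    print(line.strip() + ' gpu=' + gpu)\n",
]

if __name__ == "__main__":
    regions = ["chr1", "chr2", "chr3", "chr4"]
    for line in pipe_all(regions, DEMO_COMMAND, num_gpus=2):
        print(line)
```

In a real deployment the demo command would be replaced by the actual tool invocation, and the scheduler (YARN in this project) would decide where each partition runs.
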
- Apache Hadoop 2.8.x
- Apache Spark 2.2.x
- ADAM v0.23 (forked and modified by Atgenomix)
- DeepVariant v0.7.0 (forked and modified by Atgenomix)
- DeepVariant-on-Spark quick start on Google Cloud via Dataproc
- DeepVariant-on-Spark WGS case study
- Multiple GPU acceleration
- Customization
- Troubleshooting
Interested in contributing? See CONTRIBUTING.
DeepVariant-on-Spark is licensed under the terms of the Apache 2.0 License.
DeepVariant-on-Spark happily makes use of many open source packages. We'd like to specifically call out a few key ones:
We thank all of the developers and contributors to these packages for their work.
- This is not an official Atgenomix product.
- For the official product and the full experience, please contact Atgenomix ([email protected]).