Commit

Update PacBio documentation.
PiperOrigin-RevId: 303140460
pichuan committed Mar 26, 2020
1 parent e271f74 commit 4b937f0
Showing 1 changed file with 24 additions and 24 deletions.
48 changes: 24 additions & 24 deletions docs/deepvariant-pacbio-model-case-study.md
@@ -1,30 +1,29 @@
# Using DeepVariant for small variant calling from PacBio CCS reads
# Using DeepVariant for small variant calling from PacBio HiFi reads

In this case study we describe applying DeepVariant to PacBio CCS reads to call
In this case study we describe applying DeepVariant to PacBio HiFi reads to call
variants. We will call small variants from a publicly available whole genome
CCS dataset from PacBio.
HiFi dataset from PacBio.

Starting from v0.10.0, sequence from amplified libraries is included in our
PacBio CCS training set, providing a significant accuracy boost to variant
detection from amplified CCS data.
In v0.8, DeepVariant released a model for PacBio HiFi data. Starting from
v0.10.0, sequence from amplified libraries is included in our PacBio HiFi
training set, providing a significant accuracy boost to variant detection from
amplified HiFi data.
In this case study we will apply the PacBio model by specifying `PACBIO` in
the `model_type` parameter in the `run_pacbio_case_study_docker.sh` script.

Case study is run on a standard Google Cloud instance. There are no special
This case study is run on a standard Google Cloud instance. There are no special
hardware or software requirements for running this case study. For consistency
we use Google Cloud instance with 64 cores and 128 GB of memory. This is NOT the
fastest or cheapest configuration. For more scalable execution of DeepVariant
see the [External Solutions] section.

In v0.8 DeepVariant released a model for PacBio CCS data. In this case study we
will apply PacBio model by specifying `PACBIO` in `model_type` parameter in the
`run_pacbio_case_study_docker.sh` script.

## Case study overview

Calling small variants using DeepVariant involves multiple steps:

1. Creating examples. Variants candidates are extracted from input BAM file
1. Creating examples. Candidate variants are extracted from an input BAM file
(previously aligned).
2. Calling Variants. Applying DeepVariant Convolutional Neural Network (CNN)
2. Calling variants. Applying DeepVariant convolutional neural network (CNN)
model to infer variants.
3. Exporting results to VCF.
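
The three steps listed above correspond to three DeepVariant binaries run in sequence, each consuming the previous step's output. The following is a minimal sketch of that ordering, written in Python only to make the data flow explicit. All file paths are hypothetical placeholders, the helper name `build_pipeline` is invented for illustration, and the flags shown (`--mode`, `--ref`, `--reads`, `--examples`, `--checkpoint`, `--infile`, `--outfile`) are assumed to match the release in use, so check each binary's `--help` before relying on them.

```python
def build_pipeline(ref, reads, checkpoint, workdir, make_examples_extra=()):
    """Return the three DeepVariant stage commands in execution order.

    Every path is a hypothetical placeholder; make_examples_extra carries
    any technology-specific make_examples flags.
    """
    examples = f"{workdir}/examples.tfrecord.gz"
    called = f"{workdir}/call_variants_output.tfrecord.gz"
    vcf = f"{workdir}/output.vcf.gz"
    return [
        # 1. Creating examples: extract candidate variants from the aligned BAM.
        ["make_examples", "--mode", "calling", "--ref", ref,
         "--reads", reads, "--examples", examples, *make_examples_extra],
        # 2. Calling variants: run the CNN model over the examples.
        ["call_variants", "--examples", examples,
         "--checkpoint", checkpoint, "--outfile", called],
        # 3. Exporting results to VCF.
        ["postprocess_variants", "--ref", ref,
         "--infile", called, "--outfile", vcf],
    ]


if __name__ == "__main__":
    for cmd in build_pipeline("ref.fasta", "reads.bam", "model.ckpt", "/tmp/dv"):
        print(" ".join(cmd))
```

Returning argument lists instead of running anything keeps the sketch side-effect free; each list could be passed to `subprocess.run` once the paths point at real files.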

@@ -37,13 +36,13 @@ There are multiple ways to run DeepVariant:
- Download prebuilt binaries.
- Download an official DeepVariant Docker image.

This case study is run using official DeepVariant Docker image.
This case study is run using the official DeepVariant Docker image.

## Running

For simplicity we provide a script that downloads the input data and runs all
the steps described above using DeepVariant Docker image. **Please note, that if
you create your own script make_examples must be called with
you create your own script `make_examples` must be called with
`--norealign_reads --vsc_min_fraction_indels 0.12` flag for PacBio long reads.**
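
When writing a custom script, it can help to check the `make_examples` argument list before launching a long run. The sketch below is one hypothetical way to do that: the helper name `missing_pacbio_flags` is invented for illustration and is not part of DeepVariant. It only checks for the two settings named in the note above and accepts the indel fraction in either `--flag value` or `--flag=value` form.

```python
INDEL_FLAG = "--vsc_min_fraction_indels"
EXPECTED_INDEL_FRACTION = 0.12


def missing_pacbio_flags(argv):
    """Return a list of problems found in a make_examples argument list.

    An empty list means both PacBio settings from the note above are present.
    This is an illustrative helper, not part of DeepVariant.
    """
    problems = []
    if "--norealign_reads" not in argv:
        problems.append("missing --norealign_reads")

    # Find the indel fraction in either "--flag value" or "--flag=value" form.
    value = None
    for i, arg in enumerate(argv):
        if arg == INDEL_FLAG and i + 1 < len(argv):
            value = argv[i + 1]
        elif arg.startswith(INDEL_FLAG + "="):
            value = arg.split("=", 1)[1]

    if value is None:
        problems.append(f"missing {INDEL_FLAG}")
    else:
        try:
            if float(value) != EXPECTED_INDEL_FRACTION:
                problems.append(f"{INDEL_FLAG} is {value}, expected 0.12")
        except ValueError:
            problems.append(f"{INDEL_FLAG} has non-numeric value {value!r}")
    return problems
```

A script could call `missing_pacbio_flags(cmd)` on its assembled command and stop early if the result is non-empty.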

1. Create a Google Cloud virtual instance. This command creates a virtual
@@ -74,7 +73,7 @@ curl https://raw.githubusercontent.com/google/deepvariant/r0.10/scripts/run_pacb

## Script description

Before running DeepVariant steps following input data is downloaded:
Before running the DeepVariant steps, the following input data is downloaded:

* BAM file: pacbio.8M.30x.bam. Publicly available PacBio BAM file.
[ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/HG002_NA24385_son/
@@ -90,14 +89,15 @@ Before running DeepVariant steps following input data is downloaded:
downloaded from
[ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/AshkenazimTrio/HG002_NA24385_son/NISTv3.3.2/GRCh37/](ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/AshkenazimTrio/HG002_NA24385_son/NISTv3.3.2/GRCh37/)

Next following steps are executed:
Next the following steps are executed:

* make_examples. This step creates small variant candidates and stores them in
TensorFlow format.
* `make_examples`. This step creates small variant candidates and stores them
in TensorFlow format.

* call_variants. This step applys DeepVariant DNN to call small variants.
* `call_variants`. This step applies DeepVariant DNN to call small variants.

* postprocess_variants. This step converts data from TensorFlow format to VCF.
* `postprocess_variants`. This step converts data from TensorFlow format to
VCF.

* `hap.py` ([https://github.com/Illumina/hap.py]) program from Illumina is
used to evaluate the resulting vcf file. This serves as a check to ensure
@@ -114,9 +114,9 @@ Step | Wall time

## Accuracy metrics

PacBio model was trained using HG002 genome (the same genome we use for this
case study) with chromosomes 20, 21, 22 excluded. Therefore, we run evaluation
on chr20.
The PacBio model was trained using the HG002 genome (the same genome we use for
this case study) with chromosomes 20, 21, 22 excluded. Therefore, we run
evaluation on chr20.
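
The Recall, Precision, and F1\_Score columns in the evaluation follow from the TP, FN, and FP counts by the standard definitions. The sketch below shows that arithmetic under a simplifying assumption: it uses a single TP count, whereas `hap.py` tracks truth-side and query-side true positives separately, so its reported values may differ slightly. The function name and the example counts are illustrative only and are not results from this case study.

```python
def accuracy_metrics(tp, fn, fp):
    """Compute (recall, precision, f1) from raw counts.

    Simplified: one TP count is used for both recall and precision.
    """
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / denom if denom else 0.0
    return recall, precision, f1


if __name__ == "__main__":
    # Made-up counts for illustration only.
    recall, precision, f1 = accuracy_metrics(tp=90, fn=10, fp=10)
    print(f"recall={recall:.4f} precision={precision:.4f} f1={f1:.4f}")
```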

Type | # TP | # FN | # FP | Recall | Precision | F1\_Score
----- | ----- | ---- | ---- | -------- | --------- | ---------