Skip to content
This repository has been archived by the owner on Apr 18, 2024. It is now read-only.

Commit

Permalink
*v0.6 (April 25, 2023)*: Upgraded Python 3.6 to 3.9. Upgraded Whatsha…
Browse files Browse the repository at this point in the history
…p to [v1.7](HKU-BAL/Clair3#193). Fix gVCF format [error](#3). Added and [fixed](#4) "--enable_phasing" support. and "--enable_output_phasing" and "enable_output_haplotagging" supports.
  • Loading branch information
sujunhao committed Apr 25, 2023
1 parent 73521b8 commit a24629e
Show file tree
Hide file tree
Showing 11 changed files with 316 additions and 181 deletions.
19 changes: 9 additions & 10 deletions Dockerfile
Original file line number Diff line number Diff line change
Expand Up @@ -22,7 +22,7 @@ RUN wget --quiet https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86
conda config --add channels defaults && \
conda config --add channels bioconda && \
conda config --add channels conda-forge && \
conda create -n clair3 python=3.6.10 -y
conda create -n clair3 python=3.9.0 -y

ENV PATH /opt/conda/envs/clair3/bin:$PATH
ENV CONDA_DEFAULT_ENV clair3
Expand All @@ -31,15 +31,14 @@ RUN /bin/bash -c "source activate clair3" && \
conda install -c conda-forge pypy3.6 -y && \
pypy3 -m ensurepip && \
pypy3 -m pip install mpmath==1.2.1 && \
pip install tensorflow-cpu==2.2.0 && \
pip install tensorflow-addons==0.11.2 tables==3.6.1 && \
conda install -c anaconda pigz==2.4 -y && \
conda install -c anaconda cffi==1.14.4 -y && \
conda install -c conda-forge parallel=20191122 zstd=1.4.4 -y && \
conda install -c conda-forge -c bioconda samtools=1.10 -y && \
conda install -c conda-forge -c bioconda whatshap=1.0 -y && \
conda install -c conda-forge xz zlib bzip2 -y && \
conda install -c conda-forge automake curl -y && \
conda install -c conda-forge tensorflow==2.8.0 -y && \
conda install -c conda-forge pytables -y && \
conda install -c anaconda pigz cffi==1.14.4 -y && \
conda install -c conda-forge parallel=20191122 zstd -y && \
conda install -c conda-forge -c bioconda samtools=1.15.1 -y && \
conda install -c conda-forge -c bioconda whatshap=1.7 -y && \
conda install -c conda-forge xz zlib bzip2 automake curl -y && \
pip install tensorflow-addons && \
rm -rf /opt/conda/pkgs/* && \
rm -rf /root/.cache/pip && \
echo "source activate clair3" > ~/.bashrc
Expand Down
32 changes: 16 additions & 16 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -42,6 +42,8 @@ Detailed descriptions of the methodology and results for Clair3-Trio are availab

## Latest Updates

*v0.6 (April 25, 2023)*: Upgraded Python 3.6 to 3.9. Upgraded Whatshap to [v1.7](https://github.com/HKU-BAL/Clair3/issues/193). Fix gVCF format [error](https://github.com/HKU-BAL/Clair3-Trio/issues/3). Added and [fixed](https://github.com/HKU-BAL/Clair3-Trio/issues/4) "--enable_phasing" support. and "--enable_output_phasing" and "enable_output_haplotagging" supports.

*v0.5 (April 10, 2023)*: Added **gVCF support**. Using the "--gvcf" flag to enable gVCF output.

*v0.4 (March 22, 2023)*: A model for R10.4 pore with the Kit 14 chemistry (**Q20+**) is available now. Check [this page](https://github.com/HKU-BAL/Clair3-Trio/blob/trio/docs/trio/q20.md) for more information about the model.
Expand Down Expand Up @@ -148,7 +150,7 @@ chmod +x ./Miniconda3-latest-Linux-x86_64.sh

```bash
# create and activate an environment named clair3
conda create -n clair3 python=3.6.10 -y
conda create -n clair3 python=3.9.0 -y
source activate clair3

# install pypy and packages in the environemnt
Expand All @@ -157,12 +159,15 @@ pypy3 -m ensurepip
pypy3 -m pip install mpmath==1.2.1

# install python packages in environment
pip3 install tensorflow==2.2.0
pip3 install tensorflow-addons==0.11.2 tables==3.6.1
conda install -c anaconda pigz==2.4 -y
conda install -c conda-forge parallel=20191122 zstd=1.4.4 -y
conda install -c conda-forge -c bioconda samtools=1.10 -y
conda install -c conda-forge -c bioconda whatshap=1.0 -y
conda install -c conda-forge tensorflow==2.8.0 -y
conda install -c conda-forge pytables -y
conda install -c anaconda pigz cffi==1.14.4 -y
conda install -c conda-forge parallel=20191122 zstd -y
conda install -c conda-forge -c bioconda samtools=1.15.1 -y
conda install -c conda-forge -c bioconda whatshap=1.7 -y
conda install -c conda-forge xz zlib bzip2 automake curl -y
# tensorflow-addons is required in training
pip install tensorflow-addons

# clone Clair3-Trio
git clone https://github.com/HKU-BAL/Clair3-Trio.git
Expand Down Expand Up @@ -339,18 +344,13 @@ docker run -it hkubal/clair3-trio:latest /opt/bin/run_clair3_trio.sh --help
--pileup_model_prefix=STR EXPERIMENTAL: Model prefix in pileup calling, including $prefix.data-00000-of-00002, $prefix.data-00001-of-00002 $prefix.index. default: pileup.
--fa_model_prefix=STR EXPERIMENTAL: Model prefix in full-alignment calling, including $prefix.data-00000-of-00002, $prefix.data-00001-of-00002 $prefix.index, default: full_alignment.
--trio_model_prefix=STR EXPERIMENTAL: Model prefix in trio calling, including $prefix.data-00000-of-00002, $prefix.data-00001-of-00002 $prefix.index, default: trio.'
--trio_model_prefix=STR EXPERIMENTAL: Model prefix in trio calling, including $prefix.data-00000-of-00002, $prefix.data-00001-of-00002 $prefix.index, default: trio.
--var_pct_full=FLOAT EXPERIMENTAL: Specify an expected percentage of low quality 0/1 and 1/1 variants called in the pileup mode for full-alignment mode calling, default: 0.3.
--ref_pct_full=FLOAT EXPERIMENTAL: Specify an expected percentage of low quality 0/0 variants called in the pileup mode for full-alignment mode calling, default: 0.3 for ilmn and hifi, 0.1 for ont.
--var_pct_phasing=FLOAT EXPERIMENTAL: Specify an expected percentage of high quality 0/1 variants used in Clair3 WhatsHap phasing, default: 0.8 for ont guppy5 and 0.7 for other platforms.

--enable_phasing [X] Output phased variants using whatshap, default: disable.
--remove_intermediate_dir [X] Remove intermediate directory, including intermediate phased BAM, pileup and full-alignment results. default: disable.
--haploid_precise [X] EXPERIMENTAL: Enable haploid calling mode. Only 1/1 is considered as a variant, default: disable.
--haploid_sensitive [X] EXPERIMENTAL: Enable haploid calling mode. 0/1 and 1/1 are considered as a variant, default: disable.
--no_phasing_for_fa [X] EXPERIMENTAL: Call variants without whatshap phasing in full alignment calling, default: disable.
--call_snp_only [X] EXPERIMENTAL: Call candidates pass SNP minimum AF only, ignore Indel candidates, default: disable.
--enable_long_indel [X] EXPERIMENTAL: Call long Indel variants(>50 bp), default: disable.
--enable_output_phasing Output phased variants using whatshap, default: disable.
--enable_output_haplotagging Output enable_output_haplotagging BAM variants using whatshap, default: disable.
--enable_phasing It means `--enable_output_phasing`. The option is retained for backward compatibility.
```
Expand Down
8 changes: 4 additions & 4 deletions preprocess/CheckEnvs.py
Original file line number Diff line number Diff line change
Expand Up @@ -21,10 +21,10 @@
list(range(1, 23)) + ["X", "Y"]]

required_tool_version = {
'python': LooseVersion('3.6.10'),
'python': LooseVersion('3.9'),
'pypy': LooseVersion('3.6'),
'samtools': LooseVersion('1.10'),
'whatshap': LooseVersion('1.0'),
'samtools': LooseVersion('1.15'),
'whatshap': LooseVersion('1.7'),
'parallel': LooseVersion('20191122'),
}

Expand Down Expand Up @@ -251,7 +251,7 @@ def CheckEnvs(args):
'python': LooseVersion(sys.version.split()[0]),
'pypy': check_version(tool=pypy, pos=0, is_pypy=True),
'samtools': check_version(tool=samtools, pos=1),
'whatshap': check_version(tool=whatshap, pos=1),
'whatshap': check_version(tool=whatshap, pos=0),
'parallel': check_version(tool=parallel, pos=2),
}
check_tools_version(tool_version, required_tool_version)
Expand Down
15 changes: 11 additions & 4 deletions preprocess/CreateTensorFullAlignment.py
Original file line number Diff line number Diff line change
Expand Up @@ -160,7 +160,8 @@ def get_tensor_info(base_info, bq, ref_base, read_mq):
base_upper = evc_base_from(base_upper)
ALT_BASE = ACGT_NUM[base_upper]

REF_BASE = ACGT_NUM[ref_base]
#REF_BASE = ACGT_NUM[ref_base]
REF_BASE = ACGT_NUM[ref_base if ref_base.upper() in "ACGT" else 'A']
if len(indel) and indel[0] in '+-':
if indel[0] == "+":
ins_base = indel[1:].upper()
Expand Down Expand Up @@ -683,9 +684,15 @@ def samtools_pileup_generator_from(samtools_mpileup_process):
continue
pileup_bases = columns[4]
raw_base_quality = columns[5]
read_name_list = columns[6].split(',')
raw_mapping_quality = columns[7]
reference_base = evc_base_from(reference_sequence[pos - reference_start].upper()) # ev
#read_name_list = columns[6].split(',')
#raw_mapping_quality = columns[7]
#reference_base = evc_base_from(reference_sequence[pos - reference_start].upper()) # ev
# samtools change mapping quality and read name order in v1.15.1
mq_index = 6 if len(columns[6]) <= len(columns[7]) else 7
rn_index = 7 if mq_index == 6 else 6
raw_mapping_quality = columns[mq_index]
read_name_list = columns[rn_index].split(',')
reference_base = reference_sequence[pos - reference_start].upper()
base_list, depth, pass_af, af = decode_pileup_bases(pileup_bases=pileup_bases,
reference_base=reference_base,
minimum_af_for_candidate=minimum_af_for_candidate,
Expand Down
1 change: 1 addition & 0 deletions preprocess/CreateTensorPileup.py
Original file line number Diff line number Diff line change
Expand Up @@ -256,6 +256,7 @@ def CreateTensorPileup(args):
if is_ctg_range_given:
extend_start = ctg_start - no_of_positions
extend_end = ctg_end + no_of_positions
extend_start = max(1, extend_start)
reads_regions.append(region_from(ctg_name=ctg_name, ctg_start=extend_start, ctg_end=extend_end))
reference_start, reference_end = ctg_start - param.expandReferenceRegion, ctg_end + param.expandReferenceRegion
reference_start = 1 if reference_start < 1 else reference_start
Expand Down
1 change: 1 addition & 0 deletions preprocess/utils.py
Original file line number Diff line number Diff line change
Expand Up @@ -572,6 +572,7 @@ def _print_vcf_header(self):
##FILTER=<ID=RefCall,Description="Reference call">
##INFO=<ID=P,Number=0,Type=Flag,Description="Result from pileup calling">
##INFO=<ID=F,Number=0,Type=Flag,Description="Result from full-alignment calling">
##INFO=<ID=T,Number=0,Type=Flag,Description="Result from trio model calling">
##ALT=<ID=NON_REF,Description="Represents any possible alternative allele at this location">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position (for use with symbolic alleles)">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
Expand Down
36 changes: 15 additions & 21 deletions run_clair3_trio.sh
Original file line number Diff line number Diff line change
Expand Up @@ -5,7 +5,7 @@ Usage="Usage: ./${SCRIPT_NAME} --bam_fn_c=BAM --bam_fn_p1=BAM --bam_fn_p2=BAM --


# ENTRANCE SCRIPT FOR CLAIR3-TRIO, SETTING VARIABLE AND CALL TRIO
VERSION='v0.5'
VERSION='v0.6'

set -e
#./run_clair3_trio.sh --bam_fn_c=child_bam --bam_fn_p1=parent1 --bam_fn_p2=parent2 -f ref.fasta -t 32 -o tmp -p --model_path_clair3=model_path --model_path_clair3_trio=model_path
Expand Down Expand Up @@ -49,20 +49,14 @@ print_help_messages()
echo $'--include_all_ctgs Call variants on all contigs, otherwise call in chr{1..22,X,Y} and {1..22,X,Y}, default: disable.'
echo $'--snp_min_af=FLOAT Minimum SNP AF required for a candidate variant. Lowering the value might increase a bit of sensitivity in trade of speed and accuracy, default: ont:0.08,hifi:0.08,ilmn:0.08.'
echo $'--indel_min_af=FLOAT Minimum Indel AF required for a candidate variant. Lowering the value might increase a bit of sensitivity in trade of speed and accuracy, default: ont:0.15,hifi:0.08,ilmn:0.08.'
echo $'--remove_intermediate_dir Remove intermediate directory, including intermediate phased BAM, pileup and full-alignment results. default: disable.'
echo $'--trio_model_prefix=STR Model prefix in trio calling, including $prefix.data-00000-of-00002, $prefix.data-00001-of-00002 $prefix.index, default: trio.'
echo $'--enable_phasing EXPERIMENTAL: Output phased variants using whatshap, default: disable.'
echo $'--enable_output_phasing Output phased variants using whatshap, default: disable.'
echo $'--enable_output_haplotagging Output enable_output_haplotagging BAM variants using whatshap, default: disable.'
echo $'--enable_phasing It means `--enable_output_phasing`. The option is retained for backward compatibility.'
echo $'--var_pct_full=FLOAT EXPERIMENTAL: Specify an expected percentage of low quality 0/1 and 1/1 variants called in the pileup mode for full-alignment mode calling, default: 0.3.'
echo $'--ref_pct_full=FLOAT EXPERIMENTAL: Specify an expected percentage of low quality 0/0 variants called in the pileup mode for full-alignment mode calling, default: 0.1 .'
echo $'--var_pct_phasing=FLOAT EXPERIMENTAL: Specify an expected percentage of high quality 0/1 variants used in WhatsHap phasing, default: 0.8 for ont guppy5 and 0.7 for other platforms.'
echo $'--pileup_model_prefix=STR EXPERIMENTAL: Model prefix in pileup calling, including $prefix.data-00000-of-00002, $prefix.data-00001-of-00002 $prefix.index. default: pileup.'
echo $'--fast_mode EXPERIMENTAL: Skip variant candidates with AF <= 0.15, default: disable.'
echo $'--haploid_precise EXPERIMENTAL: Enable haploid calling mode. Only 1/1 is considered as a variant, default: disable.'
echo $'--haploid_sensitive EXPERIMENTAL: Enable haploid calling mode. 0/1 and 1/1 are considered as a variant, default: disable.'
echo $'--call_snp_only EXPERIMENTAL: Call candidates pass SNP minimum AF only, ignore Indel candidates, default: disable.'
echo $'--enable_long_indel EXPERIMENTAL: Call long Indel variants(>50 bp), default: disable.'
echo $'--no_phasing_for_fa EXPERIMENTAL: Call variants without whatshap phasing in full alignment calling, default: disable.'
echo $'-p, --platform=STR EXPERIMENTAL: [DO NOT CHANGE] Select the sequencing platform of the input. Possible options: {ont}.'
echo $''
}

Expand All @@ -79,7 +73,7 @@ NC="\\033[0m"
ARGS=`getopt -o b:f:t:p:o:hv \
-l bam_fn_c:,bam_fn_p1:,bam_fn_p2:,ref_fn:,threads:,model_path_clair3:,model_path_clair3_trio:,platform:,output:,\
bed_fn::,vcf_fn::,ctg_name::,sample_name_c::,sample_name_p1::,sample_name_p2::,qual::,samtools::,python::,pypy::,parallel::,whatshap::,chunk_num::,chunk_size::,var_pct_full::,ref_pct_full::,var_pct_phasing::,\
resumn::,snp_min_af::,indel_min_af::,pileup_model_prefix::,trio_model_prefix::,fast_mode,gvcf,pileup_only,pileup_phasing,print_ref_calls,haploid_precise,haploid_sensitive,include_all_ctgs,no_phasing_for_fa,call_snp_only,remove_intermediate_dir,enable_phasing,enable_long_indel,gvcf,help,version -n 'run_clair3_trio.sh' -- "$@"`
resumn::,snp_min_af::,indel_min_af::,pileup_model_prefix::,trio_model_prefix::,fast_mode,gvcf,pileup_only,pileup_phasing,print_ref_calls,haploid_precise,haploid_sensitive,include_all_ctgs,no_phasing_for_fa,call_snp_only,remove_intermediate_dir,enable_output_phasing,enable_output_haplotagging,enable_phasing,enable_long_indel,gvcf,help,version -n 'run_clair3_trio.sh' -- "$@"`

if [ $? != 0 ] ; then echo"No input. Terminating...">&2 ; exit 1 ; fi
eval set -- "${ARGS}"
Expand Down Expand Up @@ -118,6 +112,8 @@ INCLUDE_ALL_CTGS=False
NO_PHASING=False
RM_TMP_DIR=False
ENABLE_PHASING=False
ENABLE_OUTPUT_PHASING=False
ENABLE_OUTPUT_HAPLOTAGGING=False
ENABLE_LONG_INDEL=False
PILEUP_PREFIX="pileup"
TRIO_PREFIX="trio"
Expand Down Expand Up @@ -168,6 +164,8 @@ while true; do
--remove_intermediate_dir ) RM_TMP_DIR=True; shift 1 ;;
--enable_long_indel ) ENABLE_LONG_INDEL=True; shift 1 ;;
--enable_phasing ) ENABLE_PHASING=True; shift 1 ;;
--enable_output_phasing ) ENABLE_OUTPUT_PHASING=True; shift 1 ;;
--enable_output_haplotagging ) ENABLE_OUTPUT_HAPLOTAGGING=True; shift 1 ;;

-- ) shift; break; ;;
-h|--help ) print_help_messages; exit 0 ;;
Expand Down Expand Up @@ -230,6 +228,8 @@ OUTPUT_FOLDER=$(echo ${OUTPUT_FOLDER%*/})
MODEL_PATH_C3=$(echo ${MODEL_PATH_C3%*/})
MODEL_PATH_C3T=$(echo ${MODEL_PATH_C3T%*/})
if [ ${GVCF} ]; then SHOW_REF=True ; fi
if [ ${ENABLE_PHASING} ]; then ENABLE_OUTPUT_PHASING=True ; fi
if [ ${ENABLE_OUTPUT_HAPLOTAGGING} ]; then ENABLE_OUTPUT_PHASING=True ; fi

# optional parameters should use "="
(time (
Expand Down Expand Up @@ -264,19 +264,12 @@ echo "[INFO] FULL ALIGN REFERENCE PROPORTION: ${REF_PRO}"
echo "[INFO] PHASING PROPORTION: ${PHASING_PCT}"
if [ ${SNP_AF} -gt 0 ]; then echo "[INFO] USER DEFINED SNP THRESHOLD: ${SNP_AF}"; fi
if [ ${INDEL_AF} -gt 0 ]; then echo "[INFO] USER DEFINED INDEL THRESHOLD: ${INDEL_AF}"; fi
echo "[INFO] ENABLE FILEUP ONLY CALLING: ${PILEUP_ONLY}"
echo "[INFO] ENABLE PILEUP CALLING AND PHASING: ${PILEUP_PHASING}"
echo "[INFO] ENABLE FAST MODE CALLING: ${FAST_MODE}"
echo "[INFO] ENABLE CALLING SNP CANDIDATES ONLY: ${SNP_ONLY}"
echo "[INFO] ENABLE PRINTING REFERENCE CALLS: ${SHOW_REF}"
echo "[INFO] ENABLE OUTPUT GVCF: ${GVCF}"
echo "[INFO] ENABLE HAPLOID PRECISE MODE: ${HAP_PRE}"
echo "[INFO] ENABLE HAPLOID SENSITIVE MODE: ${HAP_SEN}"
echo "[INFO] ENABLE INCLUDE ALL CTGS CALLING: ${INCLUDE_ALL_CTGS}"
echo "[INFO] ENABLE NO PHASING FOR FULL ALIGNMENT: ${NO_PHASING}"
echo "[INFO] ENABLE REMOVING INTERMEDIATE FILES: ${RM_TMP_DIR}"
echo "[INFO] ENABLE PHASING VCF OUTPUT: ${ENABLE_PHASING}"
echo "[INFO] ENABLE LONG INDEL CALLING: ${ENABLE_LONG_INDEL}"
echo "[INFO] ENABLE PHASING VCF OUTPUT: ${ENABLE_OUTPUT_PHASING}"
echo "[INFO] ENABLE HAPLOTAGGING BAM OUTPUT: ${ENABLE_OUTPUT_HAPLOTAGGING}"
echo $''

# file check
Expand Down Expand Up @@ -381,7 +374,8 @@ ${SCRIPT_PATH}/trio/Call_Clair3_Trio.sh \
--pileup_model_prefix=${PILEUP_PREFIX} \
--trio_model_prefix=${TRIO_PREFIX} \
--remove_intermediate_dir=${RM_TMP_DIR} \
--enable_phasing=${ENABLE_PHASING} \
--enable_phasing=${ENABLE_OUTPUT_PHASING} \
--enable_output_haplotagging=${ENABLE_OUTPUT_HAPLOTAGGING} \
--enable_long_indel=${ENABLE_LONG_INDEL}


Expand Down
Loading

0 comments on commit a24629e

Please sign in to comment.