Hg19 and hg38 from same workflow version #19

Open · wants to merge 49 commits into master
Conversation

@NagaComBio NagaComBio (Member) commented Mar 5, 2021

Overview

This merges the hg38 indel calling branch into the main hg19 branch. hg19 somatic indel analysis results remain the same as with version 2.6.0. There are some changes in the sample swap check that affect both hg19 and hg38.

Main

  • Support for both hg19 and hg38 (GRCh38_decoy_ebv_phiX_alt_hla_chr)
  • Support for both BAM and CRAM
  • Changes in the sample swap check
    • Because the new germline workflow is now available, the ANNOVAR annotation of the germline variants was removed.
    • Likewise, the bias annotations for the Platypus SNVs in the sample swap detection were removed.
  • The EVS and ExAC columns were removed from the hg19 annotations, so result files are not fully backward compatible; this requires a major version bump.

Minor

  • No control workflow
    • Local control max AF increased to 0.05
    • Annotation SNP_support_germline is based on the MAF filter value provided in the XML.

Integration test

  • Integration tests were performed for different scenarios; the results are in Phabricator T5184#87114

Summary by CodeRabbit

  • New Features

    • Support for hg38/GRCh38 reference in analysis tools.
    • New options for analysis workflows: --skipREMAP and runlowmaf.
    • Enhanced confidence scoring in variant analysis.
    • Added raw clustering and filtering steps in TiNDA-related processes.
    • Introduced maximum allele frequency controls in variant annotation.
  • Enhancements

    • Updated documentation to version 4.0.0 with detailed descriptions of new features and updates.
    • Improved plot generation to use raw clusters.
    • Streamlined annotation processes for somatic variants.
    • Reformatted command options for better readability in workflows.
  • Bug Fixes

    • Fixed issues with reading genomic data files.
    • Adjusted filtering logic in sample swap checks.
    • Removed outdated annotations and conditions from variant filtering scripts.
  • Refactor

    • Updated COWorkflowsBasePlugin to version 1.4.2.
    • Refactored file output and joining operations in analysis scripts.
    • Restructured annotation process for more accurate variant counting.
  • Documentation

    • Revised README.md to reflect new features and workflow adjustments.

@NagaComBio NagaComBio marked this pull request as ready for review March 12, 2021 10:55
@NagaComBio NagaComBio requested a review from vinjana March 12, 2021 10:55
@vinjana vinjana (Member) left a comment

Quite a few things have changed. In particular, the changes make the code behave differently (bias filtering), and some cases are no longer covered (e.g. /Tumor_Somatic/).


print "\n$annotation_command\n";
my $runAnnotation = system($annotation_command);

if($runAnnotation != 0 ) {
`rm $jsonFile`;
Member:

Suggested change
`rm $jsonFile`;
`rm '$jsonFile'`;
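
Quoting the interpolated variable protects against paths containing spaces or shell metacharacters. As a sketch of an alternative (an assumption, not part of this PR), Perl's built-in unlink avoids the shell entirely and therefore has no quoting issues:

unlink($jsonFile) or warn "Could not remove '$jsonFile': $!";  # no subshell, no quoting pitfalls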

description="-dbtype option used for the ANNOVAR_BINARY (see above)."/>

<!--Reference annotations-->
<cvalue name="CONFIDENCE_OPTS_INDEL" value="--refgenome GRCh38 ftp://ftp.sanger.ac.uk/pub/cancer/dockstore/human/GRCh38_hla_decoy_ebv/core_ref_GRCh38_hla_decoy_ebv.tar.gz" type="string" />
Member:

Could you add a description which values are allowed for refgenome?

@@ -213,14 +217,12 @@ def main(args):
    penalties = ""
    infos = []

-    if args.no_control:
+    if args.no_control or args.refgenome[0] == 'GRCh38' or args.skipREMAP:
Member:

  • Here "GRCh38" and above "hs37d5"? Wouldn't it be better to use the same identifier namespace, e.g. "GRCh37" und "GRCh38"?
  • Is the d5 important here?
  • What about using a functions is_hg37 and is_hg38 that even test multiple alternative identifiers (e.g. hg37, GRCh37, hs37d5 and GRCh38, hg38). This would also avaid that you have to repeat the strings multiple times.
  • At least (and in any case) it should be documented outside the code, which values are allowed.

The if/elses on the assembly name repeat multiple times. You may consider putting all code into a classes.

Member Author (NagaComBio):

  • I have observed the inconsistency, but I have left hs37d5 in for legacy reasons and worked around it.
  • hs37d5 is like a version identifier for the 1000 Genomes reference build of GRCh37.
  • Yes, it could be improved, I will look into these.

@vinjana vinjana (Member) commented Apr 4, 2023:

You could for instance add two functions grch37 and grch38 that each return true or false (0/1 in Perl :P) on a match against one or more equivalent identifiers. Then you could match hg37, hg19, hg38, hs37d5, GRCh37, GRCh38, etc. all equivalently.

If you want to get rid of the if/elses completely you'd probably have to extract the branch-code into methods on two classes, e.g. GRCh37 and GRCh38, both implementing the Assembly interface, or so. Not sure whether you want to go that far.
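
For illustration, a minimal Perl sketch of such alias-matching helpers; the function names follow the suggestion above, while the exact alias lists are assumptions rather than code from this PR:

sub grch37 {
    my ($assembly) = @_;
    # 1 if the identifier is any known alias of GRCh37/hg19, 0 otherwise
    return scalar grep { lc($assembly) eq lc($_) } qw(GRCh37 hg19 hg37 hs37d5);
}

sub grch38 {
    my ($assembly) = @_;
    # 1 if the identifier is any known alias of GRCh38/hg38, 0 otherwise
    return scalar grep { lc($assembly) eq lc($_) } qw(GRCh38 hg38);
}

Calling code could then fail fast on anything unrecognized, e.g. die "Unknown assembly: $assembly" unless grch37($assembly) or grch38($assembly);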

Member:

I just looked at the hg38 branch and saw that we had a related discussion about the used identifiers there. Probably the ideal would be to test these identifiers already in the Java/Groovy code, maybe even map from aliases to the canonical names used everywhere else in the code (e.g. hg19->hs37d5). However, maybe it is sufficient to have the identifiers documented well in the XML and maybe the README.

The room for improvement of the if/else spaghetti code remains a valid point, though.

README.md Outdated
@@ -83,6 +83,22 @@ TBD

# Changelist
Member:

Are there new workflow options, or were options removed that users may have to change (e.g. if the workflow is to be run with moved reference directories, dependent on the sample type, the scientific question, the sequencing type (WES/WGS), etc.)?

Would you add a description of these options to the README and ensure they are properly documented in the XMLs (e.g. with a short description of the meaning, the expected file content and format, and the default value)?

Member:

Consider limiting the heap-size with this approach.


coderabbitai bot commented Jan 24, 2024

Walkthrough

The update to version 4.0.0 brings substantial changes, including support for the hg38/GRCh38 reference genome, updates to the confidence scoring script, removal of certain annotations, and addition of new options like --skipREMAP and runlowmaf. The TiNDA-related processes have been adjusted. Additionally, the COWorkflowsBasePlugin is updated to 1.4.2. These changes reflect a shift towards more current genomic databases and a refinement of analysis tools for indel calling workflows.

Changes

  • README.md, buildinfo.txt — Updated to version 4.0.0, introducing hg38/GRCh38 support and the --skipREMAP and runlowmaf options, and updating COWorkflowsBasePlugin to 1.4.2.
  • buildversion.txt — Version updated from 3.1.1 to 4.0.0.
  • .../indelCallingWorkflow/TiN_CanopyBasedClustering.R — Improved chr.length reading, added a shiftLength column and raw filtering steps, and updated the plots and rescueInfo table to use raw clusters.
  • .../indelCallingWorkflow/checkSampleSwap_TiN.pl — Overhauled parameters, paths, filtering settings, the annotation process, and somatic variant counting, and added an allele frequency parsing subroutine.
  • .../indelCallingWorkflow/confidenceAnnotation_Indels.py — Introduced a new GnomAD frequency filter, updated headers, refined the confidence scoring logic, and added low-MAF punishment options.
  • .../indelCallingWorkflow/environments/tbi-cluster_checkSampleSwap.sh — Added a bash script for setting up the Indel Calling Workflow environment on a cluster.
  • .../indelCallingWorkflow/filter_vcf.sh — Removed the conditions related to FILTER_ExAC and FILTER_EVS and added a condition related to FILTER_1KGENOMES.
  • .../indelCallingWorkflow/indelCalling.sh, .../indelCallingWorkflow/platypusIndelAnnotation.sh — Reformatted commands for better readability, added a sorting step for indels, and modified the annotation pipeline for VCF files, replacing annotations for certain databases with GNOMAD WES and GNOMAD WGS databases; introduced new parameters for maximum MAF thresholds for various databases.


@coderabbitai coderabbitai bot left a comment

Review Status

Actionable comments generated: 9

Configuration used: CodeRabbit UI

Commits: files that changed from the base of the PR between 19ce429 and 61d12f4.
Files ignored due to path filters (7)
  • IndelCallingWorkflow.jar is excluded by: !**/*.jar
  • docs/images/denbi.png is excluded by: !**/*.png
  • resources/analysisTools/indelCallingWorkflow/environments/conda.yml is excluded by: !**/*.yml
  • resources/analysisTools/indelCallingWorkflow/readbiasfunctions.pyc is excluded by: !**/*.pyc
  • resources/analysisTools/indelCallingWorkflow/readbiasplots.pyc is excluded by: !**/*.pyc
  • resources/configurationFiles/analysisIndelCalling.xml is excluded by: !**/*.xml
  • resources/configurationFiles/analysisIndelCallingGRCh38.xml is excluded by: !**/*.xml
Files selected for processing (11)
  • README.md (1 hunks)
  • buildinfo.txt (1 hunks)
  • buildversion.txt (1 hunks)
  • resources/analysisTools/indelCallingWorkflow/TiN_CanopyBasedClustering.R (7 hunks)
  • resources/analysisTools/indelCallingWorkflow/checkSampleSwap_TiN.pl (10 hunks)
  • resources/analysisTools/indelCallingWorkflow/checkSampleSwap_TiN.sh (1 hunks)
  • resources/analysisTools/indelCallingWorkflow/confidenceAnnotation_Indels.py (12 hunks)
  • resources/analysisTools/indelCallingWorkflow/environments/tbi-cluster_checkSampleSwap.sh (1 hunks)
  • resources/analysisTools/indelCallingWorkflow/filter_vcf.sh (1 hunks)
  • resources/analysisTools/indelCallingWorkflow/indelCalling.sh (2 hunks)
  • resources/analysisTools/indelCallingWorkflow/platypusIndelAnnotation.sh (2 hunks)
Files skipped from review due to trivial changes (3)
  • buildversion.txt
  • resources/analysisTools/indelCallingWorkflow/environments/tbi-cluster_checkSampleSwap.sh
  • resources/analysisTools/indelCallingWorkflow/indelCalling.sh
Additional comments: 18
buildinfo.txt (1)
  • 1-1: The dependency version for COWorkflowsBasePlugin has been updated from 1.4.1 to 1.4.2.

This change is consistent with the PR objectives and AI-generated summary.

resources/analysisTools/indelCallingWorkflow/checkSampleSwap_TiN.sh (4)
  • 31-34: Variables related to genomic databases have been renamed to reflect the updated database versions.

The changes align with the PR objectives to support hg38 alongside hg19.

  • 39-39: The --chr_prefix variable is set using the CHR_PREFIX environment variable.

Assuming CHR_PREFIX is correctly set in the environment, this change is appropriate.

  • 42-42: The --chrLengthFile parameter has been updated to use ${CHROM_SIZES_FILE}.

This change is consistent with the PR objectives and AI-generated summary.

  • 50-50: Output file parameters have been renamed to reflect their new purpose.

Renaming these parameters clarifies their usage in the context of the updated workflow.

resources/analysisTools/indelCallingWorkflow/filter_vcf.sh (1)
  • 33-33: The FILTER_1KGENOMES condition has been added to the FILTER_VALUES construction.

This addition is consistent with the PR objectives to remove ExAC and EVS conditions and add support for 1KGENOMES.

resources/analysisTools/indelCallingWorkflow/platypusIndelAnnotation.sh (2)
  • 55-56: Annotations related to GNOMAD WES and WGS databases have been added.

This change aligns with the PR objectives to update the annotation pipeline.

  • 127-128: New parameters for maximum MAF thresholds have been introduced in the confidence annotation step.

The addition of these parameters is consistent with the PR objectives to refine the confidence annotation process.

resources/analysisTools/indelCallingWorkflow/TiN_CanopyBasedClustering.R (5)
  • 64-65: The chr.length data frame is now read using read_tsv and a new column shiftLength is added.

This change is consistent with the PR objectives to improve the reading of chromosome length data.

  • 127-129: A raw filtering step has been added using mutate to create a new column rawCluster.

This addition is consistent with the PR objectives to introduce a new raw filtering step for cluster analysis.

  • 191-196: Plots have been modified to use raw clusters instead of canopy clusters.

This change aligns with the PR objectives to adjust plotting functions to utilize raw clusters.

  • 230-243: The rescueInfo table creation has been refactored to include raw clustering information.

This refactoring is consistent with the PR objectives to include raw clustering information in the rescueInfo table.

  • 287-291: Minor adjustments have been made to file output and joining operations.

These adjustments are consistent with the PR objectives to refine the script.

README.md (1)
  • 96-110: The changelog in the README.md file provides a clear summary of the major, minor, and patch changes for the version update to 4.0.0. It is important to ensure that all mentioned changes are reflected in the updated code and documentation.
resources/analysisTools/indelCallingWorkflow/checkSampleSwap_TiN.pl (4)
  • 21-25: The script parameters have been updated to remove $biasScript and add $chr_prefix, $outfile_tindaVCF, and $outfile_SJ. Ensure that all references to the removed parameter have been updated accordingly and that the new parameters are used correctly throughout the script.
Verification successful

The verification process confirms that $biasScript is no longer used in the checkSampleSwap_TiN.pl script. The new parameters $chr_prefix, $outfile_tindaVCF, and $outfile_SJ have been added and are being used correctly within the script.

Scripts Executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Verify that $biasScript is no longer used in the script.
rg --type perl '$biasScript' resources/analysisTools/indelCallingWorkflow/checkSampleSwap_TiN.pl
# Verify that the new parameters are used correctly.
ast-grep --lang perl --pattern $'my ($$$, $chr_prefix, $$$, $outfile_tindaVCF, $$$, $outfile_SJ) = $$$;' resources/analysisTools/indelCallingWorkflow/checkSampleSwap_TiN.pl

Length of output: 381



Script:

#!/bin/bash
# Confirm that $biasScript is not used in the script.
rg --type perl '\$biasScript' resources/analysisTools/indelCallingWorkflow/checkSampleSwap_TiN.pl

# Verify that the new parameters $chr_prefix, $outfile_tindaVCF, and $outfile_SJ are used correctly in the script.
rg --type perl '\$chr_prefix' resources/analysisTools/indelCallingWorkflow/checkSampleSwap_TiN.pl
rg --type perl '\$outfile_tindaVCF' resources/analysisTools/indelCallingWorkflow/checkSampleSwap_TiN.pl
rg --type perl '\$outfile_SJ' resources/analysisTools/indelCallingWorkflow/checkSampleSwap_TiN.pl

Length of output: 982

  • 396-416: The script has been updated to count the number of somatic variants. Ensure that the logic for counting and classifying somatic variants is consistent with the updated workflow requirements and that the counts are used appropriately downstream.
Verification successful

The verification process has confirmed the presence of the patterns Tumor_Somatic_Common, Tumor_Somatic_Rare, Control_Somatic_Common, and Control_Somatic_Rare in the checkSampleSwap_TiN.pl script. This indicates that the script contains logic to count somatic variants based on these categories, which aligns with the provided code snippet.

Scripts Executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Verify that the logic for counting somatic variants is consistent with the workflow requirements.
ast-grep --lang perl --pattern $'if ($$$ =~ /Tumor_Somatic_Common/) { $$$ }' resources/analysisTools/indelCallingWorkflow/checkSampleSwap_TiN.pl
ast-grep --lang perl --pattern $'if ($$$ =~ /Tumor_Somatic_Rare/) { $$$ }' resources/analysisTools/indelCallingWorkflow/checkSampleSwap_TiN.pl
ast-grep --lang perl --pattern $'if ($$$ =~ /Control_Somatic_Common/) { $$$ }' resources/analysisTools/indelCallingWorkflow/checkSampleSwap_TiN.pl
ast-grep --lang perl --pattern $'if ($$$ =~ /Control_Somatic_Rare/) { $$$ }' resources/analysisTools/indelCallingWorkflow/checkSampleSwap_TiN.pl

Length of output: 1019



Script:

#!/bin/bash
# Use ripgrep to verify the logic for counting somatic variants in the Perl script.
rg 'if.*Tumor_Somatic_Common.*\{.*\}' resources/analysisTools/indelCallingWorkflow/checkSampleSwap_TiN.pl
rg 'if.*Tumor_Somatic_Rare.*\{.*\}' resources/analysisTools/indelCallingWorkflow/checkSampleSwap_TiN.pl
rg 'if.*Control_Somatic_Common.*\{.*\}' resources/analysisTools/indelCallingWorkflow/checkSampleSwap_TiN.pl
rg 'if.*Control_Somatic_Rare.*\{.*\}' resources/analysisTools/indelCallingWorkflow/checkSampleSwap_TiN.pl

Length of output: 431



Script:

#!/bin/bash
# Use ripgrep with a more lenient pattern to verify the logic for counting somatic variants in the Perl script.
rg 'Tumor_Somatic_Common' resources/analysisTools/indelCallingWorkflow/checkSampleSwap_TiN.pl
rg 'Tumor_Somatic_Rare' resources/analysisTools/indelCallingWorkflow/checkSampleSwap_TiN.pl
rg 'Control_Somatic_Common' resources/analysisTools/indelCallingWorkflow/checkSampleSwap_TiN.pl
rg 'Control_Somatic_Rare' resources/analysisTools/indelCallingWorkflow/checkSampleSwap_TiN.pl

Length of output: 541

  • 453-453: The script includes commands to remove intermediate files. Ensure that these cleanup operations are necessary and that no data required for downstream processes is being deleted.
Verification successful

The search results indicate that the filenames constructed with snvs_.*\.GTfiltered_gnomAD\.SomaticIn\.vcf and snvs_.*\.GTfiltered_gnomAD\.Germline\.Rare\.txt patterns do not appear elsewhere in the codebase, except for their definitions in checkSampleSwap_TiN.pl. This suggests that these files are not referenced in other parts of the codebase and are likely safe to delete as part of cleanup operations.

However, the pattern snvs_.*\.GTfiltered_gnomAD\.Germline\.Rare\.vcf is found in the checkSampleSwap_TiN.pl script, which is the same file where the deletion command is issued. This is expected as it is the definition of the variable that holds the filename.

Since no other references to these files were found in the codebase, it can be inferred that the cleanup operation in the checkSampleSwap_TiN.pl script does not remove data required for downstream processes within the scope of the codebase we have access to.

Scripts Executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Verify that the removed files are not required for downstream processes.
rg --type perl '$snvsGT_RawFile' resources/analysisTools/indelCallingWorkflow/checkSampleSwap_TiN.pl
rg --type perl '$snvsGT_gnomADFile' resources/analysisTools/indelCallingWorkflow/checkSampleSwap_TiN.pl

Length of output: 208



Script:

#!/bin/bash
# Find where the variables $snvsGT_somatic, $snvsGT_germlineRare, and $snvsGT_germlineRare_txt are assigned values
rg --type perl '\$snvsGT_somatic\s*=' resources/analysisTools/indelCallingWorkflow/checkSampleSwap_TiN.pl
rg --type perl '\$snvsGT_germlineRare\s*=' resources/analysisTools/indelCallingWorkflow/checkSampleSwap_TiN.pl
rg --type perl '\$snvsGT_germlineRare_txt\s*=' resources/analysisTools/indelCallingWorkflow/checkSampleSwap_TiN.pl

# Assuming the above commands give us the filenames, we will then search for these filenames across the codebase.
# Placeholder for searching the actual filenames across the codebase:
# rg 'actual_filename_of_$snvsGT_somatic'
# rg 'actual_filename_of_$snvsGT_germlineRare'
# rg 'actual_filename_of_$snvsGT_germlineRare_txt'

Length of output: 642



Script:

#!/bin/bash
# Search the entire codebase for references to the filenames identified.
rg 'snvs_.*\.GTfiltered_gnomAD\.SomaticIn\.vcf'
rg 'snvs_.*\.GTfiltered_gnomAD\.Germline\.Rare\.vcf'
rg 'snvs_.*\.GTfiltered_gnomAD\.Germline\.Rare\.txt'

Length of output: 668

  • 472-486: 📝 NOTE: This review was outside the diff hunks and was mapped to the diff hunk with the greatest overlap. Original lines [456-486]

The subroutine parse_AF is introduced to parse allele frequencies from annotation matches. Ensure that this subroutine is robust and handles all expected input formats correctly. Additionally, the median function is added; verify that it is used correctly and that its results are handled appropriately.

@coderabbitai coderabbitai bot left a comment

Review Status

Actionable comments generated: 3

Configuration used: CodeRabbit UI

Commits: files that changed from the base of the PR between 61d12f4 and e4f9c9b.
Files selected for processing (2)
  • resources/analysisTools/indelCallingWorkflow/checkSampleSwap_TiN.pl (10 hunks)
  • resources/analysisTools/indelCallingWorkflow/confidenceAnnotation_Indels.py (13 hunks)
Additional comments: 16
resources/analysisTools/indelCallingWorkflow/checkSampleSwap_TiN.pl (7)
  • 21-25: The parameters $biasScript, $outfile_RG, $outfile_SR, and $outfile_AS have been removed, and new parameters $chr_prefix, $outfile_tindaVCF, and $outfile_SJ have been added. Ensure that the removal of parameters does not affect any downstream processes and that the new parameters are properly integrated and used within the script.
  • 453-463: > 📝 NOTE

This review was outside the diff hunks and was mapped to the diff hunk with the greatest overlap. Original lines [460-479]

A new subroutine parse_AF has been added to parse allele frequency from annotation lines. Ensure that this subroutine is robust against different formats of allele frequency data that may be encountered in the annotation lines.

  • 480-490: A median function has been added to calculate the median allele frequency. Verify that this function is called with the correct data and that it handles edge cases properly, such as when the list of values is empty or contains only one element (see the sketch after this list).
  • 193-193: The filtering logic has been updated to include a new chromosome prefix variable $chr_prefix. Confirm that this variable is being set correctly and that the updated regex pattern correctly matches the intended chromosome names.
  • 244-248: The annotation command has been updated to include new databases for annotation and remove some previous ones. Ensure that the new annotation sources are correctly formatted and accessible, and that the removal of the old sources does not negatively impact the analysis.
  • 432-445: The script calculates percentages for various metrics and stores them in a JSON file. Verify that the calculations are correct and that division by zero is handled appropriately.
  • 457-457: File cleanup operations have been updated to remove additional files. Confirm that these files are no longer needed and that their removal does not affect any subsequent steps in the workflow.
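
For the edge cases flagged above, here is a minimal Perl sketch of a median that handles empty and single-element lists. The name follows the review comment, but the signature and body are assumptions, since the actual implementation in checkSampleSwap_TiN.pl is not quoted in this diff.

sub median {
    my @sorted = sort { $a <=> $b } @_;
    return undef unless @sorted;                     # empty list: no median to report
    my $mid = int(@sorted / 2);
    return $sorted[$mid] if @sorted % 2;             # odd count: the middle element
    return ($sorted[$mid - 1] + $sorted[$mid]) / 2;  # even count: mean of the two middle values
}

A single-element list falls through to the odd-count branch and returns that element unchanged.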
resources/analysisTools/indelCallingWorkflow/confidenceAnnotation_Indels.py (9)
  • 55-67: The get_fixed_headers function has been modified to conditionally include headers based on the args.refgenome and args.no_control parameters. This is a good use of conditional logic to adapt the headers to the context of the analysis.
  • 70-87: The get_variable_headers function uses a dictionary to map header names to regular expressions. This is a maintainable approach that allows for easy updates to header names.
  • 89-98: The check_max_maf function is a new addition that checks if any allele frequency in a given column exceeds a maximum threshold. This is a critical function for filtering variants based on frequency data and is a good example of encapsulating logic into a reusable component.
  • 163-163: The addition of a new filter FREQ in the VCF header is consistent with the PR objectives to introduce new filtering criteria based on allele frequency data.
  • 179-180: The variables fixed_headers and variable_headers are set using the newly added functions. This is a good example of using functions to modularize code and make it more readable.
  • 283-299: > 📝 NOTE

This review was outside the diff hunks and was mapped to the diff hunk with the greatest overlap. Original lines [257-296]

The logic for setting flags based on various databases and allele frequencies has been refactored. This is a significant change that aligns with the PR objectives to adjust the annotation based on the MAF filter value from the XML.

  • 357-385: > 📝 NOTE

This review was outside the diff hunks and was mapped to the diff hunk with the greatest overlap. Original lines [313-368]

The confidence scoring logic has been modified, and while the previous review comments suggested refactoring this complex logic, the developer has chosen not to do so in this PR. The changes made here seem to be in line with the new allele frequency filtering logic.

  • 478-536: The region confidence scoring logic has been updated to include new conditions. This is a complex section of code that could benefit from further refactoring, but it appears to be correctly implemented based on the PR objectives.
  • 641-657: New command-line arguments have been added to control the maximum allele frequencies for different databases and the option to run low MAF punishment. This is consistent with the PR objectives and provides users with more control over the filtering process.

@coderabbitai coderabbitai bot left a comment

Review Status

Actionable comments generated: 2

Configuration used: CodeRabbit UI

Commits: files that changed from the base of the PR between e4f9c9b and cdd1988.
Files selected for processing (1)
  • resources/analysisTools/indelCallingWorkflow/confidenceAnnotation_Indels.py (13 hunks)
Additional comments: 6
resources/analysisTools/indelCallingWorkflow/confidenceAnnotation_Indels.py (6)
  • 55-67: The get_fixed_headers function dynamically adjusts headers based on the reference genome and control data presence. This is a good practice for maintainability and flexibility. However, ensure that all potential headers are accounted for in downstream processing.
  • 70-87: The get_variable_headers function effectively manages variable headers based on conditions such as the reference genome used and the presence of control data. This modular approach enhances code readability and maintainability.
  • 89-98: The check_max_maf function correctly identifies if any allele frequency exceeds the maximum allowed. This is crucial for filtering variants based on population frequency data. Ensure that the help dictionary structure and column_valid_key usage are consistent throughout the script.
  • 45-103: > 📝 NOTE

This review was outside the diff hunks and was mapped to the diff hunk with the greatest overlap. Original lines [100-114]

The header generation in the main function is well-structured, incorporating both fixed and additional headers. This approach allows for flexible header management. Ensure that all headers are correctly documented and used consistently in downstream analyses.

  • 179-180: The retrieval of fixed and variable headers using get_fixed_headers and get_variable_headers functions is correctly implemented. This ensures that the script dynamically adjusts to the input data's specifics, such as the presence of control samples and the reference genome used.
  • 639-655: The argument parsing section is comprehensive, providing a wide range of options for customizing the script's behavior. Ensure that the help messages are clear and informative for all users, particularly for complex options like --hetcontr and --homcontr.

Comment on lines 48 to 67
## Helper functions
def is_hg37(args):
    return args.refgenome[0] == "hs37d5"

def is_hg38(args):
    return args.refgenome[0] == "GRCh38"

def get_fixed_headers(args):
    fixed_headers = ["^QUAL$", "^INFO$", "^FILTER$", "MAPABILITY", "SIMPLE_TANDEMREPEATS",
                     "REPEAT_MASKER", "^CONFIDENCE$", "^CLASSIFICATION$", "^REGION_CONFIDENCE$",
                     "^PENALTIES$", "^REASONS$"]

    hs37d5_headers = ["DAC_BLACKLIST", "DUKE_EXCLUDED", "HISEQDEPTH", "SELFCHAIN"]
    if is_hg37(args):
        fixed_headers += hs37d5_headers

    if not args.no_control:
        fixed_headers += ["^INFO_control", "^ANNOTATION_control$"]

    return fixed_headers
coderabbitai bot:

The implementation of is_hg37 and is_hg38 relies on hard-coded reference genome names. This approach is not scalable or flexible for future reference genomes. Consider adopting a more dynamic method for genome version checks.

- return args.refgenome[0] == "hs37d5"
- return args.refgenome[0] == "GRCh38"
+ return args.refgenome[0].startswith("hs37")
+ return args.refgenome[0].startswith("GRCh38")

Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation.

Suggested change
## Helper functions
def is_hg37(args):
-     return args.refgenome[0] == "hs37d5"
+     return args.refgenome[0].startswith("hs37")

def is_hg38(args):
-     return args.refgenome[0] == "GRCh38"
+     return args.refgenome[0].startswith("GRCh38")

def get_fixed_headers(args):
    fixed_headers = ["^QUAL$", "^INFO$", "^FILTER$", "MAPABILITY", "SIMPLE_TANDEMREPEATS",
                     "REPEAT_MASKER", "^CONFIDENCE$", "^CLASSIFICATION$", "^REGION_CONFIDENCE$",
                     "^PENALTIES$", "^REASONS$"]

    hs37d5_headers = ["DAC_BLACKLIST", "DUKE_EXCLUDED", "HISEQDEPTH", "SELFCHAIN"]
    if is_hg37(args):
        fixed_headers += hs37d5_headers

    if not args.no_control:
        fixed_headers += ["^INFO_control", "^ANNOTATION_control$"]

    return fixed_headers

Comment on lines 473 to 537
## DUKE, DAC, Hiseq, Self chain are only available for hg19 reference genome
if args.refgenome[0] == "hs37d5" and not args.skipREMAP:
    if help["DUKE_EXCLUDED_VALID"] or help["DAC_BLACKLIST_VALID"] or help["HISEQDEPTH_VALID"]:
        region_conf -= 3  # really bad region, usually centromeric repeats
        reasons += "Blacklist(-3)"

    if help["ANNOVAR_SEGDUP_COL_VALID"] or help["SELFCHAIN_VALID"]:
        region_conf -= 1
        reasons += "SelfchainAndOrSegdup(-1)"
else:
    if help["ANNOVAR_SEGDUP_COL_VALID"]:
        region_conf -= 1
        reasons += "Segdup(-1)"

if any(word in help["REPEAT_MASKER"] for word in ["Simple_repeat", "Low_", "Satellite", ]) or help["SIMPLE_TANDEMREPEATS_VALID"]:
    region_conf -= 2
    reasons += "Repeat(-2)"
# other repeat elements to penalize at least a bit
elif help["REPEAT_MASKER_VALID"]:
    region_conf -= 1
    reasons += "Other_repeat(-1)"

# Mapability is 1 for unique regions, 0.5 for regions appearing twice, 0.33... 3times, ...
# Everything with really high number of occurences is artefacts
# does not always correlate with the above regions
# is overestimating badness bc. of _single_ end read simulations
if help["MAPABILITY"] == ".":
    # in very rare cases (CEN), there is no mapability => ".", which is not numeric but interpreted as 0
    region_conf -= 5
    reasons += "Not_mappable(-5)"
else:
    reduce = 0
    # can have several entries for indels, e.g. 0.5&0.25 - take worst (lowest) or best (highest)?!
    mapability = min(map(float, help["MAPABILITY"].split("&")))  # just chose one - here: min
    if mapability < 0.5:
        # 0.5 does not seem to be that bad: region appears another time in
        # the genome and we have paired end data!
        region_conf -= 1
        reduce += 1

        #if mapability < 0.4: # 3-4 times appearing region is worse but still not too bad
        #    region_conf -= 1
        #    reduce += 1

        if mapability < 0.25:  # > 4 times appearing region
            region_conf -= 1
            reduce += 1

        if mapability < 0.1:  # > 5 times is bad
            region_conf -= 2
            reduce += 2

        if mapability < 0.05:  # these regions are clearly very bad (Lego stacks)
            region_conf -= 3
            reduce += 3

    reasons += "Low_mappability(%s=>-%d)" % (help["MAPABILITY"], reduce)

if classification != "somatic" and not "PASS" in help["FILTER"]:
    # such an indel is probably also "unclear" and not "germline"
@coderabbitai coderabbitai bot commented Jan 26, 2024:

📝 NOTE
This review was outside the diff hunks and was mapped to the diff hunk with the greatest overlap. Original lines [257-534]

The variant classification and confidence scoring logic is extensive and handles a wide range of conditions. While the logic appears correct, the extensive use of magic numbers and complex conditional structures makes this section hard to maintain. Consider refactoring for clarity and maintainability, possibly by breaking down into smaller functions or using a more descriptive approach for magic numbers.

- confidence -= 1
+ confidence -= CONFIDENCE_DECREMENT_FOR_CONDITION

Member Author (NagaComBio):

Not working on this logic in this PR.

coderabbitai bot:

@NagaComBio Understood, thank you for the clarification. If there's anything else you need assistance with, feel free to ask!

@coderabbitai coderabbitai bot left a comment

Review Status

Actionable comments generated: 1

Configuration used: CodeRabbit UI

Commits: files that changed from the base of the PR between cdd1988 and 4a2b713.
Files selected for processing (1)
  • resources/analysisTools/indelCallingWorkflow/confidenceAnnotation_Indels.py (13 hunks)
Additional comments: 3
resources/analysisTools/indelCallingWorkflow/confidenceAnnotation_Indels.py (3)
  • 48-70: The addition of helper functions is_hg37, is_hg38, get_fixed_headers, and get_variable_headers improves modularity and readability by encapsulating specific logic for determining the reference genome and generating header lists based on conditions. This approach aligns well with best practices for maintainability and code clarity.
  • 91-100: The check_max_maf function introduces a streamlined way to filter variants based on allele frequency, enhancing the script's flexibility in handling different databases and their respective maximum allele frequency thresholds. This function is a valuable addition for maintaining the accuracy and specificity of variant annotation.
  • 475-539: > 📝 NOTE

This review was outside the diff hunks and was mapped to the diff hunk with the greatest overlap. Original lines [102-658]

The main function has been significantly updated to incorporate new command-line arguments for controlling maximum allele frequencies and to utilize the newly added helper functions for conditional logic based on the reference genome and control data presence. These changes enhance the script's functionality and adaptability to different datasets and analysis requirements. However, ensure that all new command-line arguments are documented and validated to prevent potential user errors.

Comment on lines 475 to 539
## DUKE, DAC, Hiseq, Self chain are only available for hg19 reference genome
if args.refgenome[0] == "hs37d5" and not args.skipREMAP:
    if help["DUKE_EXCLUDED_VALID"] or help["DAC_BLACKLIST_VALID"] or help["HISEQDEPTH_VALID"]:
        region_conf -= 3  # really bad region, usually centromeric repeats
        reasons += "Blacklist(-3)"

    if help["ANNOVAR_SEGDUP_COL_VALID"] or help["SELFCHAIN_VALID"]:
        region_conf -= 1
        reasons += "SelfchainAndOrSegdup(-1)"
else:
    if help["ANNOVAR_SEGDUP_COL_VALID"]:
        region_conf -= 1
        reasons += "Segdup(-1)"

if any(word in help["REPEAT_MASKER"] for word in ["Simple_repeat", "Low_", "Satellite", ]) or help["SIMPLE_TANDEMREPEATS_VALID"]:
    region_conf -= 2
    reasons += "Repeat(-2)"
# other repeat elements to penalize at least a bit
elif help["REPEAT_MASKER_VALID"]:
    region_conf -= 1
    reasons += "Other_repeat(-1)"

# Mapability is 1 for unique regions, 0.5 for regions appearing twice, 0.33... 3times, ...
# Everything with really high number of occurences is artefacts
# does not always correlate with the above regions
# is overestimating badness bc. of _single_ end read simulations
if help["MAPABILITY"] == ".":
    # in very rare cases (CEN), there is no mapability => ".", which is not numeric but interpreted as 0
    region_conf -= 5
    reasons += "Not_mappable(-5)"
else:
    reduce = 0
    # can have several entries for indels, e.g. 0.5&0.25 - take worst (lowest) or best (highest)?!
    mapability = min(map(float, help["MAPABILITY"].split("&")))  # just chose one - here: min
    if mapability < 0.5:
        # 0.5 does not seem to be that bad: region appears another time in
        # the genome and we have paired end data!
        region_conf -= 1
        reduce += 1

        #if mapability < 0.4: # 3-4 times appearing region is worse but still not too bad
        #    region_conf -= 1
        #    reduce += 1

        if mapability < 0.25:  # > 4 times appearing region
            region_conf -= 1
            reduce += 1

        if mapability < 0.1:  # > 5 times is bad
            region_conf -= 2
            reduce += 2

        if mapability < 0.05:  # these regions are clearly very bad (Lego stacks)
            region_conf -= 3
            reduce += 3

    reasons += "Low_mappability(%s=>-%d)" % (help["MAPABILITY"], reduce)

if classification != "somatic" and not "PASS" in help["FILTER"]:
    # such an indel is probably also "unclear" and not "germline"
coderabbitai bot:

📝 NOTE
This review was outside the diff hunks and was mapped to the diff hunk with the greatest overlap. Original lines [48-658]

While the script has been enhanced with new functionalities and improved modularity through the addition of helper functions, it's important to continuously evaluate the complexity of the variant annotation logic. Consider future refactoring opportunities to further simplify and decompose complex logic into smaller, more manageable functions. This will aid in maintainability and ease of understanding for new contributors to the project.

@NagaComBio NagaComBio requested a review from vinjana February 23, 2024 14:29