Hg19 and hg38 from same workflow version #19
base: master
Conversation
…tion of analysisIndelCalling.xml with paths to GRCh38 directories; changing tool code that required data that is not available for GRCh38.
Quite a few things are changed. In particular, the changes make the code behave differently (bias filtering), and some cases are not covered anymore (e.g. /Tumor_Somatic/).
resources/analysisTools/indelCallingWorkflow/TiN_CanopyBasedClustering.R (resolved)
resources/analysisTools/indelCallingWorkflow/TiN_CanopyBasedClustering.R (resolved)
resources/analysisTools/indelCallingWorkflow/checkSampleSwap_TiN.pl (resolved, outdated)
resources/analysisTools/indelCallingWorkflow/checkSampleSwap_TiN.pl (resolved, outdated)
```perl
print "\n$annotation_command\n";
my $runAnnotation = system($annotation_command);

if($runAnnotation != 0 ) {
    `rm $jsonFile`;
```
Suggested change:

```diff
-`rm $jsonFile`;
+`rm '$jsonFile'`;
```
```xml
description="-dbtype option used for the ANNOVAR_BINARY (see above)."/>

<!--Reference annotations-->
<cvalue name="CONFIDENCE_OPTS_INDEL" value="--refgenome GRCh38 ftp://ftp.sanger.ac.uk/pub/cancer/dockstore/human/GRCh38_hla_decoy_ebv/core_ref_GRCh38_hla_decoy_ebv.tar.gz" type="string" />
```
Could you add a description of which values are allowed for refgenome?
```diff
@@ -213,14 +217,12 @@ def main(args):
     penalties = ""
     infos = []

-    if args.no_control:
+    if args.no_control or args.refgenome[0] == 'GRCh38' or args.skipREMAP:
```
- Here "GRCh38" and above "hs37d5"? Wouldn't it be better to use the same identifier namespace, e.g. "GRCh37" and "GRCh38"?
- Is the d5 important here?
- What about using functions `is_hg37` and `is_hg38` that even test multiple alternative identifiers (e.g. hg37, GRCh37, hs37d5 and GRCh38, hg38)? This would also avoid having to repeat the strings multiple times.
- At least (and in any case) it should be documented outside the code which values are allowed.
The if/elses on the assembly name repeat multiple times. You may consider putting all that code into classes.
- I have observed the inconsistency, but I have left `hs37d5` in for legacy reasons and worked around it. `hs37d5` is like a version number for the 1000 Genomes reference genome of GRCh37.
- Yes, it could be improved; I will look into these.
You could for instance add two functions grch37 and grch38 that each return 'true' or 'false' (0/1 in Perl :P) on a match to one or multiple equivalent identifiers. Then you could also match hg37, hg19, hg38, hs37d5, GRCh37, GRCh38, etc. all equivalently.

If you want to get rid of the if/elses completely you'd probably have to extract the branch code into methods on two classes, e.g. `GRCh37` and `GRCh38`, both implementing an `Assembly` interface, or so. Not sure whether you want to go that far.
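A minimal Python sketch of such matcher functions; the alias sets below are assumptions for illustration, not the identifiers the workflow actually accepts:

```python
# Hypothetical sketch of assembly-identifier matchers. The alias sets are
# assumptions; the workflow's real accepted values would need confirming.
GRCH37_ALIASES = {"GRCh37", "hg19", "hg37", "hs37d5"}
GRCH38_ALIASES = {"GRCh38", "hg38"}

def grch37(identifier):
    # True if the identifier names the GRCh37 assembly under any known alias.
    return identifier in GRCH37_ALIASES

def grch38(identifier):
    # True if the identifier names the GRCh38 assembly under any known alias.
    return identifier in GRCH38_ALIASES
```

Centralizing the comparison this way means new aliases are added in one place instead of at every if/else.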
I just looked at the hg38 branch and saw that we had a related discussion about the used identifiers there. Probably the ideal would be to test these identifiers already in the Java/Groovy code, maybe even map from aliases to the canonical names used everywhere else in the code (e.g. hg19->hs37d5). However, maybe it is sufficient to have the identifiers documented well in the XML and maybe the README.
The room for improvement of the if/else spaghetti code remains a valid point, though.
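The alias-to-canonical mapping mentioned above (e.g. hg19 -> hs37d5) could be sketched as follows in Python; the alias table is illustrative, not the workflow's actual set:

```python
# Illustrative alias table. hs37d5 and GRCh38 are the canonical names used
# elsewhere in the code; the accepted aliases are an assumption.
CANONICAL_ASSEMBLY = {
    "hg19": "hs37d5", "hg37": "hs37d5", "GRCh37": "hs37d5", "hs37d5": "hs37d5",
    "hg38": "GRCh38", "GRCh38": "GRCh38",
}

def canonical_assembly(identifier):
    # Map an alias to its canonical name, failing loudly on unknown values.
    try:
        return CANONICAL_ASSEMBLY[identifier]
    except KeyError:
        raise ValueError("unknown assembly identifier: %s" % identifier)
```

Rejecting unknown identifiers up front would surface configuration typos early instead of silently taking the "not hs37d5" branch.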
README.md (Outdated)

```diff
@@ -83,6 +83,22 @@ TBD

 # Changelist
```
Are there new workflow options, or were options removed or changed in ways users may need to act on (e.g. if the workflow is to be run with moved reference directories, dependent on the sample type, the scientific question, or the sequencing type (WES/WGS), etc.)?
Would you add a description for these options to the README and ensure they are properly documented in the XMLs (e.g. with a short description of the meaning, the expected file content and format, and the default value)?
Consider limiting the heap-size with this approach.
Conflicts:
- README.md
- buildversion.txt
Walkthrough

The update to version 4.0.0 brings substantial changes, including support for the hg38/GRCh38 reference genome, updates to the confidence scoring script, removal of certain annotations, and the addition of new options.
Review Status
Actionable comments generated: 9
Configuration used: CodeRabbit UI
Files ignored due to path filters (7)
- `IndelCallingWorkflow.jar` is excluded by `!**/*.jar`
- `docs/images/denbi.png` is excluded by `!**/*.png`
- `resources/analysisTools/indelCallingWorkflow/environments/conda.yml` is excluded by `!**/*.yml`
- `resources/analysisTools/indelCallingWorkflow/readbiasfunctions.pyc` is excluded by `!**/*.pyc`
- `resources/analysisTools/indelCallingWorkflow/readbiasplots.pyc` is excluded by `!**/*.pyc`
- `resources/configurationFiles/analysisIndelCalling.xml` is excluded by `!**/*.xml`
- `resources/configurationFiles/analysisIndelCallingGRCh38.xml` is excluded by `!**/*.xml`
Files selected for processing (11)
- README.md (1 hunks)
- buildinfo.txt (1 hunks)
- buildversion.txt (1 hunks)
- resources/analysisTools/indelCallingWorkflow/TiN_CanopyBasedClustering.R (7 hunks)
- resources/analysisTools/indelCallingWorkflow/checkSampleSwap_TiN.pl (10 hunks)
- resources/analysisTools/indelCallingWorkflow/checkSampleSwap_TiN.sh (1 hunks)
- resources/analysisTools/indelCallingWorkflow/confidenceAnnotation_Indels.py (12 hunks)
- resources/analysisTools/indelCallingWorkflow/environments/tbi-cluster_checkSampleSwap.sh (1 hunks)
- resources/analysisTools/indelCallingWorkflow/filter_vcf.sh (1 hunks)
- resources/analysisTools/indelCallingWorkflow/indelCalling.sh (2 hunks)
- resources/analysisTools/indelCallingWorkflow/platypusIndelAnnotation.sh (2 hunks)
Files skipped from review due to trivial changes (3)
- buildversion.txt
- resources/analysisTools/indelCallingWorkflow/environments/tbi-cluster_checkSampleSwap.sh
- resources/analysisTools/indelCallingWorkflow/indelCalling.sh
Additional comments: 18
buildinfo.txt (1)
- 1-1: The dependency version for `COWorkflowsBasePlugin` has been updated from 1.4.1 to 1.4.2. This change is consistent with the PR objectives and AI-generated summary.
resources/analysisTools/indelCallingWorkflow/checkSampleSwap_TiN.sh (4)
- 31-34: Variables related to genomic databases have been renamed to reflect the updated database versions. The changes align with the PR objectives to support hg38 alongside hg19.
- 39-39: The `--chr_prefix` variable is set using the `CHR_PREFIX` environment variable. Assuming `CHR_PREFIX` is correctly set in the environment, this change is appropriate.
- 42-42: The `--chrLengthFile` parameter has been updated to use `${CHROM_SIZES_FILE}`. This change is consistent with the PR objectives and AI-generated summary.
- 50-50: Output file parameters have been renamed to reflect their new purpose. Renaming these parameters clarifies their usage in the context of the updated workflow.
resources/analysisTools/indelCallingWorkflow/filter_vcf.sh (1)
- 33-33: The `FILTER_1KGENOMES` condition has been added to the `FILTER_VALUES` construction. This addition is consistent with the PR objectives to remove the ExAC and EVS conditions and add support for 1KGENOMES.
resources/analysisTools/indelCallingWorkflow/platypusIndelAnnotation.sh (2)
- 55-56: Annotations related to the gnomAD WES and WGS databases have been added. This change aligns with the PR objectives to update the annotation pipeline.
- 127-128: New parameters for maximum MAF thresholds have been introduced in the confidence annotation step. The addition of these parameters is consistent with the PR objectives to refine the confidence annotation process.
resources/analysisTools/indelCallingWorkflow/TiN_CanopyBasedClustering.R (5)
- 64-65: The `chr.length` data frame is now read using `read_tsv`, and a new column `shiftLength` is added. This change is consistent with the PR objectives to improve the reading of chromosome length data.
- 127-129: A raw filtering step has been added using `mutate` to create a new column `rawCluster`. This addition is consistent with the PR objectives to introduce a new raw filtering step for cluster analysis.
- 191-196: Plots have been modified to use raw clusters instead of canopy clusters. This change aligns with the PR objectives to adjust plotting functions to utilize raw clusters.
- 230-243: The `rescueInfo` table creation has been refactored to include raw clustering information. This refactoring is consistent with the PR objectives to include raw clustering information in the `rescueInfo` table.
- 287-291: Minor adjustments have been made to file output and joining operations. These adjustments are consistent with the PR objectives to refine the script.
README.md (1)
- 96-110: The changelog in the README.md file provides a clear summary of the major, minor, and patch changes for the version update to 4.0.0. It is important to ensure that all mentioned changes are reflected in the updated code and documentation.
resources/analysisTools/indelCallingWorkflow/checkSampleSwap_TiN.pl (4)

- 21-25: The script parameters have been updated to remove `$biasScript` and add `$chr_prefix`, `$outfile_tindaVCF`, and `$outfile_SJ`. Ensure that all references to the removed parameter have been updated accordingly and that the new parameters are used correctly throughout the script.

  Verification successful: `$biasScript` is no longer used in the `checkSampleSwap_TiN.pl` script. The new parameters `$chr_prefix`, `$outfile_tindaVCF`, and `$outfile_SJ` have been added and are being used correctly within the script.

- 396-416: The script has been updated to count the number of somatic variants. Ensure that the logic for counting and classifying somatic variants is consistent with the updated workflow requirements and that the counts are being used appropriately downstream.

  Scripts executed for the analysis:

  ```bash
  #!/bin/bash
  # Verify that $biasScript is no longer used in the script.
  rg --type perl '$biasScript' resources/analysisTools/indelCallingWorkflow/checkSampleSwap_TiN.pl
  # Verify that the new parameters are used correctly.
  ast-grep --lang perl --pattern $'my ($$$, $chr_prefix, $$$, $outfile_tindaVCF, $$$, $outfile_SJ) = $$$;' resources/analysisTools/indelCallingWorkflow/checkSampleSwap_TiN.pl
  ```

  Length of output: 381

  ```bash
  #!/bin/bash
  # Confirm that $biasScript is not used in the script.
  rg --type perl '\$biasScript' resources/analysisTools/indelCallingWorkflow/checkSampleSwap_TiN.pl
  # Verify that the new parameters $chr_prefix, $outfile_tindaVCF, and $outfile_SJ are used correctly in the script.
  rg --type perl '\$chr_prefix' resources/analysisTools/indelCallingWorkflow/checkSampleSwap_TiN.pl
  rg --type perl '\$outfile_tindaVCF' resources/analysisTools/indelCallingWorkflow/checkSampleSwap_TiN.pl
  rg --type perl '\$outfile_SJ' resources/analysisTools/indelCallingWorkflow/checkSampleSwap_TiN.pl
  ```

  Length of output: 982

  Verification successful: the presence of the patterns `Tumor_Somatic_Common`, `Tumor_Somatic_Rare`, `Control_Somatic_Common`, and `Control_Somatic_Rare` in the `checkSampleSwap_TiN.pl` script is confirmed. This indicates that the script contains logic to count somatic variants based on these categories, which aligns with the provided code snippet.

- 453-453: The script includes commands to remove intermediate files. Ensure that these cleanup operations are necessary and that no data required for downstream processes is being deleted.

  Scripts executed for the analysis:

  ```bash
  #!/bin/bash
  # Verify that the logic for counting somatic variants is consistent with the workflow requirements.
  ast-grep --lang perl --pattern $'if ($$$ =~ /Tumor_Somatic_Common/) { $$$ }' resources/analysisTools/indelCallingWorkflow/checkSampleSwap_TiN.pl
  ast-grep --lang perl --pattern $'if ($$$ =~ /Tumor_Somatic_Rare/) { $$$ }' resources/analysisTools/indelCallingWorkflow/checkSampleSwap_TiN.pl
  ast-grep --lang perl --pattern $'if ($$$ =~ /Control_Somatic_Common/) { $$$ }' resources/analysisTools/indelCallingWorkflow/checkSampleSwap_TiN.pl
  ast-grep --lang perl --pattern $'if ($$$ =~ /Control_Somatic_Rare/) { $$$ }' resources/analysisTools/indelCallingWorkflow/checkSampleSwap_TiN.pl
  ```

  Length of output: 1019

  ```bash
  #!/bin/bash
  # Use ripgrep to verify the logic for counting somatic variants in the Perl script.
  rg 'if.*Tumor_Somatic_Common.*\{.*\}' resources/analysisTools/indelCallingWorkflow/checkSampleSwap_TiN.pl
  rg 'if.*Tumor_Somatic_Rare.*\{.*\}' resources/analysisTools/indelCallingWorkflow/checkSampleSwap_TiN.pl
  rg 'if.*Control_Somatic_Common.*\{.*\}' resources/analysisTools/indelCallingWorkflow/checkSampleSwap_TiN.pl
  rg 'if.*Control_Somatic_Rare.*\{.*\}' resources/analysisTools/indelCallingWorkflow/checkSampleSwap_TiN.pl
  ```

  Length of output: 431

  ```bash
  #!/bin/bash
  # Use ripgrep with a more lenient pattern to verify the logic for counting somatic variants in the Perl script.
  rg 'Tumor_Somatic_Common' resources/analysisTools/indelCallingWorkflow/checkSampleSwap_TiN.pl
  rg 'Tumor_Somatic_Rare' resources/analysisTools/indelCallingWorkflow/checkSampleSwap_TiN.pl
  rg 'Control_Somatic_Common' resources/analysisTools/indelCallingWorkflow/checkSampleSwap_TiN.pl
  rg 'Control_Somatic_Rare' resources/analysisTools/indelCallingWorkflow/checkSampleSwap_TiN.pl
  ```

  Length of output: 541

  Verification successful: the filenames constructed with the `snvs_.*\.GTfiltered_gnomAD\.SomaticIn\.vcf` and `snvs_.*\.GTfiltered_gnomAD\.Germline\.Rare\.txt` patterns do not appear elsewhere in the codebase, except for their definitions in `checkSampleSwap_TiN.pl`. This suggests that these files are not referenced in other parts of the codebase and are likely safe to delete as part of cleanup operations. However, the pattern `snvs_.*\.GTfiltered_gnomAD\.Germline\.Rare\.vcf` is found in the `checkSampleSwap_TiN.pl` script, which is the same file where the deletion command is issued. This is expected, as it is the definition of the variable that holds the filename. Since no other references to these files were found in the codebase, it can be inferred that the cleanup operation in the `checkSampleSwap_TiN.pl` script does not remove data required for downstream processes within the scope of the codebase we have access to.

- 472-486: 📝 NOTE: This review was outside the diff hunks and was mapped to the diff hunk with the greatest overlap. Original lines [456-486]

  Scripts executed for the analysis:

  ```bash
  #!/bin/bash
  # Verify that the removed files are not required for downstream processes.
  rg --type perl '$snvsGT_RawFile' resources/analysisTools/indelCallingWorkflow/checkSampleSwap_TiN.pl
  rg --type perl '$snvsGT_gnomADFile' resources/analysisTools/indelCallingWorkflow/checkSampleSwap_TiN.pl
  ```

  Length of output: 208

  ```bash
  #!/bin/bash
  # Find where the variables $snvsGT_somatic, $snvsGT_germlineRare, and $snvsGT_germlineRare_txt are assigned values
  rg --type perl '\$snvsGT_somatic\s*=' resources/analysisTools/indelCallingWorkflow/checkSampleSwap_TiN.pl
  rg --type perl '\$snvsGT_germlineRare\s*=' resources/analysisTools/indelCallingWorkflow/checkSampleSwap_TiN.pl
  rg --type perl '\$snvsGT_germlineRare_txt\s*=' resources/analysisTools/indelCallingWorkflow/checkSampleSwap_TiN.pl
  # Assuming the above commands give us the filenames, we will then search for these filenames across the codebase.
  # Placeholder for searching the actual filenames across the codebase:
  # rg 'actual_filename_of_$snvsGT_somatic'
  # rg 'actual_filename_of_$snvsGT_germlineRare'
  # rg 'actual_filename_of_$snvsGT_germlineRare_txt'
  ```

  Length of output: 642

  ```bash
  #!/bin/bash
  # Search the entire codebase for references to the filenames identified.
  rg 'snvs_.*\.GTfiltered_gnomAD\.SomaticIn\.vcf'
  rg 'snvs_.*\.GTfiltered_gnomAD\.Germline\.Rare\.vcf'
  rg 'snvs_.*\.GTfiltered_gnomAD\.Germline\.Rare\.txt'
  ```

  Length of output: 668
The subroutine `parse_AF` is introduced to parse allele frequencies from annotation matches. Ensure that this subroutine is robust and handles all expected input formats correctly. Additionally, the `median` function is added; verify that it is used correctly and that its results are handled appropriately.
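For illustration, a median that handles the edge cases raised here (empty input, a single element) might look like this; a Python sketch, not the Perl implementation in the script:

```python
def median(values):
    # Return the median of a list of numbers, or None for an empty list.
    if not values:
        return None
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    # Even-length list: average the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2.0
```

Returning a sentinel (here `None`) for empty input forces the caller to decide what a missing median means, instead of crashing mid-run.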
resources/analysisTools/indelCallingWorkflow/confidenceAnnotation_Indels.py (resolved, outdated)
resources/analysisTools/indelCallingWorkflow/confidenceAnnotation_Indels.py (resolved, outdated)
resources/analysisTools/indelCallingWorkflow/confidenceAnnotation_Indels.py (resolved, outdated)
resources/analysisTools/indelCallingWorkflow/confidenceAnnotation_Indels.py (resolved, outdated)
resources/analysisTools/indelCallingWorkflow/confidenceAnnotation_Indels.py (resolved, outdated)
resources/analysisTools/indelCallingWorkflow/confidenceAnnotation_Indels.py (resolved)
resources/analysisTools/indelCallingWorkflow/checkSampleSwap_TiN.pl (resolved, outdated)
Review Status
Actionable comments generated: 3
Configuration used: CodeRabbit UI
Files selected for processing (2)
- resources/analysisTools/indelCallingWorkflow/checkSampleSwap_TiN.pl (10 hunks)
- resources/analysisTools/indelCallingWorkflow/confidenceAnnotation_Indels.py (13 hunks)
Additional comments: 16
resources/analysisTools/indelCallingWorkflow/checkSampleSwap_TiN.pl (7)
- 21-25: The parameters `$biasScript`, `$outfile_RG`, `$outfile_SR`, and `$outfile_AS` have been removed, and new parameters `$chr_prefix`, `$outfile_tindaVCF`, and `$outfile_SJ` have been added. Ensure that the removal of parameters does not affect any downstream processes and that the new parameters are properly integrated and used within the script.
- 453-463: 📝 NOTE: This review was outside the diff hunks and was mapped to the diff hunk with the greatest overlap. Original lines [460-479]. A new subroutine `parse_AF` has been added to parse allele frequency from annotation lines. Ensure that this subroutine is robust against different formats of allele frequency data that may be encountered in the annotation lines.
- 480-490: A `median` function has been added to calculate the median allele frequency. Verify that this function is called with the correct data and that it handles edge cases properly, such as when the list of values is empty or contains only one element.
- 193-193: The filtering logic has been updated to include a new chromosome prefix variable `$chr_prefix`. Confirm that this variable is being set correctly and that the updated regex pattern correctly matches the intended chromosome names.
- 244-248: The annotation command has been updated to include new databases for annotation and remove some previous ones. Ensure that the new annotation sources are correctly formatted and accessible, and that the removal of the old sources does not negatively impact the analysis.
- 432-445: The script calculates percentages for various metrics and stores them in a JSON file. Verify that the calculations are correct and that division by zero is handled appropriately.
- 457-457: File cleanup operations have been updated to remove additional files. Confirm that these files are no longer needed and that their removal does not affect any subsequent steps in the workflow.
resources/analysisTools/indelCallingWorkflow/confidenceAnnotation_Indels.py (9)
- 55-67: The `get_fixed_headers` function has been modified to conditionally include headers based on the `args.refgenome` and `args.no_control` parameters. This is a good use of conditional logic to adapt the headers to the context of the analysis.
- 70-87: The `get_variable_headers` function uses a dictionary to map header names to regular expressions. This is a maintainable approach that allows for easy updates to header names.
- 89-98: The `check_max_maf` function is a new addition that checks whether any allele frequency in a given column exceeds a maximum threshold. This is a critical function for filtering variants based on frequency data and a good example of encapsulating logic into a reusable component.
- 163-163: The addition of a new filter `FREQ` in the VCF header is consistent with the PR objectives to introduce new filtering criteria based on allele frequency data.
- 179-180: The variables `fixed_headers` and `variable_headers` are set using the newly added functions. This is a good example of using functions to modularize code and make it more readable.
- 283-299: 📝 NOTE: This review was outside the diff hunks and was mapped to the diff hunk with the greatest overlap. Original lines [257-296]. The logic for setting flags based on various databases and allele frequencies has been refactored. This is a significant change that aligns with the PR objectives to adjust the annotation based on the MAF filter value from the XML.
- 357-385: 📝 NOTE: This review was outside the diff hunks and was mapped to the diff hunk with the greatest overlap. Original lines [313-368]. The confidence scoring logic has been modified, and while previous review comments suggested refactoring this complex logic, the developer has chosen not to do so in this PR. The changes made here seem to be in line with the new allele frequency filtering logic.
- 478-536: The region confidence scoring logic has been updated to include new conditions. This is a complex section of code that could benefit from further refactoring, but it appears to be correctly implemented based on the PR objectives.
- 641-657: New command-line arguments have been added to control the maximum allele frequencies for different databases and the option to run low-MAF punishment. This is consistent with the PR objectives and provides users with more control over the filtering process.
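As a rough illustration of what such a maximum-MAF check involves (assumed behavior; the actual `check_max_maf` signature and the `help` dictionary layout in the script may differ):

```python
def exceeds_max_maf(af_field, max_maf):
    # af_field may carry several separated allele frequencies for one
    # variant, e.g. "0.01&0.2"; flag the variant if any exceeds max_maf.
    # '.' denotes a missing annotation and never triggers the filter.
    tokens = [t for t in af_field.replace("&", ",").split(",")
              if t not in (".", "")]
    return any(float(t) > max_maf for t in tokens)
```

The `&`/`,` separators and the `.` missing-value convention are assumptions modeled on the VCF annotation style visible elsewhere in this PR.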
resources/analysisTools/indelCallingWorkflow/confidenceAnnotation_Indels.py (resolved, outdated)
resources/analysisTools/indelCallingWorkflow/confidenceAnnotation_Indels.py (resolved, outdated)
Review Status
Actionable comments generated: 2
Configuration used: CodeRabbit UI
Files selected for processing (1)
- resources/analysisTools/indelCallingWorkflow/confidenceAnnotation_Indels.py (13 hunks)
Additional comments: 6
resources/analysisTools/indelCallingWorkflow/confidenceAnnotation_Indels.py (6)
- 55-67: The `get_fixed_headers` function dynamically adjusts headers based on the reference genome and the presence of control data. This is a good practice for maintainability and flexibility. However, ensure that all potential headers are accounted for in downstream processing.
- 70-87: The `get_variable_headers` function effectively manages variable headers based on conditions such as the reference genome used and the presence of control data. This modular approach enhances code readability and maintainability.
- 89-98: The `check_max_maf` function correctly identifies whether any allele frequency exceeds the maximum allowed. This is crucial for filtering variants based on population frequency data. Ensure that the `help` dictionary structure and `column_valid_key` usage are consistent throughout the script.
- 45-103: 📝 NOTE: This review was outside the diff hunks and was mapped to the diff hunk with the greatest overlap. Original lines [100-114]. The header generation in the `main` function is well structured, incorporating both fixed and additional headers. This approach allows for flexible header management. Ensure that all headers are correctly documented and used consistently in downstream analyses.
- 179-180: The retrieval of fixed and variable headers using the `get_fixed_headers` and `get_variable_headers` functions is correctly implemented. This ensures that the script dynamically adjusts to the input data's specifics, such as the presence of control samples and the reference genome used.
- 639-655: The argument parsing section is comprehensive, providing a wide range of options for customizing the script's behavior. Ensure that the help messages are clear and informative for all users, particularly for complex options like `--hetcontr` and `--homcontr`.
```python
## Helper functions
def is_hg37(args):
    return args.refgenome[0] == "hs37d5"

def is_hg38(args):
    return args.refgenome[0] == "GRCh38"

def get_fixed_headers(args):
    fixed_headers = ["^QUAL$", "^INFO$", "^FILTER$", "MAPABILITY", "SIMPLE_TANDEMREPEATS",
                     "REPEAT_MASKER", "^CONFIDENCE$", "^CLASSIFICATION$", "^REGION_CONFIDENCE$",
                     "^PENALTIES$", "^REASONS$"]

    hs37d5_headers = ["DAC_BLACKLIST", "DUKE_EXCLUDED", "HISEQDEPTH", "SELFCHAIN"]
    if is_hg37(args):
        fixed_headers += hs37d5_headers

    if not args.no_control:
        fixed_headers += ["^INFO_control", "^ANNOTATION_control$"]

    return fixed_headers
```
The implementation of `is_hg37` and `is_hg38` relies on hard-coded reference genome names. This approach is not scalable or flexible for future reference genomes. Consider adopting a more dynamic method for genome version checks.

```diff
-    return args.refgenome[0] == "hs37d5"
+    return args.refgenome[0].startswith("hs37")
-    return args.refgenome[0] == "GRCh38"
+    return args.refgenome[0].startswith("GRCh38")
```

Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation.
```python
## Helper functions
def is_hg37(args):
    return args.refgenome[0].startswith("hs37")

def is_hg38(args):
    return args.refgenome[0].startswith("GRCh38")

def get_fixed_headers(args):
    fixed_headers = ["^QUAL$", "^INFO$", "^FILTER$", "MAPABILITY", "SIMPLE_TANDEMREPEATS",
                     "REPEAT_MASKER", "^CONFIDENCE$", "^CLASSIFICATION$", "^REGION_CONFIDENCE$",
                     "^PENALTIES$", "^REASONS$"]

    hs37d5_headers = ["DAC_BLACKLIST", "DUKE_EXCLUDED", "HISEQDEPTH", "SELFCHAIN"]
    if is_hg37(args):
        fixed_headers += hs37d5_headers

    if not args.no_control:
        fixed_headers += ["^INFO_control", "^ANNOTATION_control$"]

    return fixed_headers
```
region_conf -= 1 | ||
reasons += "SelfchainAndOrSegdup(-1)" | ||
|
||
if any(word in help["REPEAT_MASKER"] for word in ["Simple_repeat", "Low_", "Satellite", ]) or help["SIMPLE_TANDEMREPEATS_VALID"]: | ||
region_conf -= 2 | ||
reasons += "Repeat(-2)" | ||
# other repeat elements to penalize at least a bit | ||
elif help["REPEAT_MASKER_VALID"]: | ||
    ## DUKE, DAC, Hiseq, Self chain are only available for hg19 reference genome
    if args.refgenome[0] == "hs37d5" and not args.skipREMAP:
        if help["DUKE_EXCLUDED_VALID"] or help["DAC_BLACKLIST_VALID"] or help["HISEQDEPTH_VALID"]:
            region_conf -= 3 # really bad region, usually centromeric repeats
            reasons += "Blacklist(-3)"

        if help["ANNOVAR_SEGDUP_COL_VALID"]:
            region_conf -= 1
            reasons += "Segdup(-1)"

        if any(word in help["REPEAT_MASKER"] for word in ["Simple_repeat", "Low_", "Satellite", ]) or help["SIMPLE_TANDEMREPEATS_VALID"]:
            region_conf -= 2
            reasons += "Repeat(-2)"
        # other repeat elements to penalize at least a bit
        elif help["REPEAT_MASKER_VALID"]:
            region_conf -= 1
            reasons += "Other_repeat(-1)"

        # Mapability is 1 for unique regions, 0.5 for regions appearing twice, 0.33... 3 times, ...
        # Everything with really high number of occurences is artefacts
        # does not always correlate with the above regions
        # is overestimating badness bc. of _single_ end read simulations
        if help["MAPABILITY"] == ".":
            # in very rare cases (CEN), there is no mapability => ".", which is not numeric but interpreted as 0
            region_conf -= 5
            reasons += "Not_mappable(-5)"
        else:
            reduce = 0
            # can have several entries for indels, e.g. 0.5&0.25 - take worst (lowest) or best (highest)?!
            mapability = min(map(float, help["MAPABILITY"].split("&"))) # just chose one - here: min
            if mapability < 0.5:
                # 0.5 does not seem to be that bad: region appears another time in
                # the genome and we have paired end data!
                region_conf -= 1
                reduce += 1

            #if mapability < 0.4: # 3-4 times appearing region is worse but still not too bad
            #    region_conf -= 1
            #    reduce += 1

            if mapability < 0.25: # > 4 times appearing region
                region_conf -= 1
                reduce += 1

            if mapability < 0.1: # > 5 times is bad
                region_conf -= 2
                reduce += 2

            if mapability < 0.05: # these regions are clearly very bad (Lego stacks)
                region_conf -= 3
                reduce += 3

            reasons += "Low_mappability(%s=>-%d)"%(help["MAPABILITY"], reduce)

    if classification != "somatic" and not "PASS" in help["FILTER"]:
        # such an indel is probably also "unclear" and not "germline"
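The mapability ladder in this hunk accumulates penalties across thresholds. A condensed, testable sketch of just that ladder (the function wrapper and its name are invented here for illustration; the field format and thresholds are taken from the code):

```python
def mapability_penalty(mapability_field):
    """Return (penalty, reason) for a MAPABILITY annotation value.

    Mapability is 1.0 for unique regions, 0.5 for regions appearing twice,
    0.33... for three occurrences, and so on. Indels may carry several
    values joined by '&' (e.g. "0.5&0.25"); the worst (lowest) one is used.
    """
    if mapability_field == ".":
        # no mapability available (e.g. centromeres) -> heavy penalty
        return 5, "Not_mappable(-5)"
    mapability = min(map(float, mapability_field.split("&")))
    penalty = 0
    if mapability < 0.5:    # region appears at least one more time in the genome
        penalty += 1
    if mapability < 0.25:   # > 4 times appearing region
        penalty += 1
    if mapability < 0.1:    # > 5 times is bad
        penalty += 2
    if mapability < 0.05:   # clearly very bad regions ("Lego stacks")
        penalty += 3
    if penalty:
        return penalty, "Low_mappability(%s=>-%d)" % (mapability_field, penalty)
    return 0, ""
```

For example, "0.5&0.25" resolves to 0.25 and costs 1 point, while "0.04" trips all four thresholds for a total of 7.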
📝 NOTE: This review was outside the diff hunks and was mapped to the diff hunk with the greatest overlap. Original lines [257-534]
The variant classification and confidence scoring logic is extensive and handles a wide range of conditions. While the logic appears correct, the extensive use of magic numbers and complex conditional structures makes this section hard to maintain. Consider refactoring for clarity and maintainability, possibly by breaking it down into smaller functions or by using a more descriptive approach for the magic numbers:

- confidence -= 1
+ confidence -= CONFIDENCE_DECREMENT_FOR_CONDITION
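As a sketch of what that suggestion might look like in practice — the constant names and the helper below are invented for illustration and do not exist in the workflow:

```python
# Hypothetical named penalties replacing in-line magic numbers.
PENALTY_BLACKLIST = 3  # DUKE/DAC/HiSeq-depth blacklisted regions
PENALTY_SEGDUP = 1     # segmental duplications

def region_penalty(annotations):
    """Score a variant's region annotations; returns (penalty, reasons)."""
    penalty, reasons = 0, []
    if (annotations["DUKE_EXCLUDED_VALID"]
            or annotations["DAC_BLACKLIST_VALID"]
            or annotations["HISEQDEPTH_VALID"]):
        penalty += PENALTY_BLACKLIST
        reasons.append("Blacklist(-%d)" % PENALTY_BLACKLIST)
    if annotations["ANNOVAR_SEGDUP_COL_VALID"]:
        penalty += PENALTY_SEGDUP
        reasons.append("Segdup(-%d)" % PENALTY_SEGDUP)
    return penalty, "".join(reasons)
```

Each penalty then has one authoritative definition, and the scoring branches become individually testable.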
Not working on this logic in this PR.
@NagaComBio Understood, thank you for the clarification. If there's anything else you need assistance with, feel free to ask!
Review Status
Actionable comments generated: 1
Configuration used: CodeRabbit UI
Files selected for processing (1)
- resources/analysisTools/indelCallingWorkflow/confidenceAnnotation_Indels.py (13 hunks)
Additional comments: 3
resources/analysisTools/indelCallingWorkflow/confidenceAnnotation_Indels.py (3)
- 48-70: The addition of the helper functions is_hg37, is_hg38, get_fixed_headers, and get_variable_headers improves modularity and readability by encapsulating the logic for determining the reference genome and for generating header lists based on conditions. This approach aligns well with best practices for maintainability and code clarity.
- 91-100: The check_max_maf function introduces a streamlined way to filter variants based on allele frequency, enhancing the script's flexibility in handling different databases and their respective maximum allele frequency thresholds. This function is a valuable addition for maintaining the accuracy and specificity of variant annotation.
- 475-539: 📝 NOTE: This review was outside the diff hunks and was mapped to the diff hunk with the greatest overlap. Original lines [102-658]
  The main function has been significantly updated to incorporate new command-line arguments for controlling maximum allele frequencies and to utilize the newly added helper functions for conditional logic based on the reference genome and control-data presence. These changes enhance the script's functionality and adaptability to different datasets and analysis requirements. However, ensure that all new command-line arguments are documented and validated to prevent potential user errors.
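A MAF cutoff check in the spirit of the check_max_maf function described above could be sketched as follows; the signature and the '&'-joined field format are assumptions for illustration, not the script's actual code:

```python
def below_max_maf(annotation, max_maf):
    """True if every minor-allele-frequency value in an annotation column
    (e.g. "0.0012&0.03"; "." means no entry) is at or below max_maf."""
    if annotation in (".", ""):
        return True  # no population frequency recorded: keep the variant
    freqs = [float(x) for x in annotation.split("&") if x not in (".", "")]
    return all(f <= max_maf for f in freqs)
```

A variant could then only count as rare when all reported database frequencies fall at or below the configured cutoff.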
📝 NOTE
This review was outside the diff hunks and was mapped to the diff hunk with the greatest overlap. Original lines [48-658]
While the script has been enhanced with new functionalities and improved modularity through the addition of helper functions, it's important to continuously evaluate the complexity of the variant annotation logic. Consider future refactoring opportunities to further simplify and decompose complex logic into smaller, more manageable functions. This will aid in maintainability and ease of understanding for new contributors to the project.
Overview
Merging the hg38 indel calling branch into the main hg19 branch. Hg19 somatic indel analysis results will remain the same as with version 2.6.0. There are some changes in the sample swap check part that will affect both hg19 and hg38.
Main
Minor
- SNP_support_germline is based on the MAF filter value provided in the XML.
Integration test
Summary by CodeRabbit
New Features
- New command-line options --skipREMAP and runlowmaf.
Enhancements
Bug Fixes
Refactor
- Updated COWorkflowsBasePlugin to version 1.4.2.
Documentation