Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Removed pairwise batch effect checks from WDL workflow #617

Open
wants to merge 12 commits into
base: main
Choose a base branch
from

Conversation

shadizaheri
Copy link
Contributor

@shadizaheri shadizaheri commented Nov 9, 2023

Description:

In this PR, I made significant changes to the WDL workflow responsible for analyzing batch effects in genomic data. Our primary objective was to simplify and optimize the pipeline by removing the pairwise batch effect checks.

Changes Made:

  1. Removed Pairwise Batch Effect Checks: Previously, the workflow considered two types of batch effects: one that compared a single batch against all other batches ("1 vs. all") and another that performed pairwise comparisons of each batch against every other batch. We decided to focus solely on the "1 vs. all" checks and remove the pairwise comparisons to streamline the analysis.
  2. Cleaned Up Redundant Tasks: As part of this update, we also removed tasks and scatter operations related specifically to the pairwise checks, further simplifying the code.

A high-level overview of the changes I implemented, highlighting the differences between this version and eph_turn_off_unstable_af_filter branch:

  • I removed MakeBatchPairsList Task: This task generates a list of batch pairs for pairwise comparison. Since I don't want pairwise comparisons, I removed this task.

  • I removed the Scatter Blocks for Pairwise Comparisons: The scatter block that calls helper.check_batch_effects for each pair in batch_pairs were removed.

  • I removed MergeVariantFailureLists Call for Pairwise Checks: The call to MergeVariantFailureLists as merge_pairwise_checks that collects results from pairwise batch effect detection was removed.

  • I adjusted the MakeReclassificationTable Task: This task takes inputs from both pairwise and one-vs-all checks. Since I removed the pairwise checks, I also removed the pairwise_fails input and any related logic inside the task that deals with pairwise comparison results.

  • I removed any references to Pairwise outputs: Any output or logic that depends on the results of the pairwise comparisons was removed.

  • I modified the make_batch_effect_reclassification_table.PCRMinus_only.R script to reflect these changes.

Rationale:

Our rationale for these changes was multifaceted:

  • Efficiency: Pairwise checks can be computationally intensive, especially when dealing with a large number of batches. By focusing on the "1 vs. all" approach, we can get a broader view of batch effects without the overhead of numerous pairwise comparisons.
  • Simplicity: Reducing the complexity of the workflow makes it easier to understand, maintain, and troubleshoot.
  • Focus on Broad Effects: The "1 vs. all" checks provide a holistic view of how a particular batch compares to the general trend across all batches. This approach can highlight more substantial, systemic batch effects rather than the nuanced differences between individual batches.

By implementing these changes, we aim to provide a more streamlined, efficient, and intuitive workflow for analyzing batch effects.

Tests
I have tested on the recent updates using the batch list and datasets located in the Phase 1 workspace. For those who have the phase 1 AoU permissions, the test results can be viewed at the following Job Manager Results.

@shadizaheri shadizaheri marked this pull request as ready for review December 5, 2023 00:03
Copy link
Collaborator

@epiercehoffman epiercehoffman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this, Shadi!

My big question is: Is there ever a situation where we would still want to perform pairwise comparisons? I know we aren't able to at the scale of AoU, but are there smaller projects where we would want to retain this capability? I don't know the answer, but if so, then it would be best to be able to toggle those comparisons on/off instead of removing them entirely. We could still use this version for AoU but I want to get this question answered before merging into main.

I've made some style comments. Also make sure to delete src/sv-pipeline/scripts/downstream_analysis_and_filtering/.Rhistory - looks like that is an extra file that snuck in unintentionally.

--info ALL \
--no-samples \
~{vcf} "~{prefix}.vcf2bed.bed"
--info ALL \
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I appreciate you taking this opportunity to clean up some of the WDL indentation. However, the indentation in these bash sections is for readability - ie. these lines are indented because they are a continuation of the svtk command, and lines 252-257 are indented further because they are a continuation of the command substitution. Let's keep the bash indentation as it was to help future readers

@@ -118,13 +91,13 @@ categorize.failures <- function(dat,pairwise.cutoff,onevsall.cutoff){
###Read command-line arguments
args <- commandArgs(trailingOnly=T)
freq.table.in <- as.character(args[1])
pairwise.in <- as.character(args[2])
#pairwise.in <- as.character(args[2])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we want to fully remove pairwise checks it would be cleaner to delete these lines rather than commenting them out

merged <- merge(pairwise.fails,onevsall.fails,all=T,sort=F,by="VID")
if(nrow(merged) > 0){
merged[,-1] <- apply(merged[,-1],2,function(vals){
#merged <- merge(onevsall.fails,all=T,sort=F,by="VID")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
#merged <- merge(onevsall.fails,all=T,sort=F,by="VID")

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants