Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pre-qc V1 Refactor #11

Merged
merged 18 commits into from
Oct 16, 2023
Merged

Pre-qc V1 Refactor #11

merged 18 commits into from
Oct 16, 2023

Conversation

amadeovezz
Copy link
Collaborator

@amadeovezz amadeovezz commented Aug 4, 2023

What

The original intention of this PR is to refactor the param parsing at the beginning of the workflow (pre-qc). However in order to do this - the scope of this work has been extended to a general refactor of the entire pipeline. The goal here is to ultimately improve readability and extensibility of this pipeline. However, before we can tackle extensibility, it is often helpful to abstract away and group together complicated functionality. This is the central focus of this PR.

TODO

  • Abstract away all param parsing and set-up into a seperate module
  • Abstract away all utility functions
  • Reduce complicated channel manipulations
  • Improve readability of conditionals
  • Couple param logic to specific processes such that it is easier to understand inputs
  • Modify the JSON format, such that parsing params is easier and more readable
  • Regression test pipeline with code paths below

CODE PATHS

Note: demuxlet punted to future PR's

  • demultiplex method: demux, merge files: true, run doublet finder: true (ERRORS - see below)
  • demultiplex method: demux, merge files: true, run doublet finder: false
  • demultiplex method: demux, merge files: false, run doublet finder: true (ERRORS - see below)
  • demultiplex method: demux, merge files: false, run doublet finder: false
  • demultiplex method: freemuxlet, merge files: true, run doublet finder: true
  • demultiplex method: freemuxlet, merge files: true, run doublet finder: false
  • demultiplex method: freemuxlet, merge files: false, run doublet finder: true
  • demultiplex method: freemuxlet, merge files: false, run doublet finder: false

REGRESSION testing

  • demultiplex method: freemuxlet, merge files: true, run doublet finder: true
  • demultiplex method: freemuxlet, merge files: true, run doublet finder: false
  • demultiplex method: freemuxlet, merge files: false, run doublet finder: true
  • demultiplex method: freemuxlet, merge files: false, run doublet finder: false

DEMUXLET ERRORS

Pre-refactor I am getting these errors when doublet_finder = true.

ERROR ~ Error executing process > 'FIND_DOUBLETS (1)'

Caused by:
  Process `FIND_DOUBLETS (1)` terminated with an error exit status (1)

Command executed:

  Rscript /c4/home/amazzara/data_processing_pipelines/single_cell_RNAseq/bin/find_doublets.R raw_feature_bc_matrix.h5 TEST-POOL-DM2-SCG1.clust1.samples.reduced.tsv TEST-POOL-DM2-SCG1 100 100 3 21212 /c4/home/amazzara/data_processing_pipelines/single_cell_RNAseq

Command exit status:
  1

Command output:
  [2023-09-19 12:32:07] Dimension of the GEX count data:11456 x 99
  [1] "orig.ident"   "nCount_RNA"   "nFeature_RNA" "DROPLET.TYPE" "BEST.GUESS"  

Command error:
  INFO:    Environment variable SINGULARITYENV_TMPDIR is set, but APPTAINERENV_TMPDIR is preferred
  The legacy packages maptools, rgdal, and rgeos, underpinning this package
  will retire shortly. Please refer to R-spatial evolution reports on
  https://r-spatial.org/r/2023/05/15/evolution4.html for details.
  This package is now running under evolution status 0 
  Attaching SeuratObject
  -- Attaching core tidyverse packages ------------------------ tidyverse 2.0.0 --
  v dplyr     1.1.2     v readr     2.1.4
  v forcats   1.0.0     v stringr   1.5.0
  v ggplot2   3.4.2     v tibble    3.2.1
  v lubridate 1.9.2     v tidyr     1.3.0
  v purrr     1.0.1     
  -- Conflicts ------------------------------------------ tidyverse_conflicts() --
  x dplyr::filter() masks stats::filter()
  x dplyr::lag()    masks stats::lag()
  i Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
  Genome matrix has multiple modalities, returning a list of matrices for this genome
  Error in sngObj@misc$scStat$fmlDropletTypeProp["DBL", ] : 
    subscript out of bounds
  Calls: runDoubletFinder
  Execution halted

Work dir:
  /c4/scratch/amazzara/nextflow/e8/eb21b3e1d752e13184359e518da9c0

Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`

 -- Check '.nextflow.log' file for details

To scope down this PR, we will punt this to the future.

FUTURE TODO's

  • Get proper VCF for demuxlet
  • Fix demultiplex errors
  • Add more data types for testing
  • Think about test strategies
  • Figure out a good way to abstract away the remaining channel manipulations
  • Consolidate shell scripts
  • Go through inputs and outputs of each process and ensure they are all used (sometimes inputs are passed in for symlink purposes)
  • Fix hardcoded paths. Ideally we just want to point the script to some input files.

BONUS

  • Add gitignore
  • Improve README.md
  • Finish pipeline diagram (pending PRO LucidCharts subscription)

What testing looks like

Screenshot 2023-10-03 at 4 05 57 PM

@erflynn
Copy link
Collaborator

erflynn commented Aug 9, 2023

this is looking great - thank you!

@amadeovezz
Copy link
Collaborator Author

most of the work is actually in the pre_qc file which I'm going to push later today :)

@amadeovezz amadeovezz changed the title WIP: Clean up params WIP: V1 Refactor Sep 11, 2023
@amadeovezz amadeovezz changed the title WIP: V1 Refactor WIP: Pre-qc V1 Refactor Sep 12, 2023
…ple items to a channel (for SEPARATE_FMX). We also need to be explicit about the values we add to SEPARATE_FMX()
@amadeovezz
Copy link
Collaborator Author

@erflynn the refactor in terms of code is complete - all other improvements will be punted to future PR's (this PR has already creeped in scope). Feel free to add any comments if you like :)

Tomorrow I'll be doing some regression testing as a final check (this will not change the structure of the code).

@amadeovezz amadeovezz changed the title WIP: Pre-qc V1 Refactor Pre-qc V1 Refactor Oct 3, 2023
@amadeovezz amadeovezz requested a review from erflynn October 3, 2023 19:44
// We de-multiplex if we have merged libraries
ch_sample_map = Channel.empty()

if ( params.settings.demux_method == "freemuxlet"){
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there was a small bug that I caught via my tests that made this a bit more bloated then I would like. But def would like to simplify this logic and make it more readable.

@@ -441,10 +441,29 @@ process UNMERGE_FMX {
process SEPARATE_FMX {
publishDir "${params.project_dir}/data/single_cell_GEX/processed/${library}/freemuxlet", mode: 'copy', pattern: "${library}*"
publishDir "${params.project_dir}/data/single_cell_GEX/logs/${library}/", mode: 'copy', pattern: ".command.log", saveAs: { filename -> "separate_fmx.log" }
input:
Copy link
Collaborator Author

@amadeovezz amadeovezz Oct 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this visual diff is strangely represented here. But all I've done is added a copied the original SEPARATE_FMX and added a specific version for the pre-qc pipeline. I didn't want to modify the original SEPARATE_FMX since it is used by other pipelines and haven't tested it yet.

@erflynn
Copy link
Collaborator

erflynn commented Oct 6, 2023

This looks fantastic! Awesome work.

Really excited about the refactor and addition of tests, also the README updates are very helpful. I'll be trying this when I'm back, and can also help set up some test cases then.

I think good to merge, but worth flagging this to folks in the lab who might be using it that there is a major update in this version for "pre_qc" step.

@amadeovezz amadeovezz merged commit d2a97ac into main Oct 16, 2023
@amadeovezz amadeovezz deleted the clean-up-params branch October 16, 2023 18:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants