Skip MarkDuplicates when UMIs are used #891

Closed
berguner opened this issue Oct 28, 2022 · 5 comments

berguner commented Oct 28, 2022

Description of feature

Hi,
I would suggest disabling Picard MarkDuplicates when UMIs are used for deduplication. For example, --skip_markduplicates could be enabled by default whenever --with_umi is also enabled.

drpatelh added this to the 3.10 milestone Dec 12, 2022
@drpatelh
Member

@MatthiasZepper what are your thoughts on this? The easiest option is to never run MarkDuplicates when --with_umi is specified. You were running MarkDuplicates with UMIs to do some investigation into technical vs biological duplication? That wouldn't be possible anymore with the easy fix.

Member

MatthiasZepper commented Dec 19, 2022

I indeed ran some comparative analyses, but mostly by manually writing bash scripts, because I didn't trust myself to really understand all the minute details of the pipeline. I get why the channel name ch_genome_bam is reused throughout the pipeline (so processes can be skipped easily), but it always gives me a headache to retrace the sequential connection of processes.

Since both MARK_DUPLICATES_PICARD and DEDUP_UMI_UMITOOLS_GENOME use ch_genome_bam as input and both write their output to it, I repeatedly asked myself whether they are executed one after the other or whether they run in parallel as soon as the aligner's output is present. Furthermore, I am still unsure "which version" of ch_genome_bam the processes BEDTOOLS_GENOMECOV, PRESEQ_LCEXTRAP and STRINGTIE_STRINGTIE use.

Biologically, there is indeed little use in running MarkDuplicates after umi-tools dedup, although the two tools differ slightly in their strategies. Lacking UMI information, MarkDuplicates can't differentiate between biological and technical duplication, but it can specifically spot optical duplicates by their position when provided with appropriate parameters for the flow cell type and instrument via ext.args.

In summary: weak agreement. I do not really see a use case for running both tools, and agree that most users intuitively expect it to be an either-or scenario between the two. On the other hand, putting that additional entry in the params file is also straightforward.

@drpatelh
Member

> I get why the channel name ch_genome_bam is reused throughout the pipeline (so processes can be skipped easily)

Yes, if we had explicit names for all of these channels, it could get equally complicated to trace back the original input/output channels. Reusing the name lets us add/remove additional aligners or other processes with minimal effort. It's a double-edged sword, but I see where you are coming from.
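The trade-off can be illustrated with a toy Python model of one name being rebound by successive steps. This is only a sketch: the step functions, the consumer labels and the order in which the steps run are assumptions made for the example, not the pipeline's actual wiring.

```python
def dedup_umi(bam):
    # Hypothetical stand-in for a UMI-based deduplication step.
    return bam + ["umi_dedup"]


def mark_duplicates(bam):
    # Hypothetical stand-in for a duplicate-marking step.
    return bam + ["markdup"]


def run(with_umi, skip_markduplicates):
    """Return which 'version' of genome_bam each consumer would see."""
    genome_bam = ["aligned"]  # stand-in for the aligner's output
    seen_by = {}

    if with_umi:
        # Rebinding the name: everything wired below gets this version.
        genome_bam = dedup_umi(genome_bam)

    # A consumer wired here sees whatever the name points to at this line.
    seen_by["early_consumer"] = list(genome_bam)

    if not skip_markduplicates:
        genome_bam = mark_duplicates(genome_bam)

    seen_by["late_consumer"] = list(genome_bam)
    return seen_by
```

In this model, skipping a step needs no rewiring because every consumer reads the same name, but which version a consumer receives depends entirely on where it is wired relative to each rebinding.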

> and agree that most users intuitively expect it to be an either-or scenario between the two

Ok. I will hard-code the option to skip Picard MarkDuplicates if UMIs are present. It can always be run outside of the pipeline if required in edge-case scenarios.
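The rule described here can be written as a small predicate. This is an illustrative Python sketch, not the code from the actual fix; the function name is hypothetical, and its two arguments simply mirror the --with_umi and --skip_markduplicates parameters mentioned in this thread.

```python
def should_run_markduplicates(with_umi: bool, skip_markduplicates: bool = False) -> bool:
    """Return True if the duplicate-marking step should run (hypothetical helper)."""
    # UMIs present: duplicate marking is always skipped, whatever the skip flag says.
    if with_umi:
        return False
    # No UMIs: the explicit skip flag alone decides.
    return not skip_markduplicates
```

With this rule, the with_umi=True case cannot be overridden from the parameters, which is why running both tools would have to happen outside of the pipeline.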

drpatelh added a commit to drpatelh/nf-core-rnaseq that referenced this issue Dec 19, 2022
drpatelh mentioned this issue Dec 19, 2022
@drpatelh
Member

Fixed in #911

drpatelh added a commit that referenced this issue Dec 19, 2022

m3hdad commented Mar 26, 2024

Objection, your honor! dupRadar needs duplicates to be marked as a preprocessing step!

https://nfcore.slack.com/archives/CE8SSJV3N/p1711455688377539?thread_ts=1666890604.450699&cid=CE8SSJV3N
