Skip MarkDuplicates when UMIs are used #891

berguner · 2022-10-28T04:44:37Z

Description of feature

Hi,
I would suggest disabling Picard MarkDuplicates when UMIs are used for deduplication. For example, --skip_markduplicates could be enabled by default whenever --with_umi is enabled.

drpatelh · 2022-12-18T19:52:40Z

@MatthiasZepper what are your thoughts on this? The easiest option is to never run MarkDuplicates when --with_umi is specified. Weren't you running MarkDuplicates with UMIs to investigate technical vs biological duplication? That wouldn't be possible anymore with the easy fix.

MatthiasZepper · 2022-12-19T15:11:49Z

I did run some comparative analyses, but mostly with manually written bash scripts, because I didn't trust myself to really understand all the minute details of the pipeline. I get why the channel name ch_genome_bam is reused throughout the pipeline (so processes can be skipped easily), but it always gives me a headache to retrace the sequential connection of processes.

Since both MARK_DUPLICATES_PICARD and DEDUP_UMI_UMITOOLS_GENOME take ch_genome_bam as input and write their output back to it, I repeatedly asked myself whether they are executed one after the other or run in parallel as soon as the aligner's output is present. Furthermore, I am still unsure which "version" of ch_genome_bam the processes BEDTOOLS_GENOMECOV, PRESEQ_LCEXTRAP and STRINGTIE_STRINGTIE use.

Biologically, there is indeed little use in running MarkDuplicates after umi-tools dedup, although the two tools differ slightly in their strategies. Lacking UMI information, MarkDuplicates can't distinguish biological from technical duplication, but it can specifically spot optical duplicates by their position when given appropriate parameters for the flow cell type and instrument via ext.args.

In summary: weak agreement. I do not really see a use case for running both tools, and I agree that most users intuitively expect an either-or scenario between the two. On the other hand, putting that additional entry in the params file is also straightforward.

drpatelh · 2022-12-19T21:43:11Z

> I get why the channel name ch_genome_bam is reused throughout the pipeline (so processes can be skipped easily)

Yes. If we had explicit names for all of these channels, tracing back the original input/output channels could get just as complicated. Reusing one name lets us add or remove aligners and other processes with minimal effort. It's a double-edged sword, but I see where you are coming from.
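To make the reassignment pattern under discussion concrete, here is a minimal Python sketch (not the pipeline's Nextflow code). It assumes that each optional step reads the current ch_genome_bam and rebinds the name to its own output, so steps are chained in script order rather than run side by side. The function names and the order of the steps are hypothetical and chosen only for illustration.

```python
# Hypothetical stand-ins for pipeline steps. A "channel" is modelled as a
# list of labels recording which steps have already processed the BAM.
def umi_dedup(bam):
    return bam + ["umi_dedup"]


def mark_duplicates(bam):
    return bam + ["markdup"]


def run(with_umi, run_markduplicates):
    ch_genome_bam = ["aligned"]

    if with_umi:
        # Rebinding: later references see the deduplicated output.
        ch_genome_bam = umi_dedup(ch_genome_bam)

    # A consumer wired up here sees whatever is bound at this point.
    early_consumer_input = ch_genome_bam

    if run_markduplicates:
        # Takes the current binding as input, so it runs after umi_dedup.
        ch_genome_bam = mark_duplicates(ch_genome_bam)

    # A consumer wired up here sees the output of every enabled step.
    late_consumer_input = ch_genome_bam
    return early_consumer_input, late_consumer_input


early, late = run(with_umi=True, run_markduplicates=True)
print(early)  # ['aligned', 'umi_dedup']
print(late)   # ['aligned', 'umi_dedup', 'markdup']
```

Under this assumption, which "version" a downstream process receives depends only on where in the script it reads the name, and skipping a step simply leaves the binding untouched.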

> and agree that most users intuitively expect it to be an either-or scenario between the two

OK. I will hard-code skipping Picard MarkDuplicates if UMIs are present. It can always be run outside of the pipeline if required in edge cases.
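The gating rule described here can be sketched as a small Python predicate. This is an illustration of the stated behaviour only, not the change made in the pipeline; the function name is hypothetical, and the two arguments mirror the --with_umi and --skip_markduplicates parameters mentioned in this thread.

```python
def should_run_markduplicates(with_umi: bool, skip_markduplicates: bool) -> bool:
    """Return True only when neither flag rules MarkDuplicates out."""
    # UMIs present: deduplication is handled by the UMI-based step instead.
    # skip_markduplicates: the user opted out explicitly.
    return not (with_umi or skip_markduplicates)


for with_umi in (False, True):
    for skip in (False, True):
        decision = should_run_markduplicates(with_umi, skip)
        print(f"with_umi={with_umi}, skip_markduplicates={skip} -> run={decision}")
```

With this rule, setting --with_umi alone is enough to skip MarkDuplicates, so no extra entry is needed in the params file.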

drpatelh · 2022-12-19T22:33:09Z

Fixed in #911

Fix #891

m3hdad · 2024-03-26T12:28:30Z

Objection, your honor! dupRadar needs duplicates to be marked in preprocessing!

https://nfcore.slack.com/archives/CE8SSJV3N/p1711455688377539?thread_ts=1666890604.450699&cid=CE8SSJV3N

berguner added the enhancement label Oct 28, 2022

drpatelh added this to the 3.10 milestone Dec 12, 2022

drpatelh added a commit to drpatelh/nf-core-rnaseq that referenced this issue Dec 19, 2022

Fix nf-core#891

cf4f463

drpatelh mentioned this issue Dec 19, 2022

Fix #891 #911

Merged

drpatelh closed this as completed Dec 19, 2022

drpatelh added a commit that referenced this issue Dec 19, 2022

Merge pull request #911 from drpatelh/updates

1160e14

Fix #891
