UMI Support for hts_SuperDeduper! #261
Conversation
…per" This reverts commit 9003e62.
In hindsight, umi_tools considers the edit distance between UMIs during deduplication... let me know if you would prefer the implementation to include this.
Looks good, couple minor issues.
common/src/read.cpp
Outdated
@@ -35,6 +35,19 @@ std::string ReadBase::bit_to_str(const BitSet &bits) {
    return out;
}

boost::optional<BitSet> ReadBase::bitjoin(const boost::optional<BitSet> &bit1, const boost::optional<BitSet> &bit2, const char& del) {
    if (del == '\0') {
        return bit2;
why return bit2 in this case?
bit1 is the UMI sequence key, bit2 is the read sequence key. If the delimiter is unset (the UMI method is not enabled), it just returns the sequence key and continues with the normal SuperDeduper algorithm. This was ultimately just how I chose to add the UMI mode, specifically choosing to include only one parameter to enable it instead of a switch plus an option for the delimiter. I also had mixed feelings about this, lol.
I think I would change this to just a bit join that doesn't do any special logic related to SuperDeduper. Expanding on this point, the del variable is not necessary for the join; it's only used for special logic belonging to SuperDeduper. This is the programming principle of separation of concerns.
common/src/read.h
Outdated
const std::string get_umi(const char& del) {
    size_t idx = id.rfind(del);
    if (idx == std::string::npos) {
        throw HtsRuntimeException("Did not detect extracted UMI. Be sure hts_ExtractUMI is run prior to hts_Superdeduper");
This error message is too specific; maybe "Did not detect extracted UMI. Be sure hts_ExtractUMI is run prior to the current operation".
if (result.size() < 7) {
    throw HtsRuntimeException("Read ID misformatted. Does not have appropriate number of \":\" delimited columns for DRAGEN UMI format");
}
umi = result[7];
result.size() < 7: should that be < 8? You are getting the 8th element here.
So the read ID can actually have only 7 fields; the 8th field does not exist on a lot of reads and is optional, holding the UMI, from my understanding. That is why I check if there are fewer than 7, which I understood to be the minimum.
My point is that you will get undefined behavior reading result[7] if the vector is size 7, so you need to check the size here.
OH, you're totally right, I don't know why I thought the check would be < 7; it is definitely < 8 here. Illumina headers would definitely need 8 fields in this situation. Making the changes now.
if (del != '\0') {
    umi_seq = "";
    for (const auto &r : i -> get_reads_non_const()) {
Use get_reads() when you are not modifying the reads and are iterating by const reference.
//check for existence, store or compare quality and replace:
if ( tmpAvg < discard_qual ){ // average qual must be less than discard_qual, ignored
    counters.increment_ignored();
} else if (auto key=i->get_key(start, length)) { // check for duplicate
} else if (auto key=i->bitjoin(umi_bit, (i -> get_key(start, length)), del)) { // check for duplicate
Thinking about this more, I thought the UMI was a way to dedup directly? Do we want to use both the key and the umi to dedup?
Hello, so umi_tools uses the mapping position as well as the UMI for this deduplication. I figured that using UMIs in addition to the sequence key would be a good proxy for this and would add to SuperDeduper's algorithm instead of replacing it. But if there is a better method we could implement, I will be happy to change it.
Yeah, that makes sense. As for edit distance, I'm not sure there is an easy way to integrate that into SuperDeduper; do you know how they do that in umi_tools?
It would definitely require changing the code a decent bit. I haven't looked at umi_tools' algorithm, so I don't know exactly how they do it, and they have a few different methods. But here is what the documentation says:
"All methods start by identifying the reads with the same mapping position.
The simplest methods, unique and percentile, group reads with the exact same UMI. The network-based methods, cluster, adjacency and directional, build networks where nodes are UMIs and edges connect UMIs with an edit distance <= threshold (usually 1). The groups of reads are then defined from the network in a method-specific manner. For all the network-based methods, each read group is equivalent to one read count for the gene."
I'm inclined to say if that feature is wanted you could add it on in a new pr.
Cool. I am making the changes you suggested and will push soon. Are we good to merge or should we wait for more feedback?
lgtm
Somewhere, however, incorrect QUAL characters are now getting introduced. Brad, have you seen this anywhere else on this branch? In hts_Overlapper.
I have not. Is this an issue present in the test dataset? Also, what is the exact issue? Is it just characters somehow being introduced or switched?
No, in a large dataset samtools fails with an invalid QUAL character. Looking at the read in the BAM file, it shows a bad character in the middle of a read, but only in overlapped SE reads; pairs work just fine. I'm backing up to before your changes now to see if the error is repeated there, and will report back.
Matt
Hello again,
I have added support for UMI-based PCR deduplication. It works by extracting the UMI from the read ID, converting it to bits, and appending it onto the sequence key used for PCR duplicate removal, essentially just extending the sequence key. I also added an extra test function in hts_SuperDeduper_test that was useful during development. Additionally, I have run this new version of the tool on a variety of datasets and it appears to work as intended. I originally wanted to provide some data for this PR, but ultimately realized that the testing fell more in line with benchmarking than with validating the algorithm. That said, if you would like to see some of these results, I would be happy to add what I have found. It is worth noting that the effectiveness of this method on single-end reads, particularly TAGseq experiments with lower complexity, while comparable to umi_tools, is ultimately worse and for some reason very sensitive to whether or not you run hts_CutTrim before it. I am currently working on finding the best parameter settings for its application to TAGseq.
Anyways, let me know what you guys think, I am excited to finally have some eyes on this one.