UMI Support for hts_SuperDeduper! #261
Conversation
…per" This reverts commit 9003e62.
In hindsight, umi_tools considers the edit distance between UMIs during deduplication... let me know if you would prefer the implementation to include this.
Looks good, couple minor issues.
common/src/read.cpp
Outdated
@@ -35,6 +35,19 @@ std::string ReadBase::bit_to_str(const BitSet &bits) {
    return out;
}

boost::optional<BitSet> ReadBase::bitjoin(const boost::optional<BitSet> &bit1, const boost::optional<BitSet> &bit2, const char& del) {
    if (del == '\0') {
        return bit2;
why return bit2 in this case?
bit1 is the UMI sequence key, bit2 is the read sequence key. If the delimiter is unset (the UMI method is not enabled), it just returns the sequence key and continues with the normal SuperDeduper algorithm. This was ultimately just how I chose to add the UMI mode, specifically choosing to include only one parameter to enable it instead of a switch plus an option for the delimiter. I also had mixed feelings about this, lol.
I think I would change this to just a bit join that doesn't do any special logic related to SuperDeduper. Expanding on this point, the del variable is not necessary for the join; it's only used for special logic belonging to SuperDeduper. This is the programming principle of separation of concerns.
common/src/read.h
Outdated
const std::string get_umi(const char& del) {
    size_t idx = id.rfind(del);
    if (idx == std::string::npos) {
        throw HtsRuntimeException("Did not detect extracted UMI. Be sure hts_ExtractUMI is run prior to hts_Superdeduper");
This error message is too specific; maybe "Did not detect extracted UMI. Be sure hts_ExtractUMI is run prior to the current operation".
if (result.size() < 7) {
    throw HtsRuntimeException("Read ID misformatted. Does not have appropriate number of \":\" delimited columns for DRAGEN UMI format");
}
umi = result[7];
result.size() < 7: should that be < 8? You are getting the 8th element here.
So the read ID can actually have only 7 fields; the 8th field does not exist on a lot of reads and is optional, holding the UMI, from my understanding. That is why I check if there are fewer than 7, which I understood to be the minimum.
My point is that you will get undefined behavior reading result[7] if the vector is size 7, so you need to check the size here.
OH, you're totally right, I don't know why I thought the check would be < 7; it is definitely < 8 here. Illumina headers would definitely need 8 fields in this situation. Making the changes now.
if (del != '\0') {
    umi_seq = "";
    for (const auto &r : i -> get_reads_non_const()) {
Use get_reads() when you are not modifying the reads and are iterating by const reference.
//check for existence, store or compare quality and replace:
if ( tmpAvg < discard_qual ){ // average qual must be less than discard_qual, ignored
    counters.increment_ignored();
} else if (auto key=i->get_key(start, length)) { // check for duplicate
} else if (auto key=i->bitjoin(umi_bit, (i -> get_key(start, length)), del)) { // check for duplicate
Thinking about this more, I thought the UMI was a way to dedup directly? Do we want to use both the key and the umi to dedup?
Hello, so umi_tools uses the mapping position as well as the UMI for this deduplication. I figured that using UMIs in addition to the sequence key would be a good proxy for this and would add to SuperDeduper's algorithm instead of replacing it. But if there is a better method we could implement, I will be happy to change it.
Yeah, that makes sense. As for edit distance, I'm not sure there is an easy way to integrate that into SuperDeduper; do you know how they do that in umi_tools?
It would definitely require changing the code a decent bit. I haven't looked at umi_tools' algorithm, so I don't know exactly how they do it, and they have a few different methods. But here is what the documentation says:
"All methods start by identifying the reads with the same mapping position.
The simplest methods, unique and percentile, group reads with the exact same UMI. The network-based methods, cluster, adjacency and directional, build networks where nodes are UMIs and edges connect UMIs with an edit distance <= threshold (usually 1). The groups of reads are then defined from the network in a method-specific manner. For all the network-based methods, each read group is equivalent to one read count for the gene."
I'm inclined to say if that feature is wanted you could add it on in a new pr.
Cool. I am making the changes you suggested and will push soon. Are we good to merge or should we wait for more feedback?
lgtm
Somewhere, however, incorrect QUAL characters are now getting introduced. Brad, have you seen this anywhere else on this branch? In hts_Overlapper.
I have not. Is this an issue present in the test dataset? Also, what is the exact issue? Is it just characters somehow being introduced or switched?
No, in a large dataset samtools fails with an invalid QUAL character. Looking at the read in the BAM file, it shows a bad character in the middle of a read, but only in overlapped SE reads; pairs work just fine. I'm backing up to before your changes now to see if the error is repeated there, and will report back.
Matt
Hello again,
I have added support for UMI-based PCR deduplication. It works by extracting the UMI from the read ID, converting it to bits, and appending it onto the sequence key used for PCR duplicate removal, essentially just extending the sequence key. I also added an extra test function in hts_SuperDeduper_test that was useful during development. Additionally, I have run this new version of the tool on a variety of datasets and it appears to work as intended. I originally wanted to provide some data for this PR, but ultimately realized that the testing fell more in line with benchmarking than with validating the algorithm. That said, if you would like to see some of these results, I would be happy to add what I have found. It is worth noting that the effectiveness of this method on single-end reads, particularly TAGseq experiments with lower complexity, while comparable to umi_tools, is ultimately worse and for some reason very sensitive to whether or not you run hts_CutTrim before it. I am currently working on finding the best parameter settings for its application to TAGseq.
Anyways, let me know what you guys think, I am excited to finally have some eyes on this one.