Allow column names in all dedup backends, and allow sorting by arbitrary columns #162
Conversation
quick Q - what would happen if a .pairs file lacks a header? would dedup fail entirely? If yes, could we (or, should we) allow both integer and string inputs? |
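For illustration, a hedged sketch of how a dedup CLI could accept either integer indices or header names (the function and default header below are hypothetical, not the actual pairtools API):

```python
# Hypothetical sketch: resolve a user-supplied column spec against a
# .pairs header. Accepts a 0-based integer index (works even when the
# file has no header) or a column name (requires a header).
DEFAULT_PAIRS_COLUMNS = [
    "readID", "chrom1", "pos1", "chrom2", "pos2", "strand1", "strand2"
]

def resolve_column(spec, header_columns=None):
    """Return a 0-based column index from an int or a column name."""
    if isinstance(spec, int):
        return spec  # integer input: usable with or without a header
    columns = header_columns if header_columns is not None else DEFAULT_PAIRS_COLUMNS
    try:
        return columns.index(spec)
    except ValueError:
        raise ValueError(f"Column {spec!r} not found; does the file lack a header?")
```

With this kind of fallback, integer inputs would keep working on headerless files, while string inputs would fail with a clear message.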
See Sasha's comment here #161 (comment) |
@golobor have you managed to try it with your data where you needed this? |
Yeah, it works, but seems surprisingly slow. Ankit is profiling dedup now,
we'll get back to you in a few days
|
Did you remember to sort by the right columns? And I wonder then whether with very high % duplication it might be much slower... |
Hi, here are the profiler reports for the old and new ways of deduplication. I think the long runtime is mainly due to the larger number of duplicates, as mentioned in this thread, because everything else looks the same except the timings. I sorted the file with pairtools sort, so it is not sorted by the restriction-fragment columns, but ideally shouldn't that be the same, since the fragment order follows the fragment locations? Also, pairtools sort doesn't have an option to take column names, but if needed I can sort manually and rerun the deduplication, let me know. Here are the commands that I used: sname="nuclei_1_R1.lane1.ce11.01.sorted.pairs.restricted.gz"
python -m profile -s cumtime ~/.conda/envs/pairtools_test/bin/pairtools dedup \
--max-mismatch 3 \
--mark-dups \
--output \
>( pairtools split \
--output-pairs test.nodups.pairs.gz \
--output-sam test.nodups.bam \
) \
--output-unmapped \
>( pairtools split \
--output-pairs test.unmapped.pairs.gz \
--output-sam test.unmapped.bam \
) \
--output-dups \
>( pairtools split \
--output-pairs test.dups.pairs.gz \
--output-sam test.dups.bam \
) \
--output-stats test.dedup.stats \
$sname > pairtools_profile_old.txt
echo "done with old way"
python -m profile -s cumtime ~/.conda/envs/pairtools_test/bin/pairtools dedup \
--max-mismatch 3 \
--mark-dups \
--p1 "rfrag1" \
--p2 "rfrag2" \
--output \
>( pairtools split \
--output-pairs test.nodups.pairs.gz \
--output-sam test.nodups.bam \
) \
--output-unmapped \
>( pairtools split \
--output-pairs test.unmapped.pairs.gz \
--output-sam test.unmapped.bam \
) \
--output-dups \
>( pairtools split \
--output-pairs test.dups.pairs.gz \
--output-sam test.dups.bam \
) \
--output-stats test.dedup.stats \
$sname > pairtools_profile_new.txt
I have also attached the bash script with both commands that I used, and a sample pairs file too (I generated it manually using pandas, sorting by [chrom1, chrom2, pos1, pos2] as I think pairtools sort does, because the head command was only giving me multimappers and walks). If you want to test something else, please feel free to ask. pairs_sample.pairs.txt |
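The manual pandas sort mentioned above could look roughly like this toy sketch (the values are illustrative; a real .pairs file would be read with read_csv using sep="\t" and comment="#" to skip the header block):

```python
import pandas as pd

# Toy pairs-like table; the real file would be read from disk instead.
df = pd.DataFrame({
    "chrom1": ["chrII", "chrI", "chrI"],
    "pos1":   [500, 1000, 200],
    "chrom2": ["chrII", "chrI", "chrI"],
    "pos2":   [800, 1200, 400],
    "rfrag1": [7, 3, 1],
    "rfrag2": [8, 4, 2],
})

# The manual sort described in the comment: block-sort by chromosomes,
# then by positions; the rfrag order then follows fragment location.
sorted_df = df.sort_values(["chrom1", "chrom2", "pos1", "pos2"]).reset_index(drop=True)
print(sorted_df["rfrag1"].tolist())  # -> [1, 3, 7]
```

This also shows why sorting by positions and sorting by fragment IDs should give the same order: fragment indices increase along the chromosome just like coordinates do.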
Wow, that is a huge slow-down! Thanks for the report. It would be good to profile the KDTree itself with regard to its performance with different % duplicates... Ah, and by the way, when using restriction fragments, I would use --max-mismatch 0. For the sorting, it should be the same, you are right! But perhaps we should allow column names in sort too, anyway. |
Ah, what are the actual numbers that you get with the two approaches? I.e. how many duplicates, and total pairs? |
Hi, here are the stats and profiler files for both the cases, this time I used pairtools_profile_new.txt |
Thank you! Yeah, could you share the full file please? |
A small clarification: we've been using a non-zero max mismatch, because these are single-cell data generated with phi29 followed by sonication.
|
Why would you do this when you are using annotated restriction sites?.. |
@MasterChief1O7 can you try just running the "new" version with --backend cython? |
It is a temporary solution in the long run, but it works for me, and it's even faster than the "old" way. |
Wow! True, just tried it and it worked. One small thing that I noticed: the stats are slightly different, just by ~100 I guess, is that an issue? But yeah, some of the stats are quite different (like the cis_ ones). I just added cython and nothing else: sname="nuclei_1_R1.lane1.ce11.01.sorted.pairs.restricted.gz"
python -m profile -s cumtime ~/.conda/envs/pairtools_test/bin/pairtools dedup \
--max-mismatch 3 \
--mark-dups \
--backend cython \
--output \
>( pairtools split \
--output-pairs pairtools_old_cython.nodups.pairs.gz \
--output-sam pairtools_old_cython.nodups.bam \
) \
--output-unmapped \
>( pairtools split \
--output-pairs pairtools_old_cython.unmapped.pairs.gz \
--output-sam pairtools_old_cython.unmapped.bam \
) \
--output-dups \
>( pairtools split \
--output-pairs pairtools_old_cython.dups.pairs.gz \
--output-sam pairtools_old_cython.dups.bam \
) \
--output-stats pairtools_old_cython.dedup.stats \
$sname > pairtools_profile_old_cython.txt
echo "done with old way"
python -m profile -s cumtime ~/.conda/envs/pairtools_test/bin/pairtools dedup \
--max-mismatch 0 \
--mark-dups \
--backend cython \
--p1 "rfrag1" \
--p2 "rfrag2" \
--output \
>( pairtools split \
--output-pairs pairtools_new_cython.nodups.pairs.gz \
--output-sam pairtools_new_cython.nodups.bam \
) \
--output-unmapped \
>( pairtools split \
--output-pairs pairtools_new_cython.unmapped.pairs.gz \
--output-sam pairtools_new_cython.unmapped.bam \
) \
--output-dups \
>( pairtools split \
--output-pairs pairtools_new_cython.dups.pairs.gz \
--output-sam pairtools_new_cython.dups.bam \
) \
--output-stats pairtools_new_cython.dedup.stats \
$sname > pairtools_profile_new_cython.txt |
Hm, with --max-mismatch 0 there shouldn't be any difference, I think... Such a huge effect on cis_Xkb+ is weird! It might be saving the columns in the wrong order or something like that... |
b/c the site of the ligation junction is typically not sequenced ("unconfirmed" ligations, as we call them in a parallel discussion), so we can't be sure where exactly restriction happened. With 4-bp cutters, consecutive restriction sites can be as close as tens of bp apart... |
Ok, then actually, you could use only confirmed ligations! That was Sasha's @agalitsyna idea, which really made the data much cleaner. |
It's weird that the total number of cis interactions has not really changed (+100), but all cis_X+ counts have dropped substantially. Looks like one of the backends might interact differently with the PairCounter that produces the stats, doesn't it? |
@Phlya It might be happening here: pairtools/pairtools/lib/dedup.py, line 436 in 3bfe8e3. p1 is passed as the position to the cython backend, while the pandas-based backend always adds "pos1" and "pos2" to the PairCounter here: pairtools/pairtools/lib/stats.py, line 469 in 6d5b4d4.
|
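To illustrate the suspected mismatch (a toy sketch, not the actual pairtools code): if the stats counter bins cis distances on whatever values it is handed, feeding it restriction-fragment indices instead of genomic positions would collapse the cis_Xkb+ bins while leaving the total cis count untouched:

```python
# Toy illustration of the suspected bug: the same pair, described by
# genomic positions vs. restriction-fragment indices. Values are made up.
pos1, pos2 = 10_000, 61_000    # genomic coordinates on the same chromosome
rfrag1, rfrag2 = 101, 105      # restriction-fragment indices for the same pair

def cis_dist(p1, p2):
    # Distance as a stats counter might compute it from its "position" columns.
    return abs(p2 - p1)

print(cis_dist(pos1, pos2))    # 51000 -> would land in a long-range cis bin
print(cis_dist(rfrag1, rfrag2))  # 4   -> would land in the shortest-range bin
```

Either way the pair counts as cis, which matches the observation that total cis barely moved while the distance-resolved cis_X+ stats dropped.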
Yeah that's exactly the kind of thing I was thinking about, but haven't managed to look for it! Thanks, I'll investigate/fix this. |
Use them for what?.. I'm not sure I follow; we already have rather little data due to single-cell resolution, and it doesn't feel like a good idea to throw away a major fraction of it... |
I guess it depends on how deeply you sequenced your libraries, but generally, since it's amplified with phi29 and randomly fragmented, if you sequence deep enough you will directly read through every ligation that happened. Then you don't actually lose any data; you are just sure every contact you describe comes from ligation and not from polymerase hopping (since you actually filter by the presence of the ligation-site sequence at the apparent ligation junction). |
Fixed now! @MasterChief1O7 |
And also found a small preexisting bug where dedup was losing a small portion of pairs, fixed here... |
Actually there are still very strange discrepancies between cython and scipy/sklearn, even in just the mapping stats... which is bizarre!
[cython and scipy stats tables not preserved in this export]
|
@Phlya , sorry for a long break - what's the status of this? Is it ready to be merged? |
Oh I also don't remember any more @golobor . I think the functionality here was all good to go, but there were some weird things I discovered that are described above... |
So, indeed, some lines become duplicated in the scipy engine:
|
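A quick pandas check of the kind that surfaces such duplicated lines (toy data; the real check would run on the scipy-engine dedup output):

```python
import pandas as pd

# Toy deduped output with one verbatim repeated row, mimicking the
# duplicated lines observed with the scipy engine.
rows = [
    ("readA", "chrI", 100, "chrI", 500),
    ("readB", "chrI", 100, "chrI", 500),  # distinct read, same coordinates: fine
    ("readB", "chrI", 100, "chrI", 500),  # exact repeat of the previous line: a bug
]
df = pd.DataFrame(rows, columns=["readID", "chrom1", "pos1", "chrom2", "pos2"])

# Fully identical rows (including readID) should never occur after dedup.
n_dup_lines = int(df.duplicated().sum())
print(n_dup_lines)  # -> 1
```

df.duplicated() flags only rows identical across all columns, so pairs that merely share coordinates are not counted.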
Always? Or after this PR? |
all is well now
First try to fix #161.
Now all backends support column names (not only numbers, even in cython) to define the chrom/pos/strand columns used.