
Allow column names in all dedup backends, and allow sorting by arbitrary columns #162

Merged
merged 27 commits into master from dedup-cols
Mar 17, 2024

Conversation

Phlya
Member

@Phlya Phlya commented Nov 16, 2022

First try to fix #161

Now all backends support column names (not numbers even in cython) to define used chrom/pos/strand columns.

@Phlya Phlya requested review from agalitsyna and golobor November 16, 2022 15:00
@Phlya Phlya changed the title Allow column names in all backends Allow column names in all dedup backends Nov 22, 2022
@golobor
Member

golobor commented Nov 22, 2022

quick Q - what would happen if a .pairs file lacks a header? would dedup fail entirely? If yes, could we (or, should we) allow both integer and string inputs?
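One way both input kinds could be supported is sketched below (a hypothetical helper, not the actual pairtools code): accept an integer index as-is, which would keep headerless files working, and resolve a string name against the #columns header line.

```python
# Hypothetical sketch (not the actual pairtools implementation): resolve a
# user-supplied chrom/pos/strand column to an integer index. An int is taken
# as-is (useful for headerless .pairs files); a string name is looked up in
# the list of columns parsed from the "#columns:" header line.

def resolve_column(col, header_columns):
    if isinstance(col, int):
        return col
    try:
        return header_columns.index(col)
    except ValueError:
        raise KeyError(f"column {col!r} not found in header {header_columns}")

header = ["readID", "chrom1", "pos1", "chrom2", "pos2", "strand1", "strand2"]
print(resolve_column("pos1", header))  # → 2
print(resolve_column(4, header))       # → 4
```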

@Phlya
Member Author

Phlya commented Nov 22, 2022

See Sasha's comment here #161 (comment)

@Phlya
Member Author

Phlya commented Nov 25, 2022

@golobor have you managed to try it with your data where you needed this?

@golobor
Member

golobor commented Nov 25, 2022 via email

@Phlya
Member Author

Phlya commented Nov 25, 2022

Did you remember to sort by the right columns? And I wonder then whether with very high % duplication it might be much slower...

@MasterChief1O7

Hi, here are the profiler reports for the old and new ways of deduplication. I think the long runtime is mainly due to the higher number of duplicates, as mentioned in this thread, because everything else looks the same except the timings.

I sorted it with pairtools sort, so it is not sorted w.r.t. the restriction fragment columns - but ideally shouldn't that be the same, since the fragment order follows the fragments' locations? Also, pairtools sort doesn't have an option to specify column names, but if needed I can sort manually and then rerun the deduplication - let me know.

here are the commands that I used:

sname="nuclei_1_R1.lane1.ce11.01.sorted.pairs.restricted.gz"

python -m profile -s cumtime ~/.conda/envs/pairtools_test/bin/pairtools dedup \
    --max-mismatch 3 \
    --mark-dups \
    --output \
        >( pairtools split \
            --output-pairs test.nodups.pairs.gz \
            --output-sam test.nodups.bam \
         ) \
    --output-unmapped \
        >( pairtools split \
            --output-pairs test.unmapped.pairs.gz \
            --output-sam test.unmapped.bam \
         ) \
    --output-dups \
        >( pairtools split \
            --output-pairs test.dups.pairs.gz \
            --output-sam test.dups.bam \
         ) \
    --output-stats test.dedup.stats \
    $sname > pairtools_profile_old.txt
    
echo "done with old way"

python -m profile -s cumtime ~/.conda/envs/pairtools_test/bin/pairtools dedup \
    --max-mismatch 3 \
    --mark-dups \
    --p1 "rfrag1"\
    --p2 "rfrag2"\
    --output \
        >( pairtools split \
            --output-pairs test.nodups.pairs.gz \
            --output-sam test.nodups.bam \
         ) \
    --output-unmapped \
        >( pairtools split \
            --output-pairs test.unmapped.pairs.gz \
            --output-sam test.unmapped.bam \
         ) \
    --output-dups \
        >( pairtools split \
            --output-pairs test.dups.pairs.gz \
            --output-sam test.dups.bam \
         ) \
    --output-stats test.dedup.stats \
    $sname > pairtools_profile_new.txt
    

I have also attached the bash script with both commands that I used, and a sample pairs file too (I generated it manually using pandas, sorting w.r.t. [chrom1,chrom2,pos1,pos2] as I think pairtools sort does, because the head command was only giving me multimappers and walks). If you want to test something else, please feel free to ask.

pairs_sample.pairs.txt
pairtools_profile_new.txt
pairtools_profile_old.txt
dedup_pairsam.sh.txt

@Phlya
Member Author

Phlya commented Nov 28, 2022

Wow that is a huge slow-down! Thanks for the report. Would be good to profile the KDTree itself in regard to its performance with different % duplicates...

Ah, and by the way, when using restriction fragments I would use --max-mismatch 0 - you don't want to consider neighbouring fragments as duplicates. That in itself might help a little with the speed, too.

For the sorting, it should be the same, you are right! But perhaps we should allow column names in sort too, anyway.
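The matching criterion being discussed can be sketched roughly as follows (a hypothetical, quadratic-time simplification; the actual backends use a KD-tree or the cython streaming algorithm for the neighbour search). It shows why --max-mismatch 0 matters with fragment-based positions: any tolerance above zero would merge pairs mapped to neighbouring fragments.

```python
# Hypothetical simplification of distance-based duplicate marking: two pairs
# are duplicates if both sides agree in chrom and strand and the positions
# differ by at most max_mismatch on each side.

def mark_duplicates(pairs, max_mismatch=3):
    """pairs: list of (chrom1, pos1, strand1, chrom2, pos2, strand2).
    Returns one bool per pair; True marks a duplicate of an earlier pair."""
    kept = []
    dup_flags = []
    for p in pairs:
        is_dup = any(
            p[0] == k[0] and p[2] == k[2] and p[3] == k[3] and p[5] == k[5]
            and abs(p[1] - k[1]) <= max_mismatch
            and abs(p[4] - k[4]) <= max_mismatch
            for k in kept
        )
        dup_flags.append(is_dup)
        if not is_dup:
            kept.append(p)
    return dup_flags

pairs = [
    ("chr1", 100, "+", "chr2", 500, "-"),
    ("chr1", 102, "+", "chr2", 501, "-"),  # within 3 bp on both sides
    ("chr1", 100, "+", "chr2", 500, "-"),  # exact repeat
]
print(mark_duplicates(pairs, max_mismatch=3))  # → [False, True, True]
print(mark_duplicates(pairs, max_mismatch=0))  # → [False, False, True]
```

With fragment IDs in the position columns, the second pair here would correspond to a contact on a neighbouring fragment, which max-mismatch 0 correctly keeps as a distinct pair.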

@Phlya
Member Author

Phlya commented Nov 28, 2022

Ah, what are the actual numbers that you get with the two approaches? I.e. how many duplicates, and total pairs?

@MasterChief1O7

Hi, here are the stats and profiler files for both cases. This time I used --max-mismatch 0 for the new case; it did reduce the time, but it's still quite long compared to the old method. Also, if you need the whole pairs file, let me know - I can share it on Slack.

pairtools_profile_new.txt
pairtools_profile_old.txt
pairtools_new.dedup.stats.txt
pairtools_old.dedup.stats.txt

@Phlya
Member Author

Phlya commented Nov 28, 2022

Thank you! Yeah, could you share the full file please?

@golobor
Member

golobor commented Nov 28, 2022 via email

@Phlya
Member Author

Phlya commented Nov 28, 2022

Why would you do this when you are using annotated restriction sites?..

@Phlya
Member Author

Phlya commented Nov 28, 2022

@MasterChief1O7 can you try just running the "new" version with --backend cython?

@Phlya
Member Author

Phlya commented Nov 28, 2022

It's only a temporary solution in the long run, but it works for me, and it's even faster than the "old" way.

@MasterChief1O7

Wow, true! Just tried it and it worked. One small thing I noticed: the stats are slightly different, just by ~100 I guess - is that an issue? But some of the stats are quite different (like the cis_ ones).

I just added --backend cython and nothing else:

sname="nuclei_1_R1.lane1.ce11.01.sorted.pairs.restricted.gz"

python -m profile -s cumtime ~/.conda/envs/pairtools_test/bin/pairtools dedup \
    --max-mismatch 3 \
    --mark-dups \
    --backend cython \
    --output \
        >( pairtools split \
            --output-pairs pairtools_old_cython.nodups.pairs.gz \
            --output-sam pairtools_old_cython.nodups.bam \
         ) \
    --output-unmapped \
        >( pairtools split \
            --output-pairs pairtools_old_cython.unmapped.pairs.gz \
            --output-sam pairtools_old_cython.unmapped.bam \
         ) \
    --output-dups \
        >( pairtools split \
            --output-pairs pairtools_old_cython.dups.pairs.gz \
            --output-sam pairtools_old_cython.dups.bam \
         ) \
    --output-stats pairtools_old_cython.dedup.stats \
    $sname > pairtools_profile_old_cython.txt
    
echo "done with old way"

python -m profile -s cumtime ~/.conda/envs/pairtools_test/bin/pairtools dedup \
    --max-mismatch 0 \
    --mark-dups \
    --backend cython \
    --p1 "rfrag1"\
    --p2 "rfrag2"\
    --output \
        >( pairtools split \
            --output-pairs pairtools_new_cython.nodups.pairs.gz \
            --output-sam pairtools_new_cython.nodups.bam \
         ) \
    --output-unmapped \
        >( pairtools split \
            --output-pairs pairtools_new_cython.unmapped.pairs.gz \
            --output-sam pairtools_new_cython.unmapped.bam \
         ) \
    --output-dups \
        >( pairtools split \
            --output-pairs pairtools_new_cython.dups.pairs.gz \
            --output-sam pairtools_new_cython.dups.bam \
         ) \
    --output-stats pairtools_new_cython.dedup.stats \
    $sname > pairtools_profile_new_cython.txt

[screenshot attached]

@Phlya
Member Author

Phlya commented Nov 28, 2022

Mh, with --max-mismatch 0 there shouldn't be any difference, I think... Such a huge effect on cis_Xkb+ is weird! It might be saving the columns in the wrong order or something like that...

@golobor
Member

golobor commented Nov 28, 2022

Why would you do this when you are using annotated restriction sites?..

b/c the site of the ligation junction is typically not sequenced ("unconfirmed" ligations, as we call them in a parallel discussion), so we can't be sure where exactly the restriction happened. With 4 bp cutters, consecutive restriction sites can be as close as tens of bp apart...

@Phlya
Member Author

Phlya commented Nov 28, 2022

Ok, then actually, you could use only confirmed ligations! That was Sasha's @agalitsyna idea, which really made the data much cleaner.

@agalitsyna
Member

It's weird that the total number of cis interactions has not really changed (+100), but all the cis_X+ stats have dropped substantially. Looks like one of the backends might interact differently with the PairCounter that produces the stats, doesn't it?

@agalitsyna
Member

@Phlya It might be happening here:

int(cols[p1ind]),

p1 is passed as position to cython backend, while pandas-based backend always adds "pos1" and "pos2" to the PairCounter here:
def add_pairs_from_dataframe(self, df, unmapped_chrom="!"):
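A hypothetical illustration of how such a column mix-up would produce exactly this symptom (function name and thresholds are made up for the example): the cis_Xkb+ bins count cis pairs by genomic separation, so if restriction fragment IDs are fed in as positions, apparent separations shrink to a few units and the long-range bins collapse, while the total cis count is untouched.

```python
# Made-up helper for illustration: count how many distance thresholds a
# cis pair exceeds. This mimics how cis_Xkb+ style stats are binned.

def cis_bins_exceeded(pos1, pos2,
                      thresholds=(1_000, 2_000, 4_000, 10_000, 20_000, 40_000)):
    dist = abs(pos2 - pos1)
    return sum(dist >= t for t in thresholds)

# with genomic coordinates, a 25 kb cis contact lands in the long-range bins
print(cis_bins_exceeded(7_100_000, 7_125_000))  # → 5
# with fragment IDs mistakenly used as positions, it looks ultra-short-range
print(cis_bins_exceeded(19_520, 19_523))        # → 0
```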

@Phlya
Member Author

Phlya commented Nov 28, 2022

Yeah that's exactly the kind of thing I was thinking about, but haven't managed to look for it! Thanks, I'll investigate/fix this.

@golobor golobor closed this Nov 28, 2022
@golobor golobor reopened this Nov 28, 2022
@golobor
Member

golobor commented Nov 28, 2022

Ok, then actually, you could use only confirmed ligations!

Use them for what?.. I'm not sure if I follow - we already have rather little data due to single cell resolution, it doesn't feel like a good idea to throw away a major fraction of it...

@Phlya
Member Author

Phlya commented Nov 28, 2022

Ok, then actually, you could use only confirmed ligations!

Use them for what?.. I'm not sure if I follow - we already have rather little data due to single cell resolution, it doesn't feel like a good idea to throw away a major fraction of it...

I guess it depends on how deeply you sequenced your libraries, but generally, since the material is amplified with phi29 and randomly fragmented, if you sequence deep enough you will directly read through every ligation that happened. Then you don't actually lose any data; you are just sure that every contact you describe comes from a ligation and not from polymerase hopping (since you actually filter by the presence of the ligation site sequence at the apparent ligation junction).

@Phlya
Member Author

Phlya commented Dec 2, 2022

Fixed now! @MasterChief1O7

@Phlya
Member Author

Phlya commented Dec 2, 2022

And also found a small preexisting bug where dedup was losing a small portion of pairs, fixed here...

@Phlya
Member Author

Phlya commented Dec 2, 2022

Actually there are still very strange discrepancies between cython and scipy/sklearn even in just mapping stats... which is bizarre!

cython

total	1498859
total_unmapped	50862
total_single_sided_mapped	63125
total_mapped	1384872
total_dups	1304062
total_nodups	80810
cis	71230
trans	9580

scipy

total	1498859
total_unmapped	50862
total_single_sided_mapped	62525
total_mapped	1385472
total_dups	1304062
total_nodups	81410
cis	71830
trans	9580
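For what it's worth, the discrepancy in the two tables above is internally consistent: the cython run classifies 600 more pairs as single-sided, and exactly those 600 pairs are missing from its total_mapped, total_nodups and cis counts. A quick arithmetic check using the numbers as reported:

```python
# Stats copied from the two tables above; deltas are scipy minus cython.
cython = dict(total_single_sided_mapped=63125, total_mapped=1384872,
              total_nodups=80810, cis=71230, trans=9580)
scipy = dict(total_single_sided_mapped=62525, total_mapped=1385472,
             total_nodups=81410, cis=71830, trans=9580)

deltas = {k: scipy[k] - cython[k] for k in cython}
print(deltas)
# → {'total_single_sided_mapped': -600, 'total_mapped': 600,
#    'total_nodups': 600, 'cis': 600, 'trans': 0}
```

So the two backends disagree on the same 600 cis pairs being single-sided vs. both-sides mapped, while total (1498859) and total_dups (1304062) agree exactly.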

@Phlya
Member Author

Phlya commented Dec 6, 2022

And the pair types also differ just slightly...

[Screenshot 2022-12-06 at 15:39:33 attached]

@Phlya Phlya changed the title Allow column names in all dedup backends Allow column names in all dedup backends, and allow sorting by arbitrary columns Dec 6, 2022
@golobor
Member

golobor commented Mar 9, 2024

@Phlya , sorry for a long break - what's the status of this? Is it ready to be merged?

@Phlya
Member Author

Phlya commented Mar 9, 2024

Oh, I don't remember anymore either, @golobor. I think the functionality here was all good to go, but there were some weird things I discovered that are described above...

@golobor
Member

golobor commented Mar 16, 2024

So, indeed, some lines become duplicated in the scipy engine:

(main) [anton.goloborodko@clip-login-0 test_dedup]$ cat pairs_sample.pairs.dedup_scipy.csv | grep NB551430:778:HNLWKBGXM:3:12508:20725:4668
NB551430:778:HNLWKBGXM:3:12508:20725:4668       chrV    7125494 chrV    7125100 -       +       UU      60      60      19523   7123598 7126019 19523   7123598 7126019
NB551430:778:HNLWKBGXM:3:12508:20725:4668       chrV    7125494 chrV    7125100 -       +       UU      60      60      19523   7123598 7126019 19523   7123598 7126019
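A quick way to surface this symptom across the whole output is to count readIDs (hypothetical snippet; the first two sample lines are abridged from the output above, the third is a made-up non-duplicated read):

```python
# Hypothetical check: find readIDs emitted more than once in deduplicated
# output, which should never happen after dedup.
from collections import Counter

dedup_lines = [
    "NB551430:778:HNLWKBGXM:3:12508:20725:4668\tchrV\t7125494\tchrV\t7125100\t-\t+\tUU",
    "NB551430:778:HNLWKBGXM:3:12508:20725:4668\tchrV\t7125494\tchrV\t7125100\t-\t+\tUU",
    "NB551430:778:HNLWKBGXM:3:11111:11111:1111\tchrV\t1000000\tchrV\t1000500\t+\t-\tUU",  # made-up read
]
counts = Counter(line.split("\t")[0] for line in dedup_lines)
repeated = {rid: n for rid, n in counts.items() if n > 1}
print(repeated)  # → {'NB551430:778:HNLWKBGXM:3:12508:20725:4668': 2}
```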

@Phlya
Member Author

Phlya commented Mar 16, 2024

Always? Or after this PR?

Member

@golobor golobor left a comment


all is well now

@golobor golobor merged commit 91b40fd into master Mar 17, 2024
5 checks passed
@golobor golobor deleted the dedup-cols branch March 17, 2024 13:37
Successfully merging this pull request may close these issues.

Allow arbitrary columns in dedup