
Allow column names in all dedup backends, and allow sorting by arbitrary columns #162

Merged
merged 27 commits into master from dedup-cols
Mar 17, 2024

Conversation

Phlya
Member

@Phlya Phlya commented Nov 16, 2022

First try to fix #161

Now all backends support column names (not numbers even in cython) to define used chrom/pos/strand columns.

@Phlya Phlya requested review from agalitsyna and golobor November 16, 2022 15:00
@Phlya Phlya changed the title Allow column names in all backends Allow column names in all dedup backends Nov 22, 2022
@golobor
Member

golobor commented Nov 22, 2022

quick Q - what would happen if a .pairs file lacks a header? would dedup fail entirely? If yes, could we (or, should we) allow both integer and string inputs?
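One way both input kinds could be supported is sketched below (a hypothetical helper, not the actual pairtools code): accept an integer index as-is, which would keep headerless files working, and resolve a string name against the #columns header line.

```python
# Hypothetical sketch (not the actual pairtools implementation): resolve a
# user-supplied chrom/pos/strand column to an integer index. An int is taken
# as-is (useful for headerless .pairs files); a string name is looked up in
# the list of columns parsed from the "#columns:" header line.

def resolve_column(col, header_columns):
    if isinstance(col, int):
        return col
    try:
        return header_columns.index(col)
    except ValueError:
        raise KeyError(f"column {col!r} not found in header {header_columns}")

header = ["readID", "chrom1", "pos1", "chrom2", "pos2", "strand1", "strand2"]
print(resolve_column("pos1", header))  # → 2
print(resolve_column(4, header))       # → 4
```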

@Phlya
Member Author

Phlya commented Nov 22, 2022

See Sasha's comment here #161 (comment)

@Phlya
Member Author

Phlya commented Nov 25, 2022

@golobor have you managed to try it with your data where you needed this?

@golobor
Member

golobor commented Nov 25, 2022 via email

@Phlya
Member Author

Phlya commented Nov 25, 2022

Did you remember to sort by the right columns? And I wonder then whether with very high % duplication it might be much slower...

@MasterChief1O7

Hi, here are the profiler reports for the old and new ways of deduplication. I think the long runtime is mainly due to the higher number of duplicates, as mentioned in this thread, because everything else looks the same except the timings.

I sorted it with pairtools sort, so it is not sorted w.r.t. the restriction fragment columns - but ideally shouldn't that be the same, since the fragment order follows the fragments' locations? Also, pairtools sort doesn't have an option to specify column names, but if needed I can sort manually and then rerun the deduplication - let me know.

here are the commands that I used:

sname="nuclei_1_R1.lane1.ce11.01.sorted.pairs.restricted.gz"

python -m profile -s cumtime ~/.conda/envs/pairtools_test/bin/pairtools dedup \
    --max-mismatch 3 \
    --mark-dups \
    --output \
        >( pairtools split \
            --output-pairs test.nodups.pairs.gz \
            --output-sam test.nodups.bam \
         ) \
    --output-unmapped \
        >( pairtools split \
            --output-pairs test.unmapped.pairs.gz \
            --output-sam test.unmapped.bam \
         ) \
    --output-dups \
        >( pairtools split \
            --output-pairs test.dups.pairs.gz \
            --output-sam test.dups.bam \
         ) \
    --output-stats test.dedup.stats \
    $sname > pairtools_profile_old.txt
    
echo "done with old way"

python -m profile -s cumtime ~/.conda/envs/pairtools_test/bin/pairtools dedup \
    --max-mismatch 3 \
    --mark-dups \
    --p1 "rfrag1"\
    --p2 "rfrag2"\
    --output \
        >( pairtools split \
            --output-pairs test.nodups.pairs.gz \
            --output-sam test.nodups.bam \
         ) \
    --output-unmapped \
        >( pairtools split \
            --output-pairs test.unmapped.pairs.gz \
            --output-sam test.unmapped.bam \
         ) \
    --output-dups \
        >( pairtools split \
            --output-pairs test.dups.pairs.gz \
            --output-sam test.dups.bam \
         ) \
    --output-stats test.dedup.stats \
    $sname > pairtools_profile_new.txt
    

I have also attached the bash script with both commands that I used, and a sample pairs file too (I generated it manually using pandas, sorting w.r.t. [chrom1,chrom2,pos1,pos2] as I think pairtools sort does, because the head command was only giving me multimappers and walks). If you want to test something else, please feel free to ask.

pairs_sample.pairs.txt
pairtools_profile_new.txt
pairtools_profile_old.txt
dedup_pairsam.sh.txt

@Phlya
Member Author

Phlya commented Nov 28, 2022

Wow that is a huge slow-down! Thanks for the report. Would be good to profile the KDTree itself in regard to its performance with different % duplicates...

Ah, and by the way, when using restriction fragments I would use --max-mismatch 0 - you don't want to consider neighbouring fragments as duplicates. That in itself might help a little with the speed, too.

For the sorting, it should be the same, you are right! But perhaps we should allow column names in sort too, anyway.
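The matching criterion being discussed can be sketched roughly as follows (a hypothetical, quadratic-time simplification; the actual backends use a KD-tree or the cython streaming algorithm for the neighbour search). It shows why --max-mismatch 0 matters with fragment-based positions: any tolerance above zero would merge pairs mapped to neighbouring fragments.

```python
# Hypothetical simplification of distance-based duplicate marking: two pairs
# are duplicates if both sides agree in chrom and strand and the positions
# differ by at most max_mismatch on each side.

def mark_duplicates(pairs, max_mismatch=3):
    """pairs: list of (chrom1, pos1, strand1, chrom2, pos2, strand2).
    Returns one bool per pair; True marks a duplicate of an earlier pair."""
    kept = []
    dup_flags = []
    for p in pairs:
        is_dup = any(
            p[0] == k[0] and p[2] == k[2] and p[3] == k[3] and p[5] == k[5]
            and abs(p[1] - k[1]) <= max_mismatch
            and abs(p[4] - k[4]) <= max_mismatch
            for k in kept
        )
        dup_flags.append(is_dup)
        if not is_dup:
            kept.append(p)
    return dup_flags

pairs = [
    ("chr1", 100, "+", "chr2", 500, "-"),
    ("chr1", 102, "+", "chr2", 501, "-"),  # within 3 bp on both sides
    ("chr1", 100, "+", "chr2", 500, "-"),  # exact repeat
]
print(mark_duplicates(pairs, max_mismatch=3))  # → [False, True, True]
print(mark_duplicates(pairs, max_mismatch=0))  # → [False, False, True]
```

With fragment IDs in the position columns, the second pair here would correspond to a contact on a neighbouring fragment, which max-mismatch 0 correctly keeps as a distinct pair.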

@Phlya
Member Author

Phlya commented Nov 28, 2022

Ah, what are the actual numbers that you get with the two approaches? I.e. how many duplicates, and total pairs?

@MasterChief1O7

Hi, here are the stats and profiler files for both cases. This time I used --max-mismatch 0 for the new case; it did reduce the time, but it's still quite long compared to the old method. Also, if you need the whole pairs file, let me know - I can share it on Slack.

pairtools_profile_new.txt
pairtools_profile_old.txt
pairtools_new.dedup.stats.txt
pairtools_old.dedup.stats.txt

@Phlya
Member Author

Phlya commented Nov 28, 2022

Thank you! Yeah, could you share the full file please?

@golobor
Member

golobor commented Nov 28, 2022 via email

@Phlya
Member Author

Phlya commented Nov 28, 2022

Why would you do this when you are using annotated restriction sites?..

@Phlya
Member Author

Phlya commented Nov 28, 2022

@MasterChief1O7 can you try just running the "new" version with --backend cython?

@Phlya
Member Author

Phlya commented Nov 28, 2022

It's only a temporary solution in the long run, but it works for me, and it's even faster than the "old" way.

@MasterChief1O7

Wow, true! Just tried it and it worked. One small thing I noticed: the stats are slightly different, just by ~100 I guess - is that an issue? But some of the stats are quite different (like the cis_ ones).

I just added --backend cython and nothing else:

sname="nuclei_1_R1.lane1.ce11.01.sorted.pairs.restricted.gz"

python -m profile -s cumtime ~/.conda/envs/pairtools_test/bin/pairtools dedup \
    --max-mismatch 3 \
    --mark-dups \
    --backend cython \
    --output \
        >( pairtools split \
            --output-pairs pairtools_old_cython.nodups.pairs.gz \
            --output-sam pairtools_old_cython.nodups.bam \
         ) \
    --output-unmapped \
        >( pairtools split \
            --output-pairs pairtools_old_cython.unmapped.pairs.gz \
            --output-sam pairtools_old_cython.unmapped.bam \
         ) \
    --output-dups \
        >( pairtools split \
            --output-pairs pairtools_old_cython.dups.pairs.gz \
            --output-sam pairtools_old_cython.dups.bam \
         ) \
    --output-stats pairtools_old_cython.dedup.stats \
    $sname > pairtools_profile_old_cython.txt
    
echo "done with old way"

python -m profile -s cumtime ~/.conda/envs/pairtools_test/bin/pairtools dedup \
    --max-mismatch 0 \
    --mark-dups \
    --backend cython \
    --p1 "rfrag1"\
    --p2 "rfrag2"\
    --output \
        >( pairtools split \
            --output-pairs pairtools_new_cython.nodups.pairs.gz \
            --output-sam pairtools_new_cython.nodups.bam \
         ) \
    --output-unmapped \
        >( pairtools split \
            --output-pairs pairtools_new_cython.unmapped.pairs.gz \
            --output-sam pairtools_new_cython.unmapped.bam \
         ) \
    --output-dups \
        >( pairtools split \
            --output-pairs pairtools_new_cython.dups.pairs.gz \
            --output-sam pairtools_new_cython.dups.bam \
         ) \
    --output-stats pairtools_new_cython.dedup.stats \
    $sname > pairtools_profile_new_cython.txt

[screenshot attached]

@Phlya
Member Author

Phlya commented Nov 28, 2022

Mh, with --max-mismatch 0 there shouldn't be any difference, I think... Such a huge effect on cis_Xkb+ is weird! It might be saving the columns in the wrong order or something like that...

@golobor
Member

golobor commented Nov 28, 2022

Why would you do this when you are using annotated restriction sites?..

b/c the site of the ligation junction is typically not sequenced ("unconfirmed" ligations, as we call them in a parallel discussion), so we can't be sure where exactly the restriction happened. With 4 bp cutters, consecutive restriction sites can be as close as tens of bp apart...

@Phlya
Member Author

Phlya commented Nov 28, 2022

Ok, then actually, you could use only confirmed ligations! That was Sasha's @agalitsyna idea, which really made the data much cleaner.

@agalitsyna
Member

It's weird that the total number of cis interactions has not really changed (+100), but all the cis_X+ stats have dropped substantially. Looks like one of the backends might interact differently with the PairCounter that produces the stats, doesn't it?

@agalitsyna
Member

@Phlya It might be happening here:

int(cols[p1ind]),

p1 is passed as position to cython backend, while pandas-based backend always adds "pos1" and "pos2" to the PairCounter here:
def add_pairs_from_dataframe(self, df, unmapped_chrom="!"):
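A hypothetical illustration of how such a column mix-up would produce exactly this symptom (function name and thresholds are made up for the example): the cis_Xkb+ bins count cis pairs by genomic separation, so if restriction fragment IDs are fed in as positions, apparent separations shrink to a few units and the long-range bins collapse, while the total cis count is untouched.

```python
# Made-up helper for illustration: count how many distance thresholds a
# cis pair exceeds. This mimics how cis_Xkb+ style stats are binned.

def cis_bins_exceeded(pos1, pos2,
                      thresholds=(1_000, 2_000, 4_000, 10_000, 20_000, 40_000)):
    dist = abs(pos2 - pos1)
    return sum(dist >= t for t in thresholds)

# with genomic coordinates, a 25 kb cis contact lands in the long-range bins
print(cis_bins_exceeded(7_100_000, 7_125_000))  # → 5
# with fragment IDs mistakenly used as positions, it looks ultra-short-range
print(cis_bins_exceeded(19_520, 19_523))        # → 0
```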

@Phlya
Member Author

Phlya commented Nov 28, 2022

Yeah that's exactly the kind of thing I was thinking about, but haven't managed to look for it! Thanks, I'll investigate/fix this.

@golobor golobor closed this Nov 28, 2022
@golobor golobor reopened this Nov 28, 2022
@golobor
Member

golobor commented Nov 28, 2022

Ok, then actually, you could use only confirmed ligations!

Use them for what?.. I'm not sure if I follow - we already have rather little data due to single cell resolution, it doesn't feel like a good idea to throw away a major fraction of it...

@Phlya
Member Author

Phlya commented Nov 28, 2022

Ok, then actually, you could use only confirmed ligations!

Use them for what?.. I'm not sure if I follow - we already have rather little data due to single cell resolution, it doesn't feel like a good idea to throw away a major fraction of it...

I guess it depends on how deeply you sequenced your libraries, but generally, since the material is amplified with phi29 and randomly fragmented, if you sequence deep enough you will directly read through every ligation that happened. Then you don't actually lose any data; you are just sure that every contact you describe comes from a ligation and not from polymerase hopping (since you actually filter by the presence of the ligation site sequence at the apparent ligation junction).

@Phlya
Member Author

Phlya commented Dec 2, 2022

Fixed now! @MasterChief1O7

@Phlya
Member Author

Phlya commented Dec 2, 2022

And also found a small preexisting bug where dedup was losing a small portion of pairs, fixed here...

@Phlya
Member Author

Phlya commented Dec 2, 2022

Actually there are still very strange discrepancies between cython and scipy/sklearn even in just mapping stats... which is bizarre!

cython

total	1498859
total_unmapped	50862
total_single_sided_mapped	63125
total_mapped	1384872
total_dups	1304062
total_nodups	80810
cis	71230
trans	9580

scipy

total	1498859
total_unmapped	50862
total_single_sided_mapped	62525
total_mapped	1385472
total_dups	1304062
total_nodups	81410
cis	71830
trans	9580
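For what it's worth, the discrepancy in the two tables above is internally consistent: the cython run classifies 600 more pairs as single-sided, and exactly those 600 pairs are missing from its total_mapped, total_nodups and cis counts. A quick arithmetic check using the numbers as reported:

```python
# Stats copied from the two tables above; deltas are scipy minus cython.
cython = dict(total_single_sided_mapped=63125, total_mapped=1384872,
              total_nodups=80810, cis=71230, trans=9580)
scipy = dict(total_single_sided_mapped=62525, total_mapped=1385472,
             total_nodups=81410, cis=71830, trans=9580)

deltas = {k: scipy[k] - cython[k] for k in cython}
print(deltas)
# → {'total_single_sided_mapped': -600, 'total_mapped': 600,
#    'total_nodups': 600, 'cis': 600, 'trans': 0}
```

So the two backends disagree on the same 600 cis pairs being single-sided vs. both-sides mapped, while total (1498859) and total_dups (1304062) agree exactly.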

@Phlya
Member Author

Phlya commented Dec 6, 2022

And the pair types also differ just slightly...

[Screenshot 2022-12-06 at 15:39:33 attached]

@Phlya Phlya changed the title Allow column names in all dedup backends Allow column names in all dedup backends, and allow sorting by arbitrary columns Dec 6, 2022
@golobor
Member

golobor commented Mar 9, 2024

@Phlya , sorry for a long break - what's the status of this? Is it ready to be merged?

@Phlya
Member Author

Phlya commented Mar 9, 2024

Oh, I don't remember anymore either, @golobor. I think the functionality here was all good to go, but there were some weird things I discovered that are described above...

@golobor
Member

golobor commented Mar 16, 2024

So, indeed, some lines become duplicated in the scipy engine:

(main) [anton.goloborodko@clip-login-0 test_dedup]$ cat pairs_sample.pairs.dedup_scipy.csv | grep NB551430:778:HNLWKBGXM:3:12508:20725:4668
NB551430:778:HNLWKBGXM:3:12508:20725:4668       chrV    7125494 chrV    7125100 -       +       UU      60      60      19523   7123598 7126019 19523   7123598 7126019
NB551430:778:HNLWKBGXM:3:12508:20725:4668       chrV    7125494 chrV    7125100 -       +       UU      60      60      19523   7123598 7126019 19523   7123598 7126019
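A quick way to surface this symptom across the whole output is to count readIDs (hypothetical snippet; the first two sample lines are abridged from the output above, the third is a made-up non-duplicated read):

```python
# Hypothetical check: find readIDs emitted more than once in deduplicated
# output, which should never happen after dedup.
from collections import Counter

dedup_lines = [
    "NB551430:778:HNLWKBGXM:3:12508:20725:4668\tchrV\t7125494\tchrV\t7125100\t-\t+\tUU",
    "NB551430:778:HNLWKBGXM:3:12508:20725:4668\tchrV\t7125494\tchrV\t7125100\t-\t+\tUU",
    "NB551430:778:HNLWKBGXM:3:11111:11111:1111\tchrV\t1000000\tchrV\t1000500\t+\t-\tUU",  # made-up read
]
counts = Counter(line.split("\t")[0] for line in dedup_lines)
repeated = {rid: n for rid, n in counts.items() if n > 1}
print(repeated)  # → {'NB551430:778:HNLWKBGXM:3:12508:20725:4668': 2}
```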

@Phlya
Member Author

Phlya commented Mar 16, 2024

Always? Or after this PR?

Member

@golobor golobor left a comment


all is well now

@golobor golobor merged commit 91b40fd into master Mar 17, 2024
5 checks passed
@golobor golobor deleted the dedup-cols branch March 17, 2024 13:37
Successfully merging this pull request may close these issues.

Allow arbitrary columns in dedup