Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix for order flipping in SortingCollection used for MarkDuplicates #1945

Merged
merged 1 commit into from
Feb 29, 2024

Conversation

wook-choi
Copy link
Contributor

@wook-choi wook-choi commented Feb 22, 2024

Description

When number of reads in input file is over max value of signed int, the comparator for SortingCollection may return a value with sign opposite to the intended because the comparator calculates difference between two long numbers then down-casts into signed int.

compareDifference = (int) (lhs.read1IndexInFile - rhs.read1IndexInFile);

This comparison acts as a tie-score-breaker, so affects best read election among duplicate reads with same score.

Furthermore, because of this, same input file may result in different duplicate read set if run with different Java memory configuration - such as Xmx - which determines number of temp file chunks created by SortingCollection.
It is because order of element insertion into TreeSet which is internally used by SortingCollection for merging temp files varies by the number of temp files and the insertion order affects final order of duplicate reads.

I chose a high-depth sample with reads over 2,200,000,000, query-name sorted it in order to compare with Spark implementation and checked below cases using that sample.

  • Picard MarkDuplicates vs Picard Markduplicates with different Xmx --> DIFFERENT
  • Picard MarkDuplicates vs GATK MarkDuplicatesSpark --> DIFFERENT
  • GATK MarkDuplicatesSpark vs Picard MarkDuplicates with this fix --> IDENTICAL

Please let me know if something missed or I'm missing something.


Checklist (never delete this)

Never delete this, it is our record that procedure was followed. If you find that for whatever reason one of the checklist points doesn't apply to your PR, you can leave it unchecked but please add an explanation below.

Content

  • Added or modified tests to cover changes and any new functionality
  • Edited the README / documentation (if applicable)
  • All tests passing on github actions

Review

  • Final thumbs-up from reviewer
  • Rebase, squash and reword as applicable

For more detailed guidelines, see https://github.com/broadinstitute/picard/wiki/Guidelines-for-pull-requests

Copy link

@sanghopark-genome4me sanghopark-genome4me left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good

@wook-choi
Copy link
Contributor Author

Hi guys, how can I request for review to maintainer?
No idea how to get review process started..

@kockan kockan self-requested a review February 29, 2024 16:41
Copy link
Contributor

@kockan kockan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks reasonable, I'll go ahead and merge.

@kockan kockan merged commit 6fe8547 into broadinstitute:master Feb 29, 2024
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants