Fix for order flipping in SortingCollection used for MarkDuplicates #1945
Description
When the number of reads in the input file exceeds the maximum value of a signed int (2,147,483,647), the comparator used by SortingCollection may return a value whose sign is the opposite of what was intended, because it computes the difference between two long values and then down-casts the result to a signed int:
compareDifference = (int) (lhs.read1IndexInFile - rhs.read1IndexInFile);
This comparison acts as a tie-breaker, so it affects which read is elected as the best among duplicate reads with the same score.
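The sign flip can be reproduced in isolation. The following is a minimal sketch, not Picard code: the class and method names are hypothetical, and `Long.compare` is shown only as one overflow-safe alternative from the JDK, not necessarily the change made in this PR.

```java
// Hypothetical stand-alone demo; names are illustrative and not taken from Picard.
final class IndexCompareDemo {

    // The problematic pattern: subtract two longs, then narrow the result to int.
    static int castCompare(long lhs, long rhs) {
        return (int) (lhs - rhs);
    }

    // An overflow-safe alternative provided by the JDK.
    static int safeCompare(long lhs, long rhs) {
        return Long.compare(lhs, rhs);
    }

    public static void main(String[] args) {
        long first = 0L;
        long later = 3_000_000_000L; // larger than Integer.MAX_VALUE (2_147_483_647)

        // The true difference is -3_000_000_000, but only the low 32 bits survive
        // the cast, which yields the positive value 1294967296.
        System.out.println(castCompare(first, later));

        // Long.compare reports a negative value, as intended.
        System.out.println(safeCompare(first, later));
    }
}
```

With the cast, index 0 is reported as coming after index 3,000,000,000, which is the inverted ordering described above.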
Furthermore, because of this, the same input file may produce a different set of duplicate reads when run with a different Java memory configuration (such as Xmx), which determines the number of temp file chunks created by SortingCollection.
This happens because SortingCollection internally uses a TreeSet to merge the temp files. The order in which elements are inserted into that TreeSet varies with the number of temp files, and with an inconsistent comparator the insertion order affects the final order of the duplicate reads.
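The dependence on insertion order can be illustrated with a toy example. This is a sketch under simplifying assumptions: it stores plain `Long` values in a `TreeSet` rather than the record iterators that SortingCollection actually merges, and the class, field, and method names are hypothetical. The three values are chosen so that the cast-based comparator is not transitive (0 < 1.5e9 and 1.5e9 < 3e9, yet 0 > 3e9).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

// Hypothetical toy demo; it does not use SortingCollection or any Picard/htsjdk class.
final class InsertionOrderDemo {

    // Same pattern as the problematic comparison: long difference narrowed to int.
    static final Comparator<Long> CAST_COMPARATOR = (l, r) -> (int) (l - r);

    // Inserts the values into a TreeSet in the given order and returns the iteration order.
    static List<Long> orderAfterInsert(Comparator<Long> comparator, long... insertionOrder) {
        TreeSet<Long> set = new TreeSet<>(comparator);
        for (long value : insertionOrder) {
            set.add(value);
        }
        return new ArrayList<>(set);
    }

    public static void main(String[] args) {
        long a = 0L;
        long b = 1_500_000_000L;
        long c = 3_000_000_000L;

        // Same elements, different insertion order, different resulting order.
        System.out.println(orderAfterInsert(CAST_COMPARATOR, a, b, c));
        System.out.println(orderAfterInsert(CAST_COMPARATOR, b, c, a));

        // A consistent comparator gives the same order regardless of insertion order.
        System.out.println(orderAfterInsert(Long::compare, a, b, c));
        System.out.println(orderAfterInsert(Long::compare, b, c, a));
    }
}
```

With the cast-based comparator the two insertion orders yield different iteration orders for the same three values, whereas `Long::compare` yields the same order both times.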
I chose a high-depth sample with more than 2,200,000,000 reads, sorted it by query name so that the results could be compared with the Spark implementation, and checked the cases below using that sample.
Please let me know if I have missed anything.
Checklist (never delete this)
Never delete this, it is our record that procedure was followed. If you find that for whatever reason one of the checklist points doesn't apply to your PR, you can leave it unchecked but please add an explanation below.
Content
Review
For more detailed guidelines, see https://github.com/broadinstitute/picard/wiki/Guidelines-for-pull-requests