-
Notifications
You must be signed in to change notification settings - Fork 596
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Updated MarkDuplicates Scoring and Comparison code to reflect mismatches to picard #5023
Conversation
Codecov Report
@@ Coverage Diff @@
## master #5023 +/- ##
==============================================
- Coverage 86.49% 86.488% -0.002%
- Complexity 29203 29270 +67
==============================================
Files 1814 1814
Lines 135364 135570 +206
Branches 15042 15069 +27
==============================================
+ Hits 117077 117252 +175
- Misses 12826 12848 +22
- Partials 5461 5470 +9
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few comments about comments and names. Should there be a test for the changes in this branch? It looks like the test is dealing with the changes from the previous branch.
@@ -502,10 +502,7 @@ private PairedEndsCoordinateComparator() { } | |||
|
|||
@Override | |||
public int compare( TransientFieldPhysicalLocation first, TransientFieldPhysicalLocation second ) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's weird that this is called PairedEndsCoordiantComparator
and it no longer compares PairedEnds
or coordinates.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The documentation for this is woefully out of date now.
Utils.nonNull(read); | ||
return scoring.applyAsInt(read); | ||
// We max out at Short.MAX_VALUE / 2 to prevent overflow if we add two very high quality/length reads into pairs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the negative value if it fails vendor quality is strange, maybe mention that in a comment?
@@ -474,38 +480,43 @@ public static void saveMetricsRDD(final MetricsFile<GATKDuplicationMetrics, Doub | |||
} | |||
|
|||
/** | |||
* Comparator for sorting Reads by coordinate. Note that a header is required in | |||
* order to meaningfully compare contigs. | |||
* Comparator for by their PhysicalLoccation attributes and strandedness. This comparator is intended to serve as a tiebreaker |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo, Loccation
* It compares two pairedEnds objects by their first reads according to first their clipped start positions | ||
* (they were matched together based on UnclippedStartOriginally), then the orientation of the strand, followed by | ||
* the readname lexicographical order. | ||
* Ordering is not the same as {@link htsjdk.samtools.SAMRecordCoordinateComparator}, except for the alignment position. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this comment is still out of date, it doesn't look at the alignment position anymore
12f4487
to
8c978ad
Compare
@@ -480,23 +480,20 @@ public static void saveMetricsRDD(final MetricsFile<GATKDuplicationMetrics, Doub | |||
} | |||
|
|||
/** | |||
* Comparator for by their PhysicalLoccation attributes and strandedness. This comparator is intended to serve as a tiebreaker | |||
* Comparator for by their PhysicalLocation attributes and strandedness. This comparator is intended to serve as a tiebreaker |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for by their
f258774
to
81bcdeb
Compare
… start position of mismatching reads. Additionally changed the score to be a double packed short value in order to better reflect picard scoring code.
81bcdeb
to
5db98c8
Compare
@lbergelson Removed the broken tests. The problem stemmed from markDuplicatesGATK being out of date. Since we are removing to remove it soon anyway i figured we would have the test only apply for markDuplicatesSpark |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jamesemery Sounds good 👍
After delving into picard and MarkDuplicates I have found two more sources of inconsistencies, first we were preserving the original mapped start position which we used for tie-breaking. This was unnecessary and caused problems at sites with very high depth. Additionally, our scoring code was out of step with the equivalent code in htsjdk which packed its results into shorts rather than ints.