Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Updated MarkDuplicates Scoring and Comparison code to reflect mismatches to picard #5023

Merged
merged 4 commits into from
Aug 15, 2018

Conversation

jamesemery
Copy link
Collaborator

After delving into picard and MarkDuplicates I have found two more sources of inconsistencies, first we were preserving the original mapped start position which we used for tie-breaking. This was unnecessary and caused problems at sites with very high depth. Additionally, our scoring code was out of step with the equivalent code in htsjdk which packed its results into shorts rather than ints.

@jamesemery jamesemery requested a review from lbergelson July 17, 2018 18:32
@codecov-io
Copy link

codecov-io commented Jul 17, 2018

Codecov Report

Merging #5023 into master will decrease coverage by 0.002%.
The diff coverage is 96.429%.

@@              Coverage Diff               @@
##             master     #5023       +/-   ##
==============================================
- Coverage     86.49%   86.488%   -0.002%     
- Complexity    29203     29270       +67     
==============================================
  Files          1814      1814               
  Lines        135364    135570      +206     
  Branches      15042     15069       +27     
==============================================
+ Hits         117077    117252      +175     
- Misses        12826     12848       +22     
- Partials       5461      5470        +9
Impacted Files Coverage Δ Complexity Δ
...ead/markduplicates/sparkrecords/EmptyFragment.java 84.615% <ø> (+6.044%) 5 <0> (ø) ⬇️
...s/read/markduplicates/sparkrecords/PairedEnds.java 100% <ø> (ø) 1 <0> (ø) ⬇️
...kduplicates/MarkDuplicatesGATKIntegrationTest.java 97.826% <ø> (ø) 22 <0> (ø) ⬇️
.../markduplicates/MarkDuplicatesScoringStrategy.java 76.471% <0%> (-5.882%) 7 <0> (ø)
...r/utils/read/markduplicates/sparkrecords/Pair.java 100% <100%> (ø) 26 <1> (-1) ⬇️
...forms/markduplicates/MarkDuplicatesSparkUtils.java 91.284% <100%> (-0.079%) 65 <0> (ø)
.../pipelines/MarkDuplicatesSparkIntegrationTest.java 89.933% <100%> (+1.383%) 27 <6> (+6) ⬆️
...ils/read/markduplicates/sparkrecords/Fragment.java 92.857% <100%> (-0.893%) 6 <1> (-1)
...ats/collections/SimpleCountCollectionUnitTest.java 83.784% <0%> (-5.105%) 5% <0%> (+2%)
...ber/formats/collections/SimpleCountCollection.java 100% <0%> (ø) 12% <0%> (+1%) ⬆️
... and 4 more

Copy link
Member

@lbergelson lbergelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments about comments and names. Should there be a test for the changes in this branch? It looks like the test is dealing with the changes from the previous branch.

@@ -502,10 +502,7 @@ private PairedEndsCoordinateComparator() { }

@Override
public int compare( TransientFieldPhysicalLocation first, TransientFieldPhysicalLocation second ) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's weird that this is called PairedEndsCoordiantComparator and it no longer compares PairedEnds or coordinates.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The documentation for this is woefully out of date now.

Utils.nonNull(read);
return scoring.applyAsInt(read);
// We max out at Short.MAX_VALUE / 2 to prevent overflow if we add two very high quality/length reads into pairs
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the negative value if it fails vendor quality is strange, maybe mention that in a comment?

@@ -474,38 +480,43 @@ public static void saveMetricsRDD(final MetricsFile<GATKDuplicationMetrics, Doub
}

/**
* Comparator for sorting Reads by coordinate. Note that a header is required in
* order to meaningfully compare contigs.
* Comparator for by their PhysicalLoccation attributes and strandedness. This comparator is intended to serve as a tiebreaker
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo, Loccation

* It compares two pairedEnds objects by their first reads according to first their clipped start positions
* (they were matched together based on UnclippedStartOriginally), then the orientation of the strand, followed by
* the readname lexicographical order.
* Ordering is not the same as {@link htsjdk.samtools.SAMRecordCoordinateComparator}, except for the alignment position.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this comment is still out of date, it doesn't look at the alignment position anymore

@jamesemery jamesemery force-pushed the je_updateDefaultScoringMechanism branch from 12f4487 to 8c978ad Compare August 13, 2018 21:23
@@ -480,23 +480,20 @@ public static void saveMetricsRDD(final MetricsFile<GATKDuplicationMetrics, Doub
}

/**
* Comparator for by their PhysicalLoccation attributes and strandedness. This comparator is intended to serve as a tiebreaker
* Comparator for by their PhysicalLocation attributes and strandedness. This comparator is intended to serve as a tiebreaker
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for by their

@jamesemery jamesemery force-pushed the je_updateDefaultScoringMechanism branch from f258774 to 81bcdeb Compare August 13, 2018 21:31
… start position of mismatching reads. Additionally changed the score to be a double packed short value in order to better reflect picard scoring code.
@jamesemery jamesemery force-pushed the je_updateDefaultScoringMechanism branch from 81bcdeb to 5db98c8 Compare August 15, 2018 14:28
@jamesemery
Copy link
Collaborator Author

@lbergelson Removed the broken tests. The problem stemmed from markDuplicatesGATK being out of date. Since we are removing to remove it soon anyway i figured we would have the test only apply for markDuplicatesSpark

Copy link
Member

@lbergelson lbergelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jamesemery Sounds good 👍

@jamesemery jamesemery merged commit d5a500a into master Aug 15, 2018
@jamesemery jamesemery deleted the je_updateDefaultScoringMechanism branch August 15, 2018 19:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants