Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Updated MarkDuplicates Scoring and Comparison code to reflect mismatches to picard #5023

Merged
merged 4 commits into from
Aug 15, 2018
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -39,7 +39,7 @@ public class MarkDuplicatesSparkUtils {
// This comparator represents the tiebreaking for PairedEnds duplicate marking.
// We compare first on score, followed by unclipped start position (which is reversed here because of the expected ordering)
private static final Comparator<TransientFieldPhysicalLocation> PAIRED_ENDS_SCORE_COMPARATOR = Comparator.comparing(TransientFieldPhysicalLocation::getScore)
.thenComparing(PairedEndsCoordinateComparator.INSTANCE.reversed());
.thenComparing(TransientFieldPhysicalLocationComparator.INSTANCE.reversed());

/**
* Wrapper object used for storing an object and some type of index information.
Expand Down Expand Up @@ -480,32 +480,24 @@ public static void saveMetricsRDD(final MetricsFile<GATKDuplicationMetrics, Doub
}

/**
* Comparator for sorting Reads by coordinate. Note that a header is required in
* order to meaningfully compare contigs.
* Comparator for TransientFieldPhysicalLocation objects by their attributes and strandedness. This comparator is intended to serve as a tiebreaker
* for the score comparator.
*
* Uses the various other fields in a read to break ties for reads that share
* the same location.
* It compares two PhysicalLocation the orientation of the strand, followed by their physical location attributes,
* and finally as a final tiebreaker the readname lexicographical order.
*
* Ordering is not the same as {@link htsjdk.samtools.SAMRecordCoordinateComparator}.
* It compares two pairedEnds objects by their first reads according to first their clipped start positions
* (they were matched together based on UnclippedStartOriginally), then the orientation of the strand, followed by
* the readname lexicographical order.
*
* NOTE: Because the original records were grouped by readname, we know that they must be unique once they hit this
* NOTE: Because the original records were grouped by start position, we know that they must be unique once they hit this
* comparator and thus we don't need to worry about further tiebreaking for this method.
*/
public static final class PairedEndsCoordinateComparator implements Comparator<TransientFieldPhysicalLocation>, Serializable {
public static final class TransientFieldPhysicalLocationComparator implements Comparator<TransientFieldPhysicalLocation>, Serializable {
private static final long serialVersionUID = 1L;

public static final PairedEndsCoordinateComparator INSTANCE = new PairedEndsCoordinateComparator();
private PairedEndsCoordinateComparator() { }
public static final TransientFieldPhysicalLocationComparator INSTANCE = new TransientFieldPhysicalLocationComparator();
private TransientFieldPhysicalLocationComparator() { }

@Override
public int compare( TransientFieldPhysicalLocation first, TransientFieldPhysicalLocation second ) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's weird that this is called PairedEndsCoordiantComparator and it no longer compares PairedEnds or coordinates.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The documentation for this is woefully out of date now.

int result = Integer.compare(first.getFirstStartPosition(), second.getFirstStartPosition());
if ( result != 0 ) {
return result;
}
int result = 0;

//This is done to mimic SAMRecordCoordinateComparator's behavior
if (first.isRead1ReverseStrand() != second.isRead1ReverseStrand()) {
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -2,6 +2,7 @@

import org.broadinstitute.hellbender.utils.Utils;
import org.broadinstitute.hellbender.utils.read.GATKRead;
import org.broadinstitute.hellbender.utils.read.ReadUtils;

import java.util.function.ToIntFunction;

Expand All @@ -25,10 +26,13 @@ public enum MarkDuplicatesScoringStrategy {

/**
* Given a read, compute the reads's score for mark duplicates.
*
* NOTE: this code is intended to match {@link htsjdk.samtools.DuplicateScoringStrategy.ScoringStrategy} behavior.
*/
public int score(final GATKRead read){
public short score(final GATKRead read){
Utils.nonNull(read);
return scoring.applyAsInt(read);
// We max out at Short.MAX_VALUE / 2 to prevent overflow if we add two very high quality/length reads into pairs
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the negative value if it fails vendor quality is strange, maybe mention that in a comment?

return (short) (Math.min(scoring.applyAsInt(read), Short.MAX_VALUE / 2) + (read.failsVendorQualityCheck() ? (short) (Short.MIN_VALUE / 2) : 0));
}

//Bases below this quality will not be included in picking the best read from a set of duplicates.
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -40,7 +40,7 @@ public Type getType() {
return Type.EMPTY_FRAGMENT;
}
@Override
public int getScore() {
public short getScore() {
return 0;
}
@Override
Expand All @@ -49,10 +49,6 @@ public ReadsKey key() {
return key;
}
@Override
public int getFirstStartPosition() {
throw new UnsupportedOperationException("Empty fragments do not support requests for positional information");
}
@Override
public boolean isRead1ReverseStrand() {
return R1R;
}
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -21,15 +21,13 @@
public class Fragment extends TransientFieldPhysicalLocation {
protected transient ReadsKey key;

private final int firstStartPosition;
private final boolean R1R;

protected final int score;
protected final short score;

public Fragment(final GATKRead first, final SAMFileHeader header, int partitionIndex, MarkDuplicatesScoringStrategy scoringStrategy, Map<String, Byte> headerLibraryMap) {
super(partitionIndex, first.getName());

this.firstStartPosition = first.getAssignedStart();
this.score = scoringStrategy.score(first);
this.R1R = first.isReverseStrand();
this.key = ReadsKey.getKeyForFragment(ReadUtils.getStrandedUnclippedStart(first),
Expand All @@ -48,14 +46,10 @@ public ReadsKey key() {
return key;
}
@Override
public int getScore() {
public short getScore() {
return score;
}
@Override
public int getFirstStartPosition() {
return firstStartPosition;
}
@Override
public boolean isRead1ReverseStrand() {
return R1R;
}
Expand All @@ -65,6 +59,6 @@ public byte getOrientationForPCRDuplicates() {
}
@Override
public String toString() {
return "fragment: " + name + " " + firstStartPosition;
return "fragment: " + name;
}
}
Original file line number Diff line number Diff line change
Expand Up @@ -25,11 +25,10 @@
public final class Pair extends TransientFieldPhysicalLocation {
protected transient ReadsKey key;

private final int firstStartPosition;
private final boolean isRead1ReverseStrand;

private final boolean isRead2ReverseStrand;
private final int score;
private final short score;
private final boolean wasFlipped;

public Pair(final GATKRead read1, final GATKRead read2, final SAMFileHeader header, int partitionIndex, MarkDuplicatesScoringStrategy scoringStrategy, Map<String, Byte> headerLibraryMap) {
Expand All @@ -39,7 +38,7 @@ public Pair(final GATKRead read1, final GATKRead read2, final SAMFileHeader head
final String name2 = read2.getName();
Utils.validate(name1.equals(name2), () -> "Paired reads have different names\n" + name1 + "\n" + name2);

this.score = scoringStrategy.score(read1) + scoringStrategy.score(read2);
this.score = (short)(scoringStrategy.score(read1) + scoringStrategy.score(read2));

GATKRead first = read1;
GATKRead second;
Expand All @@ -57,7 +56,6 @@ public Pair(final GATKRead read1, final GATKRead read2, final SAMFileHeader head
first = read2;
second = read1;
}
firstStartPosition = first.getAssignedStart();

// if the two read ends are in the same position, pointing in opposite directions,
// the orientation is undefined and the procedure above
Expand Down Expand Up @@ -93,9 +91,8 @@ private Pair(Kryo kryo, Input input){
y = -1;
libraryId = -1;

score = input.readInt();
score = input.readShort();

firstStartPosition = input.readInt();
isRead1ReverseStrand = input.readBoolean();

isRead2ReverseStrand = input.readBoolean();
Expand All @@ -108,9 +105,8 @@ protected void serialize(Kryo kryo, Output output) {
output.writeInt(partitionIndex, true);
output.writeAscii(name);

output.writeInt(score);
output.writeShort(score);

output.writeInt(firstStartPosition);
output.writeBoolean(isRead1ReverseStrand);

output.writeBoolean(isRead2ReverseStrand);
Expand All @@ -129,20 +125,17 @@ public ReadsKey key() {
return key;
}
@Override
public int getScore() {
public short getScore() {
return score;
}
@Override
public int getFirstStartPosition() {
return firstStartPosition;
}

@Override
public boolean isRead1ReverseStrand() {
return isRead1ReverseStrand;
}
@Override
public String toString() {
return name + " " + firstStartPosition + " score:" + score;
return name + " score:" + score;
}

/**
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -9,8 +9,7 @@ public abstract class PairedEnds extends MarkDuplicatesSparkRecord {
super(partitionIndex, name);
}

public abstract int getFirstStartPosition();
public abstract int getScore();
public abstract short getScore();
public abstract boolean isRead1ReverseStrand();

/**
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -21,6 +21,8 @@
import org.testng.Assert;
import org.testng.annotations.DataProvider;
import org.testng.annotations.Test;
import picard.sam.util.PhysicalLocationInt;
import picard.sam.util.ReadNameParser;

import java.io.File;
import java.io.FileNotFoundException;
Expand Down Expand Up @@ -245,4 +247,38 @@ public void testNoReadGroupInRead() {
Assert.assertTrue(e.getCause() instanceof UserException.ReadMissingReadGroup);
}
}

@Test(dataProvider = "readNameData")
public void testOpticalDuplicatesTiebrokenByPhysicalLocationNotStartPosition(final String readName1, final String readName2) {
// This tests the readname based tiebreaking code in mark duplicates. Since it's ambiguous which read should be marked
// as duplicate or not if scores match we break ties by evaluating the readname for consistencies sake.

final ReadNameParser parser = new ReadNameParser();

final PhysicalLocationInt position1 = new PhysicalLocationInt();
final PhysicalLocationInt position2 = new PhysicalLocationInt();

parser.addLocationInformation(readName1, position1);
parser.addLocationInformation(readName2, position2);

final AbstractMarkDuplicatesTester tester = getTester();
tester.getSamRecordSetBuilder().setReadLength(101);
tester.setExpectedOpticalDuplicate(0);

int compare = position1.tile - position2.tile;
if (compare == 0) {
compare = position1.x - position2.x;
}

if (compare == 0) {
compare = position1.y - position2.y;
}

final boolean isDuplicate = compare < 0;

// NOTE these reads are offset slightly but should have the same unclipped start postitions
tester.addMatePair(readName1, 1,2, 46, false, false, !isDuplicate, !isDuplicate, "6S42M28S", "3S68M", true, false, false, false, false, DEFAULT_BASE_QUALITY);
tester.addMatePair(readName2, 1,2, 51, false, false, isDuplicate, isDuplicate, "6S42M28S", "8S68M", true, false, false, false, false, DEFAULT_BASE_QUALITY);
tester.runTest();
}
}
Original file line number Diff line number Diff line change
Expand Up @@ -14,6 +14,8 @@
import org.testng.annotations.BeforeClass;
import org.testng.annotations.DataProvider;
import org.testng.annotations.Test;
import picard.sam.util.PhysicalLocationInt;
import picard.sam.util.ReadNameParser;

import java.io.File;
import java.util.*;
Expand Down