Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixed an issue of data that contain tp tag sometimes recognized as flow #8337

Conversation

ilyasoifer
Copy link
Collaborator

Fixes #8335

@droazen
Copy link
Contributor

droazen commented May 22, 2023

Thanks @ilyasoifer ! Was the test bam you added aligned with minimap2? If not, we should make sure to add at least one simple HaplotypeCaller regression test that takes an actual minimap2-aligned input.

@ilyasoifer
Copy link
Collaborator Author

@droazen the bam that I added was aligned with minimap2. However, it is a really small BAM, so I could not run it through the integration tests. It sounds useful to add minimap2-aligned bam to HaplotypeCallerIntegrationTest. Can you or @jamesemery point me to a suitable BAM for that?

@jamesemery
Copy link
Collaborator

@ilyasoifer do you have the means to quickly run minimap2? I would recommend simply realigning src/test/resources/large/CEUTrio.HiSeq.WGS.b37.NA12878.20.21.bam with minimap2 and making a quick "are tests consistent with pervious versions" test with a checked in vcf output. I don't know how to wrangle minimap2 to handle mates correctly however so i don't know if this is easy

Copy link
Collaborator

@jamesemery jamesemery left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ilyasoifer Thank you for doing this. A few more defensive checks and future-proofing to hopefully save us from having this headache again in the future. Could you realign our HaplotypeCaller input bam snippets with minimap 2 and make a quick integration test asserting that they still funciton?

if (readGroup.getPlatform()==null){
isFlowPlatform = false;
} else {
isFlowPlatform = NGSPlatform.valueOf(readGroup.getPlatform()).getSequencerType() == SequencerFlowClass.FLOW;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know these SequencerFlowClass objects exist and in theory should handle other techonologies, but the codepaths here really only support Ultima. I would change this to check for the Ultima readgroup (so treat potential Iontorrent or 454 data which is also flow) seperately for now so long as they aren't tested. This seems like a recipe for those technologies falling over.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Old Ultima data used PL:LS454 because ULTIMA was not a legal platform. I think I prefer allowing LS454 and ULTIMA to work this way given low chance anyone is now running LS454 data

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you special case the PL:LS454 platform and give an explanation about it here? Again, I want to be as specific as possible about the inputs here since this is a greedy non opt-in behavior we should be careful we don't start clobbering somebodies inputs in all cases. If it IS labeled as 454 data can you check for the other fields that should make it readable as ultima data?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added tests

isFlowPlatform = NGSPlatform.valueOf(readGroup.getPlatform()).getSequencerType() == SequencerFlowClass.FLOW;
}
if (isFlowPlatform) {
Utils.nonNull(readGroup);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that this is probably the first line of defense for users submitting mis-formatted flow based data to the HaplotypeCaller, can you upgrade these non-null checks to be a little more stringent about expected flow-based header lines and throw exceptions with UserException enforcing the presence of those fields and printing out the readgroup header line in the error message? (and add a quick unit test for the flow platform construction)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I fixed this, can you check?

SAMReadGroupRecord rg1 = header.getReadGroup("test");
SAMRecord read = reader.iterator().next();
GATKRead gread = new SAMRecordToGATKReadAdapter(read);
assert(FlowBasedReadUtils.hasFlowTags(gread));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you record the actual readgroups from that file in a comment here and add a bit of explanation to this test?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

final byte minTailQualityToUse = errorCorrectReads ? HaplotypeCallerEngine.MIN_TAIL_QUALITY_WITH_ERROR_CORRECTION : minTailQuality;

final List<GATKRead> readsToUse = new ArrayList<>();
final List<GATKRead> hardClippedReadsToUse = new ArrayList<>();

for (final GATKRead originalRead : region.getReads()) {

boolean isFlowBasedData = false;
if (FlowBasedReadUtils.hasFlowTags(originalRead)) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be condensed into a single method so external tools don't have to know about the readgroup info caching? Can you make a method FlowBasedReadUtils.isThisFlowBased() that extracts this entire operation away and handles your readgroup caching internally?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

@ilyasoifer ilyasoifer requested review from jamesemery and droazen May 24, 2023 06:16
@ilyasoifer
Copy link
Collaborator Author

@droazen , @jamesemery - I think I addressed your comments, back to you now.
Looking at this deeper, I think we will anyway remove the "removeUncertainFlows" options that triggered all this
It resulted in a need to fix a bug we had in a basecalling, but it had been long resolved.

In any case, I think that the check is more robust now, and it was a good fix.

@ilyasoifer
Copy link
Collaborator Author

@ilyasoifer do you have the means to quickly run minimap2? I would recommend simply realigning src/test/resources/large/CEUTrio.HiSeq.WGS.b37.NA12878.20.21.bam with minimap2 and making a quick "are tests consistent with pervious versions" test with a checked in vcf output. I don't know how to wrangle minimap2 to handle mates correctly however so i don't know if this is easy

Added test like this

// we require also FO field to consider ReadGroup to come from the flow
boolean tmp = (NGSPlatform.fromReadGroupPL(readGroup.getPlatform()) == NGSPlatform.LS454) ||
(NGSPlatform.fromReadGroupPL(readGroup.getPlatform()) == NGSPlatform.ULTIMA);
tmp = tmp & (readGroup!=null);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

readGroup!=null here after referencing it several times above?

Copy link
Collaborator

@jamesemery jamesemery left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few minor comments and one test update and then i think it looks good. Thank you for the quick turnaround on fixing this!

return rec.hasAttribute(FlowBasedRead.FLOW_MATRIX_TAG_NAME)
|| rec.hasAttribute(FlowBasedRead.FLOW_MATRiX_OLD_TAG_KR)
|| rec.hasAttribute(FlowBasedRead.FLOW_MATRiX_OLD_TAG_TI);

}

public static boolean isFlowPlatform(final SAMFileHeader hdr, final GATKRead read) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you just add a javadoc to this?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And also possibly update the class javadoc to point people to this method as the definitive "is this flow based and should i trigger that code" method

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

/*
* Test that minimap2 data are supported and consistent with past results
*/
public void testVCFModeMinimap2IsConsistentWithPastResults() throws Exception {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this doesn't get run without adding the @test tag

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

tests.add(new Object[]{hdr, read3, false});

GATKRead read4 = makeRead(new byte[]{'T','A','G','C','G','A'}, false);
read4.setAttribute("tp",new byte[6]);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test looks good, can you add in a check for the special cases where Platform=454 +FO tag present and Platform=454 + no FO tag present?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

isFlowPlatform = tmp;

//in addition validate that the ultima data has FO tag, otherwise break
if ((NGSPlatform.fromReadGroupPL(readGroup.getPlatform()) == NGSPlatform.ULTIMA) & (!isFlowPlatform)){
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the logic here is a little confusing though I believe it accomplishes what you want. You might consider refactoring this a little so its cleaner but its not critical.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

Copy link
Collaborator

@jamesemery jamesemery left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few minor comments and one test update and then i think it looks good. Thank you for the quick turnaround on fixing this!

@ilyasoifer
Copy link
Collaborator Author

@jamesemery I believe I addressed your comments, PTAL

@ilyasoifer ilyasoifer requested a review from jamesemery May 24, 2023 19:15
Copy link
Collaborator

@jamesemery jamesemery left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One FINAL gripe and then its good to go. Thank you for going back and forth on this.

isFlowPlatform = false;
} else if (NGSPlatform.fromReadGroupPL(readGroup.getPlatform())==NGSPlatform.UNKNOWN){
isFlowPlatform = false;
} else if (NGSPlatform.fromReadGroupPL(readGroup.getPlatform()) == NGSPlatform.LS454) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know 454 data is ancient at this point but i would still prefer this LS454 case to explicitly check for

if (readGroup.getFlowOrder()!=null) {
                    isFlowPlatform = true; else false;

I'm not sure i trust the Ultima flow ordering code is going to handle 454 data well without testing it

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is not this check is supposed to do. Old Ultima data were labeled LS454 and initially did not have the FO tag. Afterwards the tag was added .
The full check is done by isFlowPlatform function that also verifies that the reads have tp tag.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh i see... horrifying... well i'm alright to let this through in this case then

Copy link
Collaborator

@jamesemery jamesemery left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One FINAL gripe and then its good to go. Thank you for going back and forth on this.

@ilyasoifer ilyasoifer requested a review from jamesemery May 25, 2023 19:26
@ilyasoifer ilyasoifer force-pushed the ilyasoifer/fix.tp.requirement.issue-8335 branch from 5395d75 to 5ce230a Compare May 25, 2023 19:35
Copy link
Collaborator

@jamesemery jamesemery left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright this should be good 👍

@ilyasoifer
Copy link
Collaborator Author

Thanks for the help!

@ilyasoifer ilyasoifer merged commit 4456792 into broadinstitute:master May 26, 2023
@ilyasoifer ilyasoifer deleted the ilyasoifer/fix.tp.requirement.issue-8335 branch May 26, 2023 04:34
@AishaShah
Copy link

Is this fixed in version 4.4.0.0 of gatk?

@droazen
Copy link
Contributor

droazen commented Jul 13, 2023

@AishaShah No, this fix will be part of the next GATK release, which we expect to go out shortly (within the next several weeks).

@rmormando
Copy link

Hi! I know this issue is closed and a solution has been found but I am facing this same issue right now and wanted to follow up and see if it's been patched in the current 4.4.0.0 version of gatk? Or is the new release expected to be published soon?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants