Improvements to Mutect2's Permutect training data mode #8663

davidbenjamin · 2024-01-22T19:17:15Z

@LeeTL1220 here are the changes I was talking about. The first commit contains a small change I have been using for a year, the second is the recent multiallelic indel representation bug fix.

…nt M3 dataset modes -- Illumina and Ultima

LeeTL1220

Very minor comments.

LeeTL1220 · 2024-01-23T20:22:16Z

src/main/java/org/broadinstitute/hellbender/tools/walkers/mutect/M2ArgumentCollection.java

+    public Mutect3DatasetMode mutect3DatasetMode = Mutect3DatasetMode.ILLUMINA;
+
+    public enum Mutect3DatasetMode {
+        ILLUMINA(11),


I get a little nervous when you hardcode the sequencing technology, especially when both modes are given the same value. Why not just make this a default? What is the justification for this? (We do not need to block this PR over this.)

Since Ultima uses unpaired reads it can't be encoded in the same way as Illumina data. It's only a coincidence that the dimension of read tensors is 11 in both cases.

(And note that this is an Advanced argument)
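To illustrate the reply above, here is a minimal sketch of an enum whose two modes share a tensor dimension but not an encoding. Only `ILLUMINA(11)` appears in the diff; the `ULTIMA` constant's constructor call, the `pairedEnd` flag, and every method name below are assumptions made for illustration, not the actual GATK code.

```java
final class DatasetModeSketch {

    /** Hypothetical stand-in for the enum in the diff; only ILLUMINA(11) is shown there. */
    enum Mode {
        ILLUMINA(11, true),
        ULTIMA(11, false);  // 11 per the reply above: same size, different meaning

        private final int numReadFeatures;  // assumed field name
        private final boolean pairedEnd;    // assumed flag, not in the diff

        Mode(final int numReadFeatures, final boolean pairedEnd) {
            this.numReadFeatures = numReadFeatures;
            this.pairedEnd = pairedEnd;
        }

        int getNumReadFeatures() { return numReadFeatures; }

        boolean isPairedEnd() { return pairedEnd; }
    }

    /** Allocates one read's feature vector; its length alone does not say how to fill it. */
    static float[] newReadVector(final Mode mode) {
        return new float[mode.getNumReadFeatures()];
    }

    /** Hypothetical dispatch: the encoding is chosen by mode, never by vector length. */
    static String describeEncoding(final Mode mode) {
        switch (mode) {
            case ILLUMINA: return "paired-end encoding";
            case ULTIMA: return "unpaired encoding";
            default: throw new IllegalStateException("unknown mode " + mode);
        }
    }

    public static void main(final String[] args) {
        for (final Mode mode : Mode.values()) {
            System.out.println(mode + ": " + newReadVector(mode).length
                    + " features, " + describeEncoding(mode));
        }
    }
}
```

Keeping the dimension on each constant means a later change to one technology's feature count would not silently affect the other.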

LeeTL1220 · 2024-01-23T20:23:14Z

src/main/java/org/broadinstitute/hellbender/tools/walkers/mutect/Mutect3DatasetEngine.java

+        final Allele longestRef = allRefAlleles.stream().sorted(Comparator.comparingInt(Allele::length).reversed()).findFirst().get();
+
+        final List<Allele> remappedAltAlleles = ReferenceConfidenceVariantContextMerger.remapAlleles(vc, longestRef).stream()
+                .skip(1).toList();


Nit: Is skip(1) just because we need to skip the reference allele? If so, just add a comment that you are making this assumption.
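For readers unfamiliar with the remapping in the diff, the following is a rough sketch of the idea using plain strings instead of htsjdk `Allele` objects. It assumes, as the review comment suggests, that the remapped list begins with the reference allele, which is why the first element is dropped. The class and method names are hypothetical, and the sketch ignores special cases such as symbolic alleles.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class AlleleRemapSketch {

    /** Picks the longest reference representation among overlapping variants. */
    static String longestRef(final List<String> refAlleles) {
        return refAlleles.stream()
                .max(Comparator.comparingInt(String::length))
                .orElseThrow(() -> new IllegalArgumentException("no reference alleles"));
    }

    /**
     * Re-expresses a variant's alleles against a longer reference allele by
     * appending the reference bases that the shorter representation lacks.
     * The returned list starts with the (longest) reference allele.
     */
    static List<String> remapAlleles(final String ref, final List<String> alts, final String longestRef) {
        if (!longestRef.startsWith(ref)) {
            throw new IllegalArgumentException("longest ref must extend the variant's ref");
        }
        final String extraBases = longestRef.substring(ref.length());
        final List<String> result = new ArrayList<>();
        result.add(longestRef);
        for (final String alt : alts) {
            result.add(alt + extraBases);
        }
        return result;
    }

    /** Drops element 0, which is assumed to be the reference allele, leaving only alts. */
    static List<String> remappedAlts(final String ref, final List<String> alts, final String longestRef) {
        final List<String> all = remapAlleles(ref, alts, longestRef);
        return new ArrayList<>(all.subList(1, all.size()));
    }
}
```

With a longest reference of `ACT`, the SNV `A` to `G` becomes `ACT` to `GCT`, while the deletion `ACT` to `A` is already in that representation and is left unchanged.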

davidbenjamin · 2024-01-26T17:33:40Z

@LeeTL1220 Back to you. Note one extra commit setting the default ratio of non-artifact training examples to artifact examples to 1 instead of 20. I checked and this has no effect on the quality of the trained model while vastly decreasing training time.
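As a rough illustration of what such a ratio controls, the sketch below caps the number of non-artifact examples kept per artifact example seen so far. This is one simple way to enforce a ratio and is not taken from the dataset engine; the class name, method name, and the running-count policy are all assumptions.

```java
import java.util.ArrayList;
import java.util.List;

final class ExampleRatioSketch {

    /**
     * Returns the indices of examples to keep. Every artifact example is kept;
     * a non-artifact example is kept only while the number already kept is
     * below (artifacts seen so far) * nonArtifactPerArtifact.
     */
    static List<Integer> keptIndices(final List<Boolean> isArtifact, final int nonArtifactPerArtifact) {
        final List<Integer> kept = new ArrayList<>();
        long artifactsSeen = 0;
        long nonArtifactsKept = 0;
        for (int i = 0; i < isArtifact.size(); i++) {
            if (isArtifact.get(i)) {
                artifactsSeen++;
                kept.add(i);
            } else if (nonArtifactsKept < artifactsSeen * nonArtifactPerArtifact) {
                nonArtifactsKept++;
                kept.add(i);
            }
        }
        return kept;
    }
}
```

Under this policy, lowering the ratio from 20 to 1 shrinks the training set mainly by discarding non-artifact examples, which is consistent with the shorter training time described above.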

davidbenjamin added 2 commits January 2, 2024 17:14

include normal seq error log likelihood in Permutect dataset, differe…

ac97712

…nt M3 dataset modes -- Illumina and Ultima

handle different allele representations in multiallelic / indel variants

4fe76ce

davidbenjamin assigned LeeTL1220 Jan 22, 2024

davidbenjamin requested a review from LeeTL1220 January 22, 2024 21:16

LeeTL1220 reviewed Jan 23, 2024

davidbenjamin added 2 commits January 26, 2024 12:30

comment

306ccdf

set the default artifact to non-artifact ratio to 1

fb2eedd

LeeTL1220 approved these changes Jan 26, 2024

davidbenjamin merged commit 2d50cf8 into master Jan 26, 2024
20 checks passed

davidbenjamin deleted the db_permutect_training_data branch January 26, 2024 18:57
