Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improvements to Mutect2's Permutect training data mode #8663

Merged
merged 4 commits into from
Jan 26, 2024

Conversation

davidbenjamin
Copy link
Contributor

@davidbenjamin davidbenjamin commented Jan 22, 2024

@LeeTL1220 here are the changes I was talking about. The first commit contains a small change I have been using for a year, the second is the recent multiallelic indel representation bug fix.

Copy link
Contributor

@LeeTL1220 LeeTL1220 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very minor comments.

public Mutect3DatasetMode mutect3DatasetMode = Mutect3DatasetMode.ILLUMINA;

public enum Mutect3DatasetMode {
ILLUMINA(11),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get a little nervous when you hardcode the sequencing technology, especially when given the same value. Why not just make this a default? What is the justification for this? (We do not need to block this PR over this)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since Ultima uses unpaired reads it can't be encoded in the same way as Illumina data. It's only a coincidence that the dimension of read tensors is 11 in both cases.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(And note that this is an Advanced argument)

final Allele longestRef = allRefAlleles.stream().sorted(Comparator.comparingInt(Allele::length).reversed()).findFirst().get();

final List<Allele> remappedAltAlleles = ReferenceConfidenceVariantContextMerger.remapAlleles(vc, longestRef).stream()
.skip(1).toList();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Is skip(1) just because we need to skip the reference allele? If so, just add a comment that you are making this assumption.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@davidbenjamin
Copy link
Contributor Author

@LeeTL1220 Back to you. Note one extra commit setting the default ratio of non-artifact training examples to artifact examples to 1 instead of 20. I checked and this has no effect on the quality of the trained model while vastly decreasing training time.

@davidbenjamin davidbenjamin merged commit 2d50cf8 into master Jan 26, 2024
20 checks passed
@davidbenjamin davidbenjamin deleted the db_permutect_training_data branch January 26, 2024 18:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants