[translate] misleading key names; differing behaviour between JSON & VCF inputs #1361

Closed
jameshadfield opened this issue Dec 18, 2023 · 1 comment
Labels: bug (Something isn't working)

jameshadfield commented Dec 18, 2023

Current Behaviour

The JSON produced by augur translate contains the following data (which are AA translations):

  1. <JSON>.reference.<gene>
  2. <JSON>.nodes.<rootNodeName>.aa_sequences.<gene> (not currently used for VCF inputs, but I have a WIP commit which adds this)

When using a JSON input to --ancestral-sequences, (1) is simply the AA sequence from the root node, i.e. (1) == (2). It is not the "reference" AA sequence, because the command has no knowledge of what the reference sequence is¹.

When using VCF inputs, a corresponding nucleotide FASTA reference input is required, and (1) is a translation of the gene's region from this sequence.
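To make the difference concrete, here is a minimal sketch in plain Python. It does not call augur; the gene name `geneA`, the node name `root`, the toy sequences and the `translate` helper are all assumptions made up for illustration. It only mimics where the "reference" value comes from in each case.

```python
from itertools import product

# Standard genetic code, built in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(nuc):
    """Translate a nucleotide string codon by codon (hypothetical helper, not augur's)."""
    return "".join(CODON_TABLE.get(nuc[i:i + 3], "X") for i in range(0, len(nuc), 3))


# Toy data (hypothetical): the root carries GCT->GTT (A2V) relative to the reference.
reference_nuc = "ATGGCTTAA"  # what a VCF workflow supplies as the FASTA reference
root_nuc = "ATGGTTTAA"       # inferred sequence at the tree root

root_aa = translate(root_nuc)

# JSON input: "reference" is just the root translation, so (1) == (2).
json_style = {
    "reference": {"geneA": root_aa},
    "nodes": {"root": {"aa_sequences": {"geneA": root_aa}}},
}

# VCF input: "reference" is the translation of the gene's region in the reference FASTA.
vcf_style = {"reference": {"geneA": translate(reference_nuc)}}

print(json_style["reference"]["geneA"])  # MV*
print(vcf_style["reference"]["geneA"])   # MA*
```

The same key therefore holds two different sequences depending on the input type.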

There are two salient points:

  • The "reference" key is a misnomer when JSON inputs are used. (For VCF inputs this is fine.) If there are mutations in ~all nodes relative to the reference then (1) will be different depending on the choice of JSON vs VCF inputs.
  • Mutations are not inferred on the root node when JSON inputs are used, because there's nothing to compare the root node to². (For VCF files mutations are inferred against the translated reference.)
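The second point can be sketched as follows. This is an illustration only, not augur's implementation: `aa_mutations` and `root_mutations` are hypothetical helpers and the sequences are toy data.

```python
def aa_mutations(parent_aa, child_aa):
    """List substitutions as '<parent><1-based position><child>', e.g. 'A2V'."""
    return [
        f"{p}{i}{c}"
        for i, (p, c) in enumerate(zip(parent_aa, child_aa), start=1)
        if p != c
    ]


def root_mutations(root_aa, reference_aa=None):
    """Mutations on the root exist only when there is a reference to compare against."""
    if reference_aa is None:
        # JSON input: nothing to compare the root to.
        return []
    # VCF input: compare the root to the translated reference.
    return aa_mutations(reference_aa, root_aa)


# Toy sequences (hypothetical): the root carries A2V relative to the reference.
reference_aa = "MA*"
root_aa = "MV*"

print(root_mutations(root_aa))                # []
print(root_mutations(root_aa, reference_aa))  # ['A2V']
```

With JSON input, A2V is never annotated anywhere on the tree, because it sits between the reference and the root.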

Expected behavior

Switching between VCF input and FASTA/JSON input to augur ancestral / augur translate should not change the inference or the mutation annotations on the tree. Ideally the outputs would also be the same, but for file size reasons this is not always possible.

How to reproduce

TODO: create a test to demonstrate this. Unfortunately the "simple-genome" tests I've recently added in PRs don't have an AA mutation shared across all sequences, which is what's needed here.
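One way to check whether a candidate test dataset has the required property is sketched below. It is not part of augur's test suite: `shared_aa_mutations`, the sample names and the sequences are hypothetical, and equal-length AA sequences are assumed.

```python
def shared_aa_mutations(reference_aa, tip_aas):
    """Return substitutions where all tips agree with each other but differ from the reference.

    Assumes every tip sequence has the same length as the reference.
    """
    shared = []
    for i, ref_state in enumerate(reference_aa, start=1):
        states = {seq[i - 1] for seq in tip_aas.values()}
        if len(states) == 1 and ref_state not in states:
            shared.append(f"{ref_state}{i}{states.pop()}")
    return shared


# Toy fixture (hypothetical): A2V is shared by all tips, K3R is found in one tip only.
reference_aa = "MAK*"
tips = {"sample_A": "MVK*", "sample_B": "MVR*", "sample_C": "MVK*"}

print(shared_aa_mutations(reference_aa, tips))  # ['A2V']
```

A dataset for which this returns an empty list cannot expose the JSON vs VCF difference described above.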

Possible solution

  • Allow JSON inputs to have a corresponding (nuc) reference sequence, and use this for the "reference" translations. (This is what VCF inputs do.) In this case we can also infer mutations on the root node. We could add an extra argument (mirroring VCF input) or use the reference.nuc key in the input JSON¹.
  • Remove the "reference" key when using JSON inputs without a provided reference sequence. This will be problematic for augur export v2 as it uses this to export root-sequences. (The names here get very confusing very fast.)
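The two options could be combined into one precedence rule, sketched below under assumptions. `choose_reference_nuc` and its `explicit_reference_nuc` parameter are hypothetical names, not existing augur code or arguments; the only detail taken from this issue is that the input JSON may carry a `reference.nuc` entry.

```python
def choose_reference_nuc(node_data, explicit_reference_nuc=None):
    """Pick the nucleotide sequence to translate for the "reference" key.

    Returns (sequence, source). The sequence is None when no reference is
    available, in which case the caller would omit the "reference" key.
    """
    if explicit_reference_nuc is not None:
        # Option 1a: a reference supplied separately, mirroring VCF input.
        return explicit_reference_nuc, "argument"
    json_reference = node_data.get("reference", {}).get("nuc")
    if json_reference is not None:
        # Option 1b: fall back to reference.nuc from the input JSON.
        return json_reference, "input JSON"
    # Option 2: no reference known, so do not emit a "reference" key.
    return None, "none"


node_data = {"reference": {"nuc": "ATGGCTTAA"}, "nodes": {}}
print(choose_reference_nuc(node_data))               # ('ATGGCTTAA', 'input JSON')
print(choose_reference_nuc(node_data, "ATGGTTTAA"))  # ('ATGGTTTAA', 'argument')
print(choose_reference_nuc({"nodes": {}}))           # (None, 'none')
```

Note the caveat in footnote 1: depending on how augur ancestral was run, `reference.nuc` may itself be the inferred root sequence, in which case the fallback would not yield any root mutations.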

Your environment: if running Nextstrain locally

augur 23.1.1

Footnotes

¹ OK, this isn't quite true. The JSON (produced by augur ancestral) will have a <JSON>.reference.nuc sequence, but augur translate never reads it. Depending on how augur ancestral was run, this may be a reference sequence or the inferred sequence at the tree root.

² I think certain invocations of augur ancestral will produce nuc mutations on the root node. I don't know what augur translate will do in this case.

jameshadfield commented Jan 25, 2024

Closed by b6d537e / PR #1355 / Augur release 24.0.0
