[translate] misleading key names; differing behaviour between JSON & VCF inputs #1361

jameshadfield · 2023-12-18T02:01:37Z

Current Behaviour

The JSON produced by augur translate contains the following data (which are AA translations):

(1) <JSON>.reference.<gene>
(2) <JSON>.node.<rootNodeName>.aa_sequences.<gene> (not currently used for VCF inputs, but I have a WIP commit which adds this)

When using a JSON input to --ancestral-sequences, (1) is simply the AA sequence from the root node, i.e. (1) == (2). It is not the "reference" AA sequence, because the command has no knowledge of what the reference sequence is¹.

When using VCF inputs, a corresponding nucleotide FASTA reference input is required, and (1) is a translation of the gene's region from this sequence.

There are two salient points:

The "reference" key is a misnomer when JSON inputs are used. (For VCF inputs this is fine.) If there are mutations in ~all nodes relative to the reference then (1) will be different depending on the choice of JSON vs VCF inputs.
Mutations are not inferred on the root node when JSON inputs are used, because there's nothing to compare the root node to². (For VCF files mutations are inferred against the translated reference.)
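
To make the two points above concrete, here is a minimal sketch in plain Python. It does not call augur; the sequences, the toy codon table and the helper names (`translate`, `aa_mutations`) are all made up for illustration. It assumes a single gene where every node, including the root, carries a K→R change relative to the reference.

```python
# Toy codon table covering only the codons used below (not the full genetic code).
CODON_TABLE = {"ATG": "M", "AAA": "K", "AGA": "R", "TAA": "*"}


def translate(nuc):
    """Translate a nucleotide string whose length is a multiple of three."""
    return "".join(CODON_TABLE[nuc[i:i + 3]] for i in range(0, len(nuc), 3))


def aa_mutations(parent, child):
    """List substitutions such as 'K2R' (1-based) between equal-length AA sequences."""
    return [
        f"{a}{pos}{b}"
        for pos, (a, b) in enumerate(zip(parent, child), start=1)
        if a != b
    ]


# Hypothetical inputs: the root differs from the reference at codon 2,
# i.e. the mutation is shared by all nodes in the tree.
reference_nuc = "ATGAAATAA"
root_nuc = "ATGAGATAA"

# VCF-style behaviour: "reference" is a translation of the supplied reference,
# and root mutations are inferred against it.
vcf_reference = translate(reference_nuc)
vcf_root_mutations = aa_mutations(vcf_reference, translate(root_nuc))

# JSON-style behaviour: "reference" is just the root translation,
# and there is nothing to compare the root against.
json_reference = translate(root_nuc)
json_root_mutations = []

print("VCF :", vcf_reference, vcf_root_mutations)
print("JSON:", json_reference, json_root_mutations)
```

The same tree therefore yields a different "reference" value and a different set of root mutations depending only on the input format.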

Expected behavior

VCF input or FASTA/JSON input to ancestral/translate should not result in changes in inference or mutation annotations on the tree. Ideally the outputs would also be the same, but for file size reasons this is not always possible.

How to reproduce

TODO: create a test to demonstrate this. Unfortunately the "simple-genome" tests I've recently added in PRs don't have an AA mutation shared across all sequences, which is what's needed here.

Possible solution

Allow JSON inputs to have a corresponding (nuc) reference sequence, and use this for the "reference" translations. (This is what VCF inputs do.) In this case we can also infer mutations on the root node. We could add an extra argument (mirroring VCF input) or use the reference.nuc key in the input JSON¹.
Remove the "reference" key when using JSON inputs without a provided reference sequence. This will be problematic for augur export v2 as it uses this to export root-sequences. (The names here get very confusing very fast.)
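
The two options could be combined roughly as sketched below. This is not augur's code or its output schema: `build_translate_output`, the flat dictionary layout and the `root_aa_mutations` key are hypothetical and only illustrate the decision. The "reference" key is emitted, and root mutations inferred, only when a reference translation is actually available.

```python
def diff(parent, child):
    """List substitutions such as 'K2R' (1-based) between equal-length AA sequences."""
    return [
        f"{a}{pos}{b}"
        for pos, (a, b) in enumerate(zip(parent, child), start=1)
        if a != b
    ]


def build_translate_output(root_aa, reference_aa=None):
    """Assemble a simplified, hypothetical output dictionary.

    root_aa:      {gene: AA sequence at the tree root}
    reference_aa: {gene: AA translation of a supplied reference}, or None
    """
    output = {"root_aa_sequences": dict(root_aa)}
    if reference_aa is not None:
        # Option 1: a real reference exists, so "reference" is accurate and
        # the root can be compared against it.
        output["reference"] = dict(reference_aa)
        output["root_aa_mutations"] = {
            gene: diff(reference_aa[gene], seq) for gene, seq in root_aa.items()
        }
    # Option 2: with no reference supplied, the "reference" key is omitted
    # rather than being filled with the root sequence.
    return output


with_ref = build_translate_output({"gene1": "MR*"}, {"gene1": "MK*"})
without_ref = build_translate_output({"gene1": "MR*"})
print(with_ref)
print(without_ref)
```

Any consumer that currently reads "reference" (the issue names augur export v2) would need a fallback for the second case, which is the drawback noted above.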

Your environment: if running Nextstrain locally

augur 23.1.1

Footnotes

¹ Ok this isn't quite true. The JSON (produced by augur ancestral) will have a json.reference.nuc sequence, but augur translate never reads it. Depending on how augur ancestral was run, this may be a reference sequence or the inferred sequence at the tree root.

² I think certain invocations of augur ancestral will produce nuc mutations on the root node. I don't know what augur translate will do in this case.


jameshadfield · 2024-01-25T01:34:22Z

Closed by b6d537e / PR #1355 / Augur release 24.0.0

jameshadfield added the bug label Dec 18, 2023

jameshadfield mentioned this issue Dec 18, 2023

[ancestral] reference seq may be reference or inferred tree root #1362

Open

jameshadfield closed this as completed Jan 25, 2024
