Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

augur tree modifying strain names without warning #1084

Closed
jameshadfield opened this issue Nov 1, 2022 · 9 comments · Fixed by #1750
Closed

augur tree modifying strain names without warning #1084

jameshadfield opened this issue Nov 1, 2022 · 9 comments · Fixed by #1750
Assignees
Labels
bug Something isn't working

Comments

@jameshadfield
Copy link
Member

Current Behavior

augur tree modifies strain names in the output tree and does not notify the user

Expected behavior

The strain names should be the same. If this is not possible (due to the newick schema) then we should throw an error (as downstream commands are inevitably going to fail).

How to reproduce

Steps to reproduce the current behavior:

  1. Alignment & metadata with a strain name such as Coted'Ivoire/IPCI-DVE-GR1901/2022 (which, BTW, comes from our nCoV provisioned files)
  2. augur tree --alignment aligned.fasta --tree-builder-args '-ninit 10 -n 4' --output tree_raw.nwk --nthreads 1
  3. No error / warning from augur tree
  4. Strain name in output newick tree is now Coted_Ivoire/IPCI-DVE-GR1901/2022

Your environment

augur 17.1.0

@jameshadfield jameshadfield added the bug Something isn't working label Nov 1, 2022
@anna-parker
Copy link
Collaborator

From looking into augur the issue is that iqtree changes certain characters in the names of input sequences - we have taken care of this for some characters but not the ' character. These changes are written to the .log file. I was trying to find a definitive list of characters that iqtree will change and I believe it would be better to apply the same check that iqtree does - i.e. if a character is not alphanumeric or is equal to '_', '-' or '.' it will be changed: https://github.com/Cibiv/IQ-TREE/blob/6776a95f15a2eccda2aa330497291dc246575995/utils/tools.cpp#L503. Alternatively we could read the changed node names from the iqtree log and print these as a warning. I will make a PR for the first approach.

@corneliusroemer
Copy link
Member

Thanks @anna-parker for tackling this and using the IQ-tree source code to make sure we're totally aligned!

You actually linked to the old repo that's unfortunately not yet been archived. This is the currently maintained IQtree repo: https://github.com/iqtree/iqtree2/blob/3bbc304263cb2f85574a9163e8f2e5c5b597a147/utils/tools.cpp#L585

@jameshadfield why are you still on Augur 17.1.0 :)

@genehack
Copy link
Contributor

(relocating discussion from #769…)

I'm not sure I understand what the antecedent of "this" in "prefer we don't do this" is

This: the -delim.* file(s). The reason we create them is to munge strain names. (Right? Is there something else they're for?) #1084 argues that this should happen upstream of augur tree and that augur tree should not change strain names, and error if it detects strain names which it believes IQ-TREE will change.

Sorry if I'm being obtuse — I don't see the "upstream" part anywhere in this issue; maybe it's implicit that if augur tree isn't doing it, something run before augur tree will need to?

It also looks like — unless I'm misunderstanding something, which is very possible — the name-munging is reversed and the resulting Bio.Phylo object (T in the above code block) gets returned and then written out?

I suspect maybe a trivial fix here is expanding the set of characters in escape_dict — unless there are some characters that are just not allowed in Newick trees? A quick experiment with adding ' to escape_dict and then building a tree from a align.mfa file with a ' in one of the deflines shows the identifier, with the (escaped) single quote present in the output tree file, so at least the rest of the toolchain doesn't mind the single quote…

@joverlee521
Copy link
Contributor

maybe it's implicit that if augur tree isn't doing it, something run before augur tree will need to?

I've run into error twice in the avian-flu data and ended up fixing the strain names in ingest. This is fine for our automated workflows, but can be an annoying gotcha for external users just doing ad-hoc analysis.

the name-munging is reversed and the resulting Bio.Phylo object (T in the above code block) gets returned and then written out?

I suspect maybe a trivial fix here is expanding the set of characters in escape_dict — unless there are some characters that are just not allowed in Newick trees?

+1 for expanding the set of characters in the escape_dict.

@jameshadfield jameshadfield self-assigned this Feb 10, 2025
@jameshadfield
Copy link
Member Author

jameshadfield commented Feb 10, 2025

+1 for expanding the set of characters in the escape_dict.

I'm about to submit a PR doing just this, however this approach won't be enough to close this issue. A ' (single quote) character, as in the original example in this issue, when written via

tree_success = Phylo.write(T, tree_fname, 'newick', format_branch_length='%1.8f')

will be encoded as \\' \' in the newick. We need to fix these issues upstream.

@genehack
Copy link
Contributor

+1 for expanding the set of characters in the escape_dict.
[snip]
will be encoded as \\' in the newick. We need to fix these issues upstream.

What's "upstream" in this context? Upstream in the ingest workflow or upstream in some non-Augur place?

@jameshadfield
Copy link
Member Author

What's "upstream" in this context? Upstream in the ingest workflow or upstream in some non-Augur place?

In this context I'm just saying that we should have alerted the user (or fixed the problem) before we get to augur tree. I don't know the best place - perhaps @joverlee521 has ideas?

@joverlee521
Copy link
Contributor

What's "upstream" in this context? Upstream in the ingest workflow or upstream in some non-Augur place?

In this context I'm just saying that we should have alerted the user (or fixed the problem) before we get to augur tree. I don't know the best place - perhaps @joverlee521 has ideas?

Hmm, maybe we could add a warning for invalid characters in Augur's centralized I/O functions (e.g. read_metadata and read_sequences) so that users get alerted early on in the workflow.

@genehack
Copy link
Contributor

What's "upstream" in this context? Upstream in the ingest workflow or upstream in some non-Augur place?

In this context I'm just saying that we should have alerted the user (or fixed the problem) before we get to augur tree. I don't know the best place - perhaps @joverlee521 has ideas?

Hmm, maybe we could add a warning for invalid characters in Augur's centralized I/O functions (e.g. read_metadata and read_sequences) so that users get alerted early on in the workflow.

+1 to the idea of centralizing the warning in the IO layer — would probably be nice to add something like augur curate fix_metadata (or maybe augur curate standardize_metadata_labels or something like that)?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
No open projects
Status: In Review
5 participants