Project phylogeny up tree if provided #471

Open
kwcantrell opened this issue Dec 19, 2020 · 20 comments

Comments

@kwcantrell
Collaborator

It would be nice to have empress automatically label internal nodes with taxonomy if a user provides a taxonomy .qza file. I thought we already had this feature, but after a discussion with Imran I realized it is not currently implemented.

@gibsramen
Collaborator

I'm not sure I understand how this would work as taxonomy != phylogeny.

@kwcantrell kwcantrell changed the title Project taxonomy up tree if provided Project phylogeny up tree if provided Dec 19, 2020
@kwcantrell
Collaborator Author

Thanks @gibsramen. I meant phylogeny, not taxonomy.

@tanaes
Collaborator

tanaes commented Dec 19, 2020

In this context, I think taxonomy is actually what you want -- what does it mean to project phylogeny up a phylogeny?

Taxonomic labels may or may not correspond to the estimated phylogenetic relationships, but in the case where there's no discordance (or in the case where discordance < some threshold), it is often nice to be able to inherit some external set of taxonomic labels using a phylogeny. Is that what you're meaning here?

@gwarmstrong
Member

Agree with @tanaes

For a concrete example, here is empress displaying taxonomy labels for a tip:
[Screenshot: Screen Shot 2020-12-19 at 10 22 46 AM]

And this is what it shows when a non-tip node is selected:
[Screenshot: Screen Shot 2020-12-19 at 10 23 01 AM]

Is the idea here that the internal nodes in the phylogeny could have some notion of the taxa that descend from it within the phylogeny?

If so, it could make sense to label internal nodes with the lowest common taxonomy ancestor of the nodes that descend from it in the phylogeny. This should be relatively easy to compute if the fields that represent levels of a taxa are relatively standardized within feature metadata. (forgive me, I am not super familiar with these q2-types at the moment).

But it could look something like this (note: this is pseudo-code):

for node in postOrderTraversal(tree):
    if not isLeaf(node):
        for level in node.taxonomy:
            # label the node only if every child agrees at this level
            allSame, value = allChildrenHaveSameValue(node, level)
            if allSame:
                node.taxonomy[level] = value

I would imagine this is similar to what is currently being done for collapsing clades.

This could also extend to projecting other metadata fields. In general, we would need to be careful about fields where projecting up the tree would not make sense (like confidence scores).
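A runnable version of the pseudo-code above might look like the sketch below. The `Node` class, `postorder` and `project_taxonomy` are hypothetical names for illustration, not Empress's tree API, and the sketch assumes the taxonomy levels are given from the highest rank to the lowest. Unlike the pseudo-code, it stops at the first level where the children disagree, which is an added assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List


@dataclass
class Node:
    """Hypothetical tree node; not Empress's internal tree class."""
    name: str
    taxonomy: Dict[str, str] = field(default_factory=dict)
    children: List["Node"] = field(default_factory=list)


def postorder(node: Node) -> Iterator[Node]:
    """Yield every node below `node` before `node` itself."""
    for child in node.children:
        yield from postorder(child)
    yield node


def project_taxonomy(root: Node, levels: List[str]) -> None:
    """Label internal nodes with the levels on which all children agree.

    `levels` is assumed to be ordered from the highest rank to the lowest.
    """
    for node in postorder(root):
        if not node.children:
            continue  # tips keep the labels the user supplied
        for level in levels:
            values = {child.taxonomy.get(level) for child in node.children}
            if len(values) == 1 and None not in values:
                node.taxonomy[level] = values.pop()
            else:
                break  # assumption: stop at the first level that disagrees


# Tiny example tree: root -> (inner -> (a, c), b)
a = Node("a", {"Phylum": "p__Firmicutes", "Class": "c__Clostridia"})
b = Node("b", {"Phylum": "p__Firmicutes", "Class": "c__Bacilli"})
c = Node("c", {"Phylum": "p__Firmicutes", "Class": "c__Clostridia"})
inner = Node("inner", children=[a, c])
root = Node("root", children=[inner, b])
project_taxonomy(root, ["Phylum", "Class"])
print(inner.taxonomy)
print(root.taxonomy)
```

Because children are visited before their parent, an internal node can reuse the labels already projected onto its internal children instead of rescanning all of its tips.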

@kwcantrell
Collaborator Author

Is the idea here that the internal nodes in the phylogeny could have some notion of the taxa that descend from it within the phylogeny?

If so, it could make sense to label internal nodes with the lowest common taxonomy ancestor of the nodes that descend from it in the phylogeny. This should be relatively easy to compute if the fields that represent levels of a taxa are relatively standardized within feature metadata. (forgive me, I am not super familiar with these q2-types at the moment).

@gwarmstrong that is the idea. Basically, label each internal node with the lowest common taxonomy of its tips.

@gibsramen
Collaborator

Related to what @tanaes mentioned, is there literature on what a good threshold would be in this case? We could maybe add an input users could specify, but dunno what a good default would be.

@kwcantrell
Collaborator Author

I guess I am not familiar enough with how the taxonomy is calculated to properly comment on this. But I would assume that the taxonomic level of internal nodes would match the lowest shared taxonomic level of its tips.

@tanaes
Collaborator

tanaes commented Dec 19, 2020

Worth taking a look at Tax2Tree from our very own @wasade!

@gwarmstrong
Member

For 16S taxonomy classification, taking a peek at Tax2Tree, as @tanaes mentioned, as well as https://github.com/qiime2/q2-feature-classifier should yield some answers. IIRC, aligning the ASV/OTU/etc. sequence against a reference, or using some other method such as Naive Bayes to estimate the probability that a sequence is from a specific taxon, are different ways one can classify taxonomy.

For metagenomics, you could take a look at woltka, kraken2, metaphlan2 just for starters on the myriad ways that taxonomy is calculated, all with their own metrics on what constitutes a "good" hit for taxonomy.

However, Empress is sequence/technology agnostic. So anything that estimates the taxonomy of some internal node using the sequence features is probably off the table (and should be, because this makes generalizing across 16S and metagenomics more difficult, or even across methods within a given sequencing technology).

I think the most general thing we could do here is expose the same feature projections used for feature metadata clade collapsing.

@wasade
Member

wasade commented Dec 21, 2020

q2-feature-classifier won't place internal node labels. tax2tree will place labels on internal nodes and handle contention in placements. Its inputs are a phylogeny and a file containing tip -> lineage strings, and it is agnostic to 16S/WGS. A visual example of the algorithm can be seen here

@wasade
Member

wasade commented Dec 21, 2020

...it will be more robust than the feature metadata clade collapsing. LCA does not work well for this, and getting the nesting of taxonomy ranks correct on placement can get tedious

@gwarmstrong
Member

So it seems like tax2tree differs from LCA in that tax2tree (feel free to confirm/deny):

  1. Does not require all descendants of an internal node to share the same label at a given level. (e.g., an internal node could be assigned to g__Clostridium, even if it has some small proportion of descendants from g__Dorea).
  2. Newly labeled internal nodes can be used to label unlabeled descendants (known as back-filling).

I think this raises some important points:

  • Does empress already allow for internal node labels to be provided by feature metadata? e.g., if a user has a method for labeling internal nodes, can they supply these labels?
    • If so, what happens when their feature metadata disagrees with, say, the method for feature metadata collapsing? i.e., how do we resolve conflicts between user supplied data and inferred data?
  • For any candidate method, if it supplements the information provided by the user, how do we send a message back to the user that helps them differentiate between information they provided, and the inferred information?

@wasade
Member

wasade commented Dec 21, 2020

It computes an f-measure based on the observed names which descend relative to the full tree. It can replicate names if needed, as is necessary for polyphyletic groups like Clostridia, or can just place a name singularly based on the maximum f-score.
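One way to read that f-measure, as a sketch and not tax2tree's actual code: for a candidate name at a node, precision is the fraction of tips under the node that carry the name, and recall is the fraction of all tips carrying that name in the full tree that fall under the node. The function name and arguments below are hypothetical.

```python
from typing import Sequence


def f_measure(names_under_node: Sequence[str], name: str,
              total_with_name: int) -> float:
    """F-measure of `name` at a node (illustrative reading, see lead-in).

    names_under_node: the name each tip below the node carries at one rank.
    total_with_name: number of tips in the whole tree carrying `name`.
    """
    hits = sum(1 for n in names_under_node if n == name)
    if hits == 0:
        return 0.0
    precision = hits / len(names_under_node)  # purity of the clade
    recall = hits / total_with_name           # share of the name captured
    return 2 * precision * recall / (precision + recall)


# A clade with nine g__Clostridium tips and one g__Dorea tip, in a tree
# that holds ten g__Clostridium tips overall.
clade = ["g__Clostridium"] * 9 + ["g__Dorea"]
score = f_measure(clade, "g__Clostridium", total_with_name=10)
print(round(score, 3))
```

Under this reading, a node can still score well for a name even when a small share of its descendants carry a different one, which is where this approach departs from a strict LCA.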

Backfilling is different from labeling unlabeled descendants. An unlabeled descendant's lineage is based on the observed taxa names in the path from tip -> root. Importantly, the re-labeling may change the original descendant's lineage, and this is a good thing as taxonomy != phylogeny and, particularly for reference databases, the lineages applied to input records may be incorrect.

Backfilling is used to recover gaps that may arise. For example, if you have an internal node labeled "c__Clostridia", and between it and the root, there is "d__Bacteria" but no phylum name, then we have a gap in the taxonomy. It does not make sense to have a domain and class name without a phylum name. The input lineage information can be used to reconcile this, assuming the input taxonomy is rational. In this example, we can safely infer that "p__Firmicutes" exists in that path as "c__Clostridia" are nested within "p__Firmicutes" (...unless the input taxonomy suggests otherwise...). However, we cannot determine what the correct node for "p__Firmicutes" is; as such, the most conservative placement is chosen, which is the node already containing the "c__Clostridia" label.
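The gap-filling step in this example can be sketched as follows. This is an illustration of the idea only, not tax2tree's implementation; `backfill` and the `parent_of` mapping (child name -> parent name, derived from the input lineages) are hypothetical, and the input taxonomy is assumed to be a tree.

```python
from typing import Dict, List


def backfill(ancestor_name: str, node_name: str,
             parent_of: Dict[str, str]) -> List[str]:
    """Return the names to place on the node labeled `node_name`.

    Walks the input taxonomy upward from `node_name` until it reaches
    `ancestor_name` (the nearest name already placed toward the root) and
    collects every skipped rank, so the gap is filled on the same node.
    """
    chain = [node_name]
    current = node_name
    while (parent_of.get(current) is not None
           and parent_of[current] != ancestor_name):
        current = parent_of[current]
        chain.append(current)
    if parent_of.get(current) != ancestor_name:
        # The input taxonomy does not nest node_name under ancestor_name.
        raise ValueError(f"{node_name} is not nested in {ancestor_name}")
    return list(reversed(chain))


parent_of = {"c__Clostridia": "p__Firmicutes", "p__Firmicutes": "d__Bacteria"}
labels = backfill("d__Bacteria", "c__Clostridia", parent_of)
print(labels)
```

The missing phylum ends up on the node that already holds the class label, matching the conservative placement described above.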

@ElDeveloper
Member

  • Does empress already allow for internal node labels to be provided by feature metadata? e.g., if a user has a method for labeling internal nodes, can they supply these labels?

Yes, the feature metadata input to Empress can refer to internal nodes or to tips. More frequently it refers exclusively to tips.

self.tip_md, self.int_md = filter_feature_metadata_to_tree(

  • If so, what happens when their feature metadata disagrees with, say, the method for feature metadata collapsing? i.e., how do we resolve conflicts between user supplied data and inferred data?

Good call. An implementation of any solution to this problem should account for existing metadata and only offer this "convenience" method when there is no internal node metadata. For example, you can picture a situation where the "bare" internal node view is shown to the user with an option to "infer metadata from descendants". Clicking on a control like that should then infer the metadata, style the resulting values in a different color, and show a warning that explains why they might want to exercise caution.
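As a rough sketch of that rule (hypothetical function and labels, not Empress code): user-supplied internal node metadata always wins, inferred values are only used when none was supplied, and every value carries a source tag so the UI can style inferred values differently and show the warning.

```python
from typing import Dict, Tuple


def resolve_internal_metadata(
    user_md: Dict[str, str], inferred_md: Dict[str, str]
) -> Dict[str, Tuple[str, str]]:
    """Map node -> (value, source), where source is "user" or "inferred"."""
    if user_md:
        # Internal node metadata was supplied: do not mix in inferred values.
        return {node: (value, "user") for node, value in user_md.items()}
    return {node: (value, "inferred") for node, value in inferred_md.items()}


inferred = {"n1": "p__Firmicutes"}
print(resolve_internal_metadata({}, inferred))
print(resolve_internal_metadata({"n1": "p__Bacteroidetes"}, inferred))
```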


Thanks for chiming in everyone, this is very helpful! 🌳

@fedarko
Collaborator

fedarko commented Dec 22, 2020

Just popping in (agreed with @ElDeveloper, this is an awesome discussion :D) --

If so, what happens when their feature metadata disagrees with, say, the method for feature metadata collapsing? i.e., how do we resolve conflicts between user supplied data and inferred data?

The default in EMPress' feature metadata coloring / clade collapsing is only respecting the feature metadata provided for the tips. It's possible to color by internal nodes' feature metadata, but doing this turns off the "propagation" of shared feature metadata up the tree, ensuring that conflicts are handled explicitly.

Default (don't use internal node feature metadata, and do "propagation"):
[image]

Allow coloring by internal node feature metadata, but disable "propagation":
[image]

Whatever solution(s) we end up going with for this, I agree that we should liberally show warnings that inferring things in this way is just an approximation and not the ground truth.


As a sidenote: this discussion brings up the mildly wonky point that, currently, EMPress treats each feature metadata field (including the various levels of taxonomy) as its own independent thing, ignoring other metadata fields. This means that, for example, if you color by Level 7 (species) in a 16S dataset using the default QIIME color map, you'll probably see a lot of clades of the tree colored as red due to all of the tips in the clade sharing a species classification of s__, even if they're from different genera/families/etc:

[image: yike]

Addressing this would definitely be possible, by for example representing the values in each Level N string as the full taxonomy to that point (e.g. setting Level 7 to k__Bacteria; p__Firmicutes; c__Somecoolclass; o__Ogeezimrunningoutoftaxonomynamesiknow; f__Isanyonereadingthis; g__Himom; s__ instead of just s__) -- in some ways this is similar to a point @antgonza raised a few weeks ago in #422.
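A minimal sketch of that idea, assuming lineage strings are semicolon-separated as in the example above (`cumulative_levels` is a hypothetical helper, not something Empress provides):

```python
from typing import List


def cumulative_levels(lineage: str, sep: str = "; ") -> List[str]:
    """Return one value per level, each holding the full path to that level."""
    parts = [p.strip() for p in lineage.split(";") if p.strip()]
    return [sep.join(parts[: i + 1]) for i in range(len(parts))]


x = cumulative_levels("k__Bacteria; p__Firmicutes; g__Dorea; s__")
y = cumulative_levels("k__Bacteria; p__Bacteroidetes; g__Bacteroides; s__")
print(x[-1])
print(y[-1])
```

With plain level values both tips would share `s__` at the last level; with cumulative values the two strings differ, so the tips would no longer be grouped together.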

@wasade
Member

wasade commented Dec 22, 2020

Right, s__ is effectively null, so s__ != s__

@fedarko
Collaborator

fedarko commented Dec 22, 2020

It can also be a problem with "real" names, unfortunately -- @lisa55asil brought this up in the context of Qurro a while back, there's fun stuff like P. gingivalis and H. gingivalis...

@tanaes
Collaborator

tanaes commented Dec 22, 2020 via email

@wasade
Member

wasade commented Dec 22, 2020

The species names should use genus / species to account for these scenarios. It should not be a problem for other portions of the taxonomy, unless the taxonomy is malformed. It would be crazy for c__Clostridia to associate with p__Firmicutes and p__Bacteroidetes, for example. tax2tree tests and requires that the input taxonomy is actually a tree, so this scenario should be protected against already.

@wasade
Member

wasade commented Dec 22, 2020

...sorry, it's been a few years since looking at the code, the verification that the taxonomy is hierarchical may come from t2t validate
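For illustration, a check that the input taxonomy is hierarchical could look like the sketch below. This is not how `t2t validate` is implemented; it is a hypothetical helper that reports any name appearing under more than one parent, again assuming semicolon-separated lineage strings.

```python
from typing import Dict, Iterable, List, Set


def find_conflicting_parents(lineages: Iterable[str]) -> Dict[str, List[str]]:
    """Return {name: sorted parents} for names with more than one parent."""
    parents: Dict[str, Set[str]] = {}
    for lineage in lineages:
        names = [n.strip() for n in lineage.split(";") if n.strip()]
        # Pair each name with the name directly above it in the lineage.
        for parent, child in zip(names, names[1:]):
            parents.setdefault(child, set()).add(parent)
    return {c: sorted(p) for c, p in parents.items() if len(p) > 1}


bad = find_conflicting_parents([
    "d__Bacteria; p__Firmicutes; c__Clostridia",
    "d__Bacteria; p__Bacteroidetes; c__Clostridia",
])
print(bad)
```

Note that an empty `s__`, or a bare species epithet shared by two genera, would be reported as a conflict by this check, which is one more reason to write species names as genus plus species.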


7 participants