Project phylogeny up tree if provided #471

kwcantrell · 2020-12-19T02:52:44Z

It would be nice to have empress automatically label internal nodes with a taxonomy if a user provided a taxonomy .qza file. I thought we had this feature already implemented, but after a discussion with Imran, I realized it is not currently implemented.

gibsramen · 2020-12-19T16:29:44Z

I'm not sure I understand how this would work as taxonomy != phylogeny.

kwcantrell · 2020-12-19T16:34:50Z

Thanks @gibsramen. I meant phylogeny, not taxonomy.

tanaes · 2020-12-19T17:37:54Z

In this context, I think taxonomy is actually what you want -- what does it mean to project phylogeny up a phylogeny?

Taxonomic labels may or may not correspond to the estimated phylogenetic relationships, but in the case where there's no discordance (or in the case where discordance < some threshold), it is often nice to be able to inherit some external set of taxonomic labels using a phylogeny. Is that what you're meaning here?

gwarmstrong · 2020-12-19T18:43:57Z

Agree with @tanaes

For a concrete example, here is empress displaying taxonomy labels for a tip:

And this is what it shows when a non-tip node is selected:

Is the idea here that the internal nodes in the phylogeny could have some notion of the taxa that descend from it within the phylogeny?

If so, it could make sense to label internal nodes with the lowest common taxonomy ancestor of the nodes that descend from it in the phylogeny. This should be relatively easy to compute if the fields that represent levels of a taxa are relatively standardized within feature metadata. (forgive me, I am not super familiar with these q2-types at the moment).

But it could look something like this (note: this is pseudo-code):

for node in postOrderTraversal(tree):
    if not isLeaf(node):
        for level in node.taxonomy:
            allSame, value = allChildrenHaveSameValue(node, level)
            if allSame:
                node.taxonomy[level] = value

I would imagine this is similar to what is currently being done for collapsing clades.

This could also extend to projecting other metadata fields. In general, we would need to be careful of places where it would not make sense to project the field up the tree (like confidence scores).
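
A runnable Python take on the pseudo-code above might look like the following. This is only a minimal sketch: `Node`, `RANKS`, and `project_taxonomy` are hypothetical names for illustration (not Empress APIs), and it assumes every tip carries a dict mapping rank to value.

```python
from dataclasses import dataclass, field

RANKS = ["kingdom", "phylum", "class"]  # hypothetical ranks, ordered high -> low


@dataclass
class Node:
    name: str
    children: list = field(default_factory=list)
    taxonomy: dict = field(default_factory=dict)


def postorder(node):
    """Yield every node below `node` before `node` itself."""
    for child in node.children:
        yield from postorder(child)
    yield node


def project_taxonomy(root):
    """Label each internal node with the ranks all of its children share."""
    for node in postorder(root):
        if not node.children:
            continue  # tips keep the taxonomy the user supplied
        for rank in RANKS:
            values = {child.taxonomy.get(rank) for child in node.children}
            if len(values) != 1 or None in values:
                break  # stop at the first rank where the children disagree
            node.taxonomy[rank] = values.pop()


tips = [
    Node("t1", taxonomy={"kingdom": "k__Bacteria", "phylum": "p__Firmicutes", "class": "c__Clostridia"}),
    Node("t2", taxonomy={"kingdom": "k__Bacteria", "phylum": "p__Firmicutes", "class": "c__Bacilli"}),
    Node("t3", taxonomy={"kingdom": "k__Bacteria", "phylum": "p__Bacteroidetes", "class": "c__Bacteroidia"}),
]
inner = Node("n1", children=tips[:2])
root = Node("root", children=[inner, tips[2]])
project_taxonomy(root)
```

Unlike the pseudo-code, the sketch stops at the first rank where the children disagree, so a lower rank is never projected without the ranks above it.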

kwcantrell · 2020-12-19T18:53:36Z

> Is the idea here that the internal nodes in the phylogeny could have some notion of the taxa that descend from it within the phylogeny?
>
> If so, it could make sense to label internal nodes with the lowest common taxonomy ancestor of the nodes that descend from it in the phylogeny. This should be relatively easy to compute if the fields that represent levels of a taxa are relatively standardized within feature metadata. (forgive me, I am not super familiar with these q2-types at the moment).

@gwarmstrong that is the idea. Basically, label internal nodes with the lowest common taxonomy of its tips.

gibsramen · 2020-12-19T18:57:19Z

Related to what @tanaes mentioned, is there literature on what a good threshold would be in this case? We could maybe add an input users could specify, but dunno what a good default would be.

kwcantrell · 2020-12-19T19:01:00Z

I guess I am not familiar enough with how the taxonomy is calculated to properly comment on this. But I would assume that the taxonomic level of internal nodes would match the lowest shared taxonomic level of its tips.

tanaes · 2020-12-19T19:20:40Z

Worth taking a look at Tax2Tree from our very own @wasade!

gwarmstrong · 2020-12-19T19:25:03Z

For 16S taxonomy classification, taking a peek at Tax2Tree, as @tanaes mentioned, as well as https://github.com/qiime2/q2-feature-classifier should yield some answers. IIRC, aligning the ASV/OTU/etc. sequence against a reference, or using some other method such as Naive Bayes to estimate the probability that a sequence is from a specific taxon, are different ways one can classify taxonomy.

For metagenomics, you could take a look at woltka, kraken2, metaphlan2 just for starters on the myriad ways that taxonomy is calculated, all with their own metrics on what constitutes a "good" hit for taxonomy.

However, Empress is sequence/technology agnostic. So anything that estimates the taxonomy of some internal node using the sequence features is probably off the table (and should be, because this makes generalizing across 16S and metagenomics more difficult, or even across methods within a given sequencing technology).

I think the most general thing we could do here is expose the same feature projections used for feature metadata clade collapsing.

wasade · 2020-12-21T16:51:35Z

q2-feature-classifier won't place internal node labels. tax2tree will place labels on internal nodes and contention in placements. Its inputs are a phylogeny and a file containing tip -> lineage strings, and it is agnostic to 16S/WGS. A visual example of the algorithm can be seen here

wasade · 2020-12-21T16:53:38Z

...it will be more robust than the feature metadata clade collapsing. LCA does not work well for this, and getting the nesting of taxonomy ranks correct on placement can get tedious

gwarmstrong · 2020-12-21T18:00:25Z

So it seems like tax2tree differs from LCA in that tax2tree (feel free to confirm/deny):

- Does not require all descendants of an internal node to share the same label at a given level. (e.g., an internal node could be assigned to g__Clostridium, even if it has some small proportion of descendants from g__Dorea).
- Newly labeled internal nodes can be used to label unlabeled descendants (known as back-filling).

I think this raises some important points:

- Does empress already allow for internal node labels to be provided by feature metadata? e.g., if a user has a method for labeling internal nodes, can they supply these labels?
  - If so, what happens when their feature metadata disagrees with, say, the method for feature metadata collapsing? i.e., how do we resolve conflicts between user supplied data and inferred data?
- For any candidate method, if it supplements the information provided by the user, how do we send a message back to the user that helps them differentiate between information they provided, and the inferred information?

wasade · 2020-12-21T18:21:40Z

It computes an f-measure based on the observed names which descend relative to the full tree. It can replicate names if needed, as is necessary for polyphyletic groups like Clostridia, or can just place a name singularly based on the maximum f-score.
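
As a rough illustration of the scoring idea (a sketch under my own assumptions, not tax2tree's actual code), an F-measure for placing a name on a node can be computed from tip counts. The function name, its parameters, and the toy counts below are all hypothetical.

```python
def f_measure(named_below, tips_below, named_in_tree):
    """F-measure for placing a taxon name on an internal node.

    named_below   -- tips under the node whose lineage contains the name
    tips_below    -- all tips under the node
    named_in_tree -- tips in the whole tree whose lineage contains the name
    """
    if named_below == 0:
        return 0.0
    precision = named_below / tips_below  # how pure the clade is for this name
    recall = named_below / named_in_tree  # how much of the name the clade captures
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for one genus name at two candidate nodes.
candidates = {
    "node_a": f_measure(named_below=8, tips_below=10, named_in_tree=10),
    "node_b": f_measure(named_below=10, tips_below=40, named_in_tree=10),
}
best = max(candidates, key=candidates.get)
```

Here `node_a` wins even though two tips carrying the name sit outside it, which is the kind of tolerance a strict LCA rule does not have.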

Backfilling is different from labeling unlabeled descendants. An unlabeled descendant's lineage is based on the observed taxa names in the path from tip -> root. Importantly, the re-labeling may change the original descendant's lineage, and this is a good thing: taxonomy != phylogeny and, particularly for reference databases, the lineages applied to input records may be incorrect.

Backfilling is used to recover gaps that may arise. For example, if you have an internal node labeled "c__Clostridia", and between it and the root, there is "d__Bacteria" but no phylum name, then we have a gap in the taxonomy. It does not make sense to have a domain and class name without a phylum name. The input lineage information can be used to reconcile this, assuming the input taxonomy is rational. In this example, we can safely infer that "p__Firmicutes" exists in that path as "c__Clostridia" are nested within "p__Firmicutes" (...unless the input taxonomy suggests otherwise...). However, we cannot determine what the correct node for "p__Firmicutes" is; as such, the most conservative placement is chosen, which is the node already containing the "c__Clostridia" label.
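
The gap-filling step described above can be sketched as follows, assuming the input lineages give each name a single parent. `parent_of` and `backfill` are hypothetical names, and this illustrates the idea rather than reproducing tax2tree's implementation.

```python
# Parent of each name, as derived from the input lineages (toy data).
parent_of = {
    "c__Clostridia": "p__Firmicutes",
    "p__Firmicutes": "d__Bacteria",
}


def backfill(node_labels, ancestor_labels):
    """Return node_labels with any skipped ranks filled in.

    node_labels     -- names already placed on this node, ordered high -> low
    ancestor_labels -- names placed on nodes between this node and the root
    """
    seen = set(ancestor_labels) | set(node_labels)
    filled = list(node_labels)
    current = filled[0] if filled else None
    while current in parent_of:
        parent = parent_of[current]
        if parent in seen:
            break  # the lineage above this point is already present
        filled.insert(0, parent)  # most conservative: same node as the child
        seen.add(parent)
        current = parent
    return filled


result = backfill(["c__Clostridia"], ["d__Bacteria"])
```

With the toy data, the node that held only "c__Clostridia" ends up carrying "p__Firmicutes" as well, and "d__Bacteria" stays where it was.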

ElDeveloper · 2020-12-21T22:07:23Z

> Does empress already allow for internal node labels to be provided by feature metadata? e.g., if a user has a method for labeling internal nodes, can they supply these labels?

Yes, the feature metadata passed to Empress can refer to internal nodes or to tips. More frequently it refers exclusively to tips.

empress/empress/core.py

Line 214 in d0a46ed

self.tip_md, self.int_md = filter_feature_metadata_to_tree(

> If so, what happens when their feature metadata disagrees with, say, the method for feature metadata collapsing? i.e., how do we resolve conflicts between user supplied data and inferred data?

Good call. An implementation of any solution to this problem should account for existing metadata and only offer this "convenience" method when there is no internal node metadata. For example, you can picture a situation where the "bare" internal node view is shown to the user with an option to "infer metadata from descendants". Clicking on a control like that should then infer the metadata, style the resulting values in a different color, and show a warning that explains why they might want to exercise caution.
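
One way to keep supplied and inferred values apart is to store a provenance flag next to each value, so the UI can style them differently. The sketch below is hypothetical (`merge_node_metadata` does not exist in Empress) and is a slightly more permissive per-field variant: it fills in only the fields the user left empty.

```python
def merge_node_metadata(supplied, inferred):
    """Combine metadata for one internal node, preferring supplied values.

    Returns {field: (value, source)} where source is "supplied" or "inferred".
    """
    merged = {name: (value, "supplied") for name, value in supplied.items()}
    for name, value in inferred.items():
        # Never overwrite what the user provided.
        merged.setdefault(name, (value, "inferred"))
    return merged


merged = merge_node_metadata(
    supplied={"Level 2": "p__Firmicutes"},
    inferred={"Level 1": "k__Bacteria", "Level 2": "p__Bacteroidetes"},
)
```

Anything tagged "inferred" could then be drawn in a different color and paired with the warning.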

Thanks for chiming in everyone, this is very helpful! 🌳

fedarko · 2020-12-22T02:23:54Z

Just popping in (agreed with @ElDeveloper, this is an awesome discussion :D) --

> If so, what happens when their feature metadata disagrees with, say, the method for feature metadata collapsing? i.e., how do we resolve conflicts between user supplied data and inferred data?

The default in EMPress' feature metadata coloring / clade collapsing is only respecting the feature metadata provided for the tips. It's possible to color by internal nodes' feature metadata, but doing this turns off the "propagation" of shared feature metadata up the tree, ensuring that conflicts are handled explicitly.

Default (don't use internal node feature metadata, and do "propagation"):

Allow coloring by internal node feature metadata, but disable "propagation":

Whatever solution(s) we end up going with for this, I agree that we should liberally show warnings that inferring things in this way is just an approximation and not the ground truth.

As a sidenote: this discussion brings up the mildly wonky point that, currently, EMPress treats each feature metadata field (including the various levels of taxonomy) as its own independent thing, ignoring other metadata fields. This means that, for example, if you color by Level 7 (species) in a 16S dataset using the default QIIME color map, you'll probably see a lot of clades of the tree colored as red due to all of the tips in the clade sharing a species classification of s__, even if they're from different genera/families/etc:

Addressing this would definitely be possible, by for example representing the values in each Level N string as the full taxonomy to that point (e.g. setting Level 7 to k__Bacteria; p__Firmicutes; c__Somecoolclass; o__Ogeezimrunningoutoftaxonomynamesiknow; f__Isanyonereadingthis; g__Himom; s__ instead of just s__) -- in some ways this is similar to a point @antgonza raised a few weeks ago in #422.
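
A minimal sketch of that idea, assuming lineages are "; "-separated strings: each Level N value becomes the lineage up to and including rank N. `cumulative_levels` is a hypothetical helper, not something in EMPress.

```python
def cumulative_levels(lineage, sep="; "):
    """Map 'Level N' to the full lineage up to and including rank N."""
    ranks = [part.strip() for part in lineage.split(";")]
    return {
        f"Level {i}": sep.join(ranks[:i])
        for i in range(1, len(ranks) + 1)
    }


a = cumulative_levels("k__Bacteria; p__Firmicutes; c__Clostridia; s__")
b = cumulative_levels("k__Bacteria; p__Bacteroidetes; c__Bacteroidia; s__")
```

Both lineages end in a bare s__, but their last-level values now differ, so the two tips would no longer be treated as sharing a species.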

wasade · 2020-12-22T02:30:08Z

Right, s__ is effectively null, so s__ != s__

fedarko · 2020-12-22T02:39:12Z

It can also be a problem with "real" names, unfortunately -- @lisa55asil brought this up in the context of Qurro a while back, there's fun stuff like P. gingivalis (Porphyromonas gingivalis) and H. gingivalis (Halicephalobus gingivalis)...

tanaes · 2020-12-22T13:46:49Z

Yes, you definitely want to use full taxonomy strings (or equivalent) in this scenario! Well-defined taxonomies, like the NCBI taxonomy, have identifiers assigned to each unique taxon level name that are probably what you want to use for this purpose. Having the capacity to handle an explicit external taxonomy in this way will probably enable all sorts of other useful applications.


wasade · 2020-12-22T16:52:40Z

The species names should use genus / species to account for these scenarios. It should not be a problem for other portions of the taxonomy, unless the taxonomy is malformed. It would be crazy for c__Clostridia to associate with p__Firmicutes and p__Bacteroidetes, for example. tax2tree tests and requires that the input taxonomy is actually a tree, so this scenario should already be protected against.

wasade · 2020-12-22T16:55:35Z

...sorry, it's been a few years since looking at the code, the verification that the taxonomy is hierarchical may come from t2t validate
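
For illustration, a check that a set of lineages is hierarchical could look like the sketch below. It is an assumption about what such a check does, not the code behind t2t validate, and `find_conflicting_parents` is a hypothetical name.

```python
def find_conflicting_parents(lineages):
    """Return {name: set_of_parents} for names with more than one parent."""
    parents = {}
    for lineage in lineages:
        ranks = [part.strip() for part in lineage.split(";")]
        # Pair each name with the name one rank above it.
        for parent, child in zip(ranks, ranks[1:]):
            parents.setdefault(child, set()).add(parent)
    return {name: ps for name, ps in parents.items() if len(ps) > 1}


conflicts = find_conflicting_parents([
    "k__Bacteria; p__Firmicutes; c__Clostridia",
    "k__Bacteria; p__Bacteroidetes; c__Clostridia",  # malformed on purpose
])
```

A check like this would also flag bare species epithets that recur under different genera, and empty placeholders such as s__, unless species names include the genus.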

kwcantrell changed the title ~~Project taxonomy up tree if provided~~ Project phylogeny up tree if provided Dec 19, 2020

fedarko added feature request question labels Dec 22, 2020

fedarko mentioned this issue Jan 19, 2021

Respect ancestors in feature metadata coloring/propagation #473

Closed
