Project phylogeny up tree if provided #471
Comments
I'm not sure I understand how this would work as taxonomy != phylogeny. |
Thanks @gibsramen. I meant phylogeny, not taxonomy. |
In this context, I think taxonomy is actually what you want -- what does it mean to project phylogeny up a phylogeny? Taxonomic labels may or may not correspond to the estimated phylogenetic relationships, but in the case where there's no discordance (or in the case where discordance < some threshold), it is often nice to be able to inherit some external set of taxonomic labels using a phylogeny. Is that what you're meaning here? |
Agree with @tanaes. For a concrete example, here is empress displaying taxonomy labels for a tip: (screenshot). And this is what it shows when a non-tip node is selected: (screenshot). Is the idea here that the internal nodes in the phylogeny could have some notion of the taxa that descend from them within the phylogeny? If so, it could make sense to label each internal node with the lowest common taxonomic ancestor of the nodes that descend from it in the phylogeny. This should be relatively easy to compute if the fields that represent the levels of a taxon are relatively standardized within feature metadata. (forgive me, I am not super familiar with these q2-types at the moment). But it could look something like this (note: this is pseudo-code):
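A minimal Python sketch of the lowest-common-taxonomy idea described above. This is an illustration only, not the commenter's pseudo-code and not Empress code. It assumes the tree is given as a plain `children` dict and that each tip has a semicolon-delimited lineage string; the function names `parse_lineage`, `common_prefix`, and `label_internal_nodes` are hypothetical.

```python
from typing import Dict, List


def parse_lineage(lineage: str) -> List[str]:
    """Split 'k__A; p__B; c__C' into ['k__A', 'p__B', 'c__C']."""
    return [part.strip() for part in lineage.split(";") if part.strip()]


def common_prefix(lineages: List[List[str]]) -> List[str]:
    """Return the longest run of leading ranks shared by all lineages."""
    prefix = list(lineages[0])
    for lineage in lineages[1:]:
        shared = 0
        for a, b in zip(prefix, lineage):
            if a != b:
                break
            shared += 1
        prefix = prefix[:shared]
    return prefix


def label_internal_nodes(
    children: Dict[str, List[str]],
    tip_taxonomy: Dict[str, str],
    root: str,
) -> Dict[str, List[str]]:
    """Label every node with the taxonomy shared by all tips below it.

    Assumption: `children` maps a node name to its child names, and any
    node without children is a tip with an entry in `tip_taxonomy`.
    Uses recursion, so very deep trees would need an iterative postorder.
    """
    labels: Dict[str, List[str]] = {}

    def visit(node: str) -> List[str]:
        kids = children.get(node, [])
        if not kids:
            labels[node] = parse_lineage(tip_taxonomy[node])
        else:
            labels[node] = common_prefix([visit(kid) for kid in kids])
        return labels[node]

    visit(root)
    return labels


children = {"root": ["n1", "t3"], "n1": ["t1", "t2"]}
tip_taxonomy = {
    "t1": "k__Bacteria; p__Firmicutes; c__Clostridia",
    "t2": "k__Bacteria; p__Firmicutes; c__Bacilli",
    "t3": "k__Bacteria; p__Bacteroidetes; c__Bacteroidia",
}
labels = label_internal_nodes(children, tip_taxonomy, "root")
print(labels["n1"])    # shared down to phylum
print(labels["root"])  # shared only at kingdom
```

Because the common prefix of the children's prefixes equals the common prefix over all descendant tips, a single postorder pass is enough.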
I would imagine this is similar to what is currently being done for collapsing clades. This could also extend to projecting other metadata fields. In general, we would need to be careful of places where it would not make sense to project the field up the tree (like confidence scores). |
@gwarmstrong that is the idea. Basically, label internal nodes with the lowest common taxonomy of its tips. |
Related to what @tanaes mentioned, is there literature on what a good threshold would be in this case? We could maybe add an input users could specify, but I don't know what a good default would be. |
I guess I am not familiar enough with how the taxonomy is calculated to properly comment on this. But I would assume that the taxonomic level of internal nodes would match the lowest shared taxonomic level of its tips. |
For 16S taxonomy classification, taking a peek at Tax2Tree as @tanaes mentioned, as well as https://github.com/qiime2/q2-feature-classifier should yield some answers. IIRC aligning the ASV/OTU/etc sequence against a reference, or using some other method, such as Naive Bayes to estimate the probability that a sequence is from a specific taxon, are different ways one can classify taxonomy. For metagenomics, you could take a look at woltka, kraken2, metaphlan2 just for starters on the myriad ways that taxonomy is calculated, all with their own metrics on what constitutes a "good" hit for taxonomy. However, Empress is sequence/technology agnostic. So anything that estimates the taxonomy of some internal node using the sequence features is probably off the table (and should be, because this makes generalizing across 16S and metagenomics more difficult, or even across methods within a given sequencing technology). I think the most general thing we could do here is expose the same feature projections used for feature metadata clade collapsing. |
|
...it will be more robust than the feature metadata clade collapsing. LCA does not work well for this, and getting the nesting of taxonomy ranks correct on placement can get tedious |
So it seems like tax2tree differs from LCA in that tax2tree (feel free to confirm/deny):
I think this raises some important points:
|
It computes an f-measure based on the observed names which descend relative to the full tree. It can replicate names if needed, as is necessary for polyphyletic groups like Clostridia, or can just place a name singularly based on the maximum f-score. Backfilling is different from labeling unlabeled descendants. An unlabeled descendant's lineage is based on the observed taxa names in the path from tip -> root. Importantly, the re-labeling may change the original descendant's lineage, and this is a good thing as taxonomy != phylogeny and, particularly for reference databases, the lineages applied to input records may be incorrect. Backfilling is used to recover gaps that may arise. For example, if you have an internal node labeled "c__Clostridia", and between it and the root, there is "d__Bacteria" but no phylum name, then we have a gap in the taxonomy. It does not make sense to have a domain and class name without a phylum name. The input lineage information can be used to reconcile this, assuming the input taxonomy is rational. In this example, we can safely infer that "p__Firmicutes" exists in that path as "c__Clostridia" are nested within "p__Firmicutes" (...unless the input taxonomy suggests otherwise...). However, we cannot determine what the correct node for "p__Firmicutes" is; as such, the most conservative placement is chosen, which is the node already containing the "c__Clostridia" label. |
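To make the two ideas in the comment above concrete, here is a rough Python sketch. It is not tax2tree's implementation. It assumes precision is the fraction of a node's tips that carry the name and recall is the fraction of all tips carrying the name that fall under the node; it also assumes the input taxonomy is available as a child-to-parent name mapping. The function names `f_measure` and `backfill` are hypothetical.

```python
from typing import Dict, List, Set


def f_measure(name_tips: Set[str], node_tips: Set[str]) -> float:
    """Score how well a node represents a taxon name.

    name_tips: every tip in the full tree whose lineage contains the name.
    node_tips: every tip descending from the candidate node.
    (Assumed definitions of precision and recall; see lead-in.)
    """
    hits = len(name_tips & node_tips)
    if hits == 0:
        return 0.0
    precision = hits / len(node_tips)
    recall = hits / len(name_tips)
    return 2 * precision * recall / (precision + recall)


def backfill(
    path_labels: List[List[str]], parent_of: Dict[str, str]
) -> List[List[str]]:
    """Fill taxonomy gaps along one root -> node path.

    path_labels: names placed on each node of the path, root first.
    parent_of: child name -> parent name, taken from the input taxonomy.
    A missing parent is placed conservatively on the same node as its
    child, immediately before it.
    """
    filled = [list(labels) for labels in path_labels]
    seen: Set[str] = set()
    for labels in filled:
        i = 0
        while i < len(labels):
            parent = parent_of.get(labels[i])
            if parent is not None and parent not in seen and parent not in labels:
                # Gap found: insert the parent here, then re-check it
                # for gaps of its own before moving on.
                labels.insert(i, parent)
            else:
                seen.add(labels[i])
                i += 1
    return filled


parent_of = {"p__Firmicutes": "d__Bacteria", "c__Clostridia": "p__Firmicutes"}
path = [["d__Bacteria"], ["c__Clostridia"]]
print(backfill(path, parent_of))
```

In the example, the phylum is missing between the domain and the class, so `p__Firmicutes` is placed on the node that already carries `c__Clostridia`, matching the conservative placement described in the comment.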
Yes, the feature metadata input to Empress can refer to internal nodes or to tips. More frequently it refers exclusively to tips. Line 214 in d0a46ed
Good call. An implementation of any solution to this problem should account for existing metadata and only offer this "convenience" method when there is no internal node metadata. For example, you can picture a situation where the "bare" internal node view is shown to the user with an option to "infer metadata from descendants". Clicking on a control like that should then infer the metadata, style the resulting values in a different color, and show a warning that explains why they might want to exercise caution. Thanks for chiming in everyone, this is very helpful! 🌳 |
Just popping in (agreed with @ElDeveloper, this is an awesome discussion :D) --
The default in EMPress' feature metadata coloring / clade collapsing is only respecting the feature metadata provided for the tips. It's possible to color by internal nodes' feature metadata, but doing this turns off the "propagation" of shared feature metadata up the tree, ensuring that conflicts are handled explicitly. Default (don't use internal node feature metadata, and do "propagation"): Allow coloring by internal node feature metadata, but disable "propagation": Whatever solution(s) we end up going with for this, I agree that we should liberally show warnings that inferring things in this way is just an approximation and not the ground truth. As a sidenote: this discussion brings up the mildly wonky point that, currently, EMPress treats each feature metadata field (including the various levels of taxonomy) as its own independent thing, ignoring other metadata fields. This means that, for example, if you color by Addressing this would definitely be possible, by for example representing the values in each |
Right, |
It can also be a problem with "real" names, unfortunately -- @lisa55asil brought this up in the context of Qurro a while back, there's fun stuff like P. gingivalis and H. gingivalis... |
Yes, you definitely want to use full taxonomy strings (or equivalent) in
this scenario!
Well-defined taxonomies, like the NCBI taxonomy, have identifiers
assigned to each unique taxon level name that are probably what you want
to use for this purpose. Having the capacity to handle an explicit
external taxonomy in this way will probably enable all sorts of other
useful applications.
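A small Python sketch of the "full taxonomy string" point, under the assumption that lineages are semicolon-delimited strings. Each level is keyed by the whole path leading to it, so identical names under different parents stay distinct. The helper name `lineage_keys` and the shortened two-rank lineages are illustrative only; identifiers from an external taxonomy such as NCBI could serve as keys in the same way.

```python
from typing import List, Tuple


def lineage_keys(lineage: str, sep: str = ";") -> List[Tuple[str, ...]]:
    """Return one key per taxonomic level, each holding the full path so far."""
    levels = [part.strip() for part in lineage.split(sep) if part.strip()]
    return [tuple(levels[: i + 1]) for i in range(len(levels))]


# Shortened, illustrative lineages: same species epithet, different genera.
a = lineage_keys("g__Porphyromonas; s__gingivalis")
b = lineage_keys("g__Halicephalobus; s__gingivalis")

print(a[-1][-1] == b[-1][-1])  # the bare species names collide
print(a[-1] == b[-1])          # the full-path keys do not
```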
|
The species names should use genus / species to account for these scenarios. It should not be a problem for other portions of the taxonomy, unless the taxonomy is malformed. It would be crazy for c__Clostridia to associate with p__Firmicutes and p__Bacteroidetes, for example. tax2tree tests and requires that the input taxonomy is actually a tree, so this scenario should already be protected against.
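One way such a hierarchy check could look, as a hedged Python sketch. This is not tax2tree's actual verification code. It assumes each lineage is a list of names ordered from highest to lowest rank, and the function name `find_conflicting_parents` is hypothetical.

```python
from typing import Dict, Iterable, List, Set


def find_conflicting_parents(lineages: Iterable[List[str]]) -> Dict[str, Set[str]]:
    """Return every name that appears under more than one parent.

    An empty result means the taxonomy is a tree in the sense that each
    name has at most one parent name.
    """
    parents: Dict[str, Set[str]] = {}
    for lineage in lineages:
        for parent, child in zip(lineage, lineage[1:]):
            parents.setdefault(child, set()).add(parent)
    return {name: found for name, found in parents.items() if len(found) > 1}


malformed = [
    ["p__Firmicutes", "c__Clostridia"],
    ["p__Bacteroidetes", "c__Clostridia"],
]
print(find_conflicting_parents(malformed))
```

A check like this would also flag a bare `s__gingivalis` listed under two genera, which is why species names need to be genus-qualified before validation.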
...sorry, it's been a few years since looking at the code, the verification that the taxonomy is hierarchical may come from |
It would be nice to have empress automatically label internal nodes with a taxonomy if a user provided a taxonomy .qza file. I thought we had this feature already implemented, but after a discussion with Imran, I realized this is not currently implemented.