Use ancestor information in taxonomy feature metadata (only computing ancestor info in the JS interface) #487

fedarko · 2021-02-17T05:03:36Z

This is an alternate version of #482 that avoids dramatically increasing the size of the taxonomy information stored in the QZV -- instead, each entry in the feature metadata is "expanded" as needed in the JS. This means that in the worst case (visualizing the lowest level of a taxonomy, e.g. "Level 7" / species) we just need to stitch back together the original taxonomy for every tip described in the feature metadata (this process could be made more efficient -- see comment added in a95b652).

This works by passing a list of the "split taxonomy columns", in descending rank order (i.e. Level 1, Level 2, ...) from the Python to the JS code, which (in certain contexts) uses this list to determine how much if any ancestor data to get for a given feature metadata column. Doing things in this way may make life easier in the future, since it'll let us name the taxonomy columns (after creating them) whatever we want, so we can do things like use 0-indexing / use a different term than Level / etc. if desired without breaking this functionality.

As with #482, this closes #473 and closes #422.

See docs enclosed for discussion vs. bool flag approach

Closes biocore#473 and closes biocore#422. Avoids having to store this stuff in the QZV, which will save a lot of space. This does still involve storing the "expanded" taxonomy for a given level all at once in memory, though -- I do not think this will be horrendous, since it is not that much extra data all things considered.

still gotta test (manually then automatically) tho...

not done yet

for sake of future consistency. really shouldn't change anything, behavior- or performance-wise

makes this slightly larger, i think

fedarko · 2021-02-17T05:28:09Z

One thing worth noting: the taxonomy information shown for biplot arrows is still stored as before, so now the two representations are different:

I don't think this is a big problem, esp. since the colors were already different between biplot arrows and EMPress' assigned colors (since biplot arrows are [usually] a subset of the tips in the tree), but it may be worth trying to address this in a future issue or something (maybe doing the unwieldy Python metadata operation in #482, but just for the biplot arrow taxonomies...?)

kwcantrell

Thanks again @fedarko! I definitely like concept of the PR, I just have a couple questions regarding performance.

kwcantrell · 2021-02-18T02:34:33Z

empress/support_files/js/empress.js

+            for (var i = 0; i < taxIdx; i++) {
+                var currTaxCol = this._splitTaxonomyColumns[i];
+                var currTaxColFMIdx = _.indexOf(
+                    this._featureMetadataColumns,
+                    currTaxCol
+                );
+                ancestorFMIndices.push(currTaxColFMIdx);
+            }
+            // We already know the index of the column we end at, so just put
+            // it here at the end manually. (Saving this extra work probably
+            // won't make an appreciable time difference, but it feels nice :)
+            ancestorFMIndices.push(fmIdx);
+            return function (fmRow) {
+                var totalFMVal = "";
+                _.each(ancestorFMIndices, function (ancestorFMIdx, ii) {
+                    // Separate adjacent levels in the resulting f.m. value
+                    // shown: e.g. "k__Bacteria; p__Cyanobacteria"
+                    if (ii > 0) {
+                        totalFMVal += "; ";
+                    }
+                    totalFMVal += fmRow[ancestorFMIdx];
+                });


This seems like a lot of wasted computation (i.e. worst case, for each node we have to iterate though the feautreMetadataColumn to find the location of each individual level, making this an O(n*L^2) where n = # nodes and L = # levels).
Since the taxonomy columns are in order, would we be able to just find the location of the first tax column and the location of the level we want to show? For example, the pseudo-code to find the Level 5 would look something like:
level1Idx = location of Level 1 in this._featureMetadata
fmIdx = location of Level 5 in this._featureMetadata
for ii=level1Idx; i <= fmIdx
----if (ii != level1Idx)
-------- totalFMVal += "; "
----totalFMVal += fmRow[ii]

worst case, for each node we have to iterate though the feautreMetadataColumn to find the location of each individual level, making this an O(n*L^2) where n = # nodes and L = # levels

I am not sure about this: the function returned here is part of a closure which includes ancestorFMIndices, and this returned function is what is called on every node (more precisely, for every row in the feature metadata being considered). So finding the location of the level columns in the feature metadata should only be done once*.

The only work that is repeated for each row in the feature metadata is the stuff in function (fmRow) { ... }, which only involves retrieving L specific elements from an array (and we already know the indices of these elements, so each retrieval is ~O(1) -- so the total is O(nL), I think?)

Since the taxonomy columns are in order, would we be able to just find the location of the first tax column and the location of the level we want to show?

Yeah, this would be doable. I think your point is more that the _.each() (as well as assembling ancestorFMIndices in the first place) is inefficient, compared to a basic for loop like the one you showed? I guess so -- the reason for doing things this way was to avoid relying on the assumption that split-up taxonomy columns (Level 1, Level 2, ...) will always occur one after another in the feature metadata columns. This is currently always the case, but I guess it could change in the future (hence this function being written in such a roundabout way).

If you'd like I can rewrite things so that the assumption of split-up levels being in order one after another is explicitly tested in Python, and relied on here in JS. Does this sound good?

* "Once" for every coloring operation using feature metadata, at least. Now that I think about it we could probably store an object mapping split taxonomy column names to indices in the feature metadata columns and avoid all the indexOfs, but I'm not convinced this will save much time for most datasets with a normal amount of taxonomy levels.

Yeah, this would be doable. I think your point is more that the _.each() (as well as assembling ancestorFMIndices in the first place) is inefficient, compared to a basic for loop like the one you showed? I guess so -- the reason for doing things this way was to avoid relying on the assumption that split-up taxonomy columns (Level 1, Level 2, ...) will always occur one after another in the feature metadata columns. This is currently always the case, but I guess it could change in the future (hence this function being written in such a roundabout way).

I see what you mean. Your current method is definitely future proof and relies on less assumptions which would make it easy for future developers to understand + make the code base more stable if we were to change things around. I wouldn't worry about changing anything unless we find out that this does greatly impact performance then we can start think about different ways to quickly locate the location of each level without having to iterate.

kwcantrell · 2021-02-18T02:37:17Z

empress/support_files/js/empress.js

+        // NOTE that, for drawing a barplot layer representing a taxonomy
+        // column, we are essentially doing the work here of "reassembling"
+        // the ancestor taxonomy with the child taxonomy info twice -- once
+        // when we call this.getUniqueFeatureMetadataInfo() above, and again
+        // as we go through this loop and look at each tip (this is done for
+        // each tip by the "retrieval function" mentioned below).
+        //
+        // It would be ideal to use the mapping information returned by
+        // this.getUniqueFeatureMetadataInfo() to avoid having to repeat this
+        // work, although this would likely require restructuring the rest
+        // of this function -- might be too much work for its own good.


@fedarko by chance did you profile this on a large tree? I worry that this might impact performance.

I haven't yet, but I agree that that is a well-founded concern. I guess checking this with the EMP tree used in the paper would make sense -- @ElDeveloper, do you have any other suggestions for good benchmark datasets?

Looking at the Fig. 1 notebook for the paper, it seems like the taxonomy is already split up in the EMP BIOM table -- I guess the proper way to benchmark this would be:

Merge the various taxonomy levels in the feature metadata file produced by the Fig. 1 notebook into a single Taxonomy column in a new feature metadata file, which EMPress' python code will then be able to automatically detect and split up

Run empress tree-plot -t [newick file] -fm [feature metadata file generated above] on two versions of EMPress:

The version of EMPress currently on the master branch

The version of EMPress in this PR

Then, for each approach, try using feature metadata coloring to color the tree by each level in the taxonomy (1, 2, 3, 4, 5, 6, 7), and benchmark how long each level takes in each visualization (a total of 14 measurements). The goal is that this PR's EMPress version shouldn't take dramatically longer, I guess.

That being said, for what it's worth - using the Fig. 1 EMPire plot QZV, I can already see that when coloring by Level 7 (species) the majority of tips are colored either the color of s__ or Unspecified. This PR will break these all up into separate categories based on their ancestral taxonomy, which will degrade performance due to adding more stuff to the legend / Colorer / etc., but I think this tradeoff is worth it. (Maybe we could add an option to the settings to disable this if users want?)

The EMP data should be good to test with.

Come to think about it. Do we even need to split the taxonomy in Python anymore? I seem to recall the only reason we did this was to show the feature metadata summaries in the Node Info Pop-up (IIRC). Can we do that on the fly or do we have any other use for the split levels?

Come to think about it. Do we even need to split the taxonomy in Python anymore? I seem to recall the only reason we did this was to show the feature metadata summaries in the Node Info Pop-up (IIRC). Can we do that on the fly or do we have any other use for the split levels?

I don't think the pop-up was the only reason we added splitting -- it just generally makes life easier. I suspect that doing splitting on the fly might be more trouble than it is worth. Currently the splitting operation in Python is done using pandas vectorization, and I'm not sure if doing this quickly + on the fly in JS will be feasible.

Pre-splitting the taxonomy, like we do currently, has the advantage that it stores the taxonomic levels in separate "columns" of the feature metadata, so reconstructing these is just a matter of gluing together some of these columns for each metadata row as needed -- to me it seems both conceptually easier and more efficient than having to split up the taxonomy on the fly, which seems harder (since we don't know in advance where the ; characters are for a given taxonomy string -- we could probably use some fancy heuristics to speed that up, but again might be more trouble than it's worth).

That said, there are two main benefits I see to moving more of this to JS. One is that not doing splitting in Python would allow us to avoid the zillion Unspecified string values being stored in QZVs -- however, this should be (mostly) addressed by #337. The other benefit I see is that adding "splitting" utilities to the JS will make it easier to let users control this behavior themselves in the app -- for example if their taxonomy column is named something weird and wasn't treated as an actual taxonomy column by the current Python code. I think it'd make sense to defer that sort of work to the future, since it will likely require some significant restructuring of the codebase.

Fair enough, that makes sense. Thanks for the explanation!

Benchmarking is done! Ran both this PR's version of EMPress and the master version of EMPress on the EMP EMPire plot QZV (Fig. 1A) from the paper. This was done using my laptop with 8 GB of RAM, so fancier computers will probably be much faster (it already takes its sweet time just loading the EMP dataset).

Both versions of EMPress were provided a modified EMP taxonomy file that was created by just merging all 7 columns (so that it would be handled specially as a "Taxonomy column" -- only really makes a difference for this PR), and were run using the circular layout and otherwise default settings as far as I can tell. As a disclaimer, some of these measurements were taken using my phone's stopwatch and some were done by just eyeballing my laptop clock, and I only took each measurement once (...so as with most things in life we would probably need a larger sample size).

Using the master branch of EMPress (as of writing)

Feature metadata coloring

Starting at Level 4, I started getting the standard "page unresponsive" dialog, which I either clicked "Wait" on or ignored.

Feature metadata column Time taken (seconds; approximate)

Level 1 4.73 seconds

Level 2 10.83 seconds

Level 3 22.44 seconds

Level 4 34 seconds

Level 5 42 seconds

Level 6 1 minute and 38 seconds

Level 7 52 seconds (???)

Level 7 screenshot

To prove that my computer didn't explode ;)

I suspect the reason coloring by Level 7 went comparatively quickly is that a ton of the tips here (145,171 / 317,314 [45.7%] of the entries in the EMP taxonomy file) have either s__ or Unspecified as their species-level classification; this is why the tree looks overwhelmingly either red (the color that got assigned for s__) or pale greyish (the color that got assigned for Unspecified). As mentioned earlier, although this does make drawing faster, it is not a good thing for interpretation (intuitively, the tree shouldn't look so homogenous -- all of these s__s span a really wide portion of the taxonomy, so they shouldn't all have the same color / label).

Using this PR's version of EMPress (up to date with the master branch as of writing)

Feature metadata coloring

The "page unresponsive" dialogs started coming up at Level 3 (rather than Level 4) here.

Feature metadata column Time taken (seconds; approximate) Extra time compared with master branch

Level 1 5 seconds 0.27 seconds

Level 2 14 seconds 3.17 seconds

Level 3 50 seconds 27.56 seconds

Level 4 1 minute and 40 seconds 1 minute and 6 seconds

Level 5 2 minutes and 6 seconds 1 minute and 24 seconds

Level 6 4 minutes and 37 seconds 2 minutes and 59 seconds

Level 7 5 minutes and 58 seconds 5 minutes and 6 seconds

Level 7 screenshot

The reason the tree has a lot of light blue is that this is the color assigned to the feature metadata string k__Bacteria; Unspecified; Unspecified; Unspecified; Unspecified; Unspecified; Unspecified (i.e. for ASVs where the original taxonomy string was just "k__Bacteria" -- there aren't any levels beyond "kingdom" given). What is nice about this PR is we can be more sure about this -- and also we can see flashes of other colors interspersed with the blue, since all of the s__ taxa are no longer colored identically.

Discussion

As expected, this PR slows down coloring operations using taxonomy columns, especially for more specific taxonomic levels. However, I'm happy that it at least works on my laptop with the full EMP dataset -- IMO this dataset is right around the upper limit of what EMPress can handle right now, at least including all of the extra data (sample and feature metadata, BIOM table, PCoA) that is being loaded alongside the bare tree structure.

I guess the key question is are these slowdowns worth it? It is not entirely my call, but I vote yes -- I think the interpretation / safety benefits (of avoiding the awkward situations described in #473) vastly outweigh the slowdowns (which should only be a big deal for really large datasets like this one?). I'd argue that although coloring by e.g. Level 7 is much faster in the master branch, the resulting visualization is not very useful.

There are lots of ways we can counteract the slowdowns -- #337 and #371 are both issues I have prototypes of, and I want to get to when I have some time. We could also maybe add an option to the settings that toggles this behavior, so that if users want the original behavior (faster coloring operations at the cost of collapsing certain unrelated taxonomies) they can get that.

...Anyway, please let me know what you think!

Thanks @fedarko for bench marking this! I like your idea of providing options to toggle between this method and the current method.

I think what we can do, at least for now, is set this method as our default. However, I think the majority of users would not think about going into the settings menu on their own to look for a Fast Taxonomy Method option which may lead to a lot of frustration if it takes 1+ minutes to color by taxonomy. So, it would be nice if we provide some sort of warning to users to let them it may take a while to process and that they can change to old faster method. Or better yet, in the feature/barplot panels have a Fast Taxonomy Method checkbox appear whenever a taxonomy column is selected (similar to how the continuous checkbox is handled). Essentially have some mechanism in place that makes it obvious that there is a quick and dirty method to color by taxonomy.

I guess the key question is are these slowdowns worth it?

My take is that they are worth it. This is the correct way to group features and color the tree. And also, this is the strategy used in QIIME2's taxonomy barplots. Having inconsistent ways to display data within a platform is something we should always try to avoid.

My opinion would be to not focus on adding a "fast taxonomy UI component" but instead to focus on fixing #371. Because that is likely the BEST solution to this problem. By default we color the top N most frequent groups and if the user wants they can change the value of N. N can be represented as a slider in the feature coloring tab.

Thanks for the feedback @kwcantrell and @ElDeveloper!

I agree that focusing on #371 might make sense, at least right now. Something that might complement #371 well is -- in Python -- figuring out the number of unique values in each feature / sample metadata field (or at least seeing if a given field has above a certain number of unique values), and passing these values to the HTML. Then, we can use these values to adjust the way the various coloring options' "defaults" show up in the UI.

I guess, less from a performance standpoint and more from a aesthetic one, some users might prefer to see just the values for a specific taxonomy column (e.g. only seeing phylum names without the k__ stuff unless they actually need it to disambiguate things) -- I have opened a new issue (#489) to discuss this, since I think accommodating this in a reasonable way will be nontrivial.

Is this PR ok to merge?

this looks good to me. I will go ahead an merge it. Quick note on #371, everything will need to be done in js since python will be unable to account for tree shearing.

…473-but-not-slow

ElDeveloper

Thanks @fedarko. @kwcantrell feel free to merge if everything looks good to you.

ElDeveloper · 2021-03-30T20:17:47Z

@fedarko, could you resolve conflicts? @kwcantrell can you take another look at this PR and merge if it looks good to you?

…473-but-not-slow

fedarko · 2021-03-30T21:30:17Z

Thanks, just resolved the conflict; PR should be ok to merge pending approval from @kwcantrell.

After barplot clicking, the next thing I have on my plate for EMPress is #371 (drawing only the top K most frequent values in a field) -- as Yoshiki suggested, I think that will work really well with this PR's changes.

emperor-helper · 2021-03-30T21:34:37Z

The following artifacts were built for this PR: empire-biplot.qzv, empire.qzv, empress-tree.qzv, just-fm.qzv, plain.qzv

ElDeveloper · 2021-03-31T17:20:17Z

Thanks so much @fedarko!

…

On (Mar-30-21|14:30), Marcus Fedarko wrote: Thanks, just resolved the conflict; PR should be ok to merge pending approval from @kwcantrell. After barplot clicking, the next thing I have on my plate for EMPress is #371 (drawing only the top K most frequent values in a field) -- as Yoshiki suggested, I think that will work really well with this PR's changes. -- You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub: https://urldefense.com/v3/__https://github.com/biocore/empress/pull/487*issuecomment-810590583__;Iw!!Mih3wA!QbAwxX-sKZ5BEx4jz39Guti_-OzcXaDrco44iIpdXgygHpK6VEcEgFTOnABfMAM$

ElDeveloper · 2021-04-08T02:02:15Z

@kwcantrell any chance you can look at this PR, it's pending your approval so feel free to merge after you approve. 👍

kwcantrell · 2021-04-08T15:57:18Z

Thanks @fedarko! The extra run time is not ideal but this method represents the correct way to display taxonomy and once #371 is merged in, the run time wont be that big of an issue.

ElDeveloper · 2021-04-08T16:07:49Z

Thanks so much @fedarko and @kwcantrell !

fedarko added 15 commits February 9, 2021 18:01

DOC: various match_inputs() comments/doc fixes

5993819

ENH: Pass ordered tax col names to Empress JS biocore#473

095281b

See docs enclosed for discussion vs. bool flag approach

STY: prettify

2b8fed0

MNT: abstract fm retrieval func generation

850f4c4

still gotta test (manually then automatically) tho...

TST: unbreak js tests

e98d7f0

TST: unbreak some of the python tests

c8ab49d

not done yet

TST: unbreak all python tests

9248137

TST: abstract python taxcol checking

32075f7

STY: fix flake8 long lines

3d1976d

STY: more flake8 fixes

e73c821

TST: test biocore#473 special fm stuff in JS

2248166

MNT: pass length fm retrieval through new funcs

805d3cc

for sake of future consistency. really shouldn't change anything, behavior- or performance-wise

BUG: Change max width to be in vw units, not vh

37448ca

makes this slightly larger, i think

PERF/DOC: Add note abt repeated work - fm barplots

a95b652

fedarko mentioned this pull request Feb 17, 2021

[DON'T MERGE THIS PLS] Include ancestor information in processed taxonomy feature metadata #482

Closed

ElDeveloper requested a review from kwcantrell February 17, 2021 19:31

kwcantrell reviewed Feb 18, 2021

View reviewed changes

Merge branch 'master' of https://github.com/biocore/empress into fix-…

1ffe0f1

…473-but-not-slow

fedarko mentioned this pull request Feb 26, 2021

Support alternate (smaller) display for taxonomy levels with ancestor information #489

Open

ElDeveloper approved these changes Mar 1, 2021

View reviewed changes

Merge branch 'master' of https://github.com/biocore/empress into fix-…

f6e4cc1

…473-but-not-slow

kwcantrell merged commit 5b8a340 into biocore:master Apr 8, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use ancestor information in taxonomy feature metadata (only computing ancestor info in the JS interface) #487

Use ancestor information in taxonomy feature metadata (only computing ancestor info in the JS interface) #487

fedarko commented Feb 17, 2021

fedarko commented Feb 17, 2021

kwcantrell left a comment

kwcantrell Feb 18, 2021

fedarko Feb 18, 2021 •

edited

Loading

kwcantrell Feb 22, 2021

kwcantrell Feb 18, 2021

fedarko Feb 18, 2021

ElDeveloper Feb 19, 2021

fedarko Feb 20, 2021

ElDeveloper Feb 22, 2021

fedarko Feb 26, 2021

kwcantrell Feb 26, 2021

ElDeveloper Feb 26, 2021

fedarko Feb 26, 2021

kwcantrell Apr 8, 2021

ElDeveloper left a comment

ElDeveloper commented Mar 30, 2021

fedarko commented Mar 30, 2021

emperor-helper commented Mar 30, 2021

ElDeveloper commented Mar 31, 2021 via email

ElDeveloper commented Apr 8, 2021

kwcantrell commented Apr 8, 2021

ElDeveloper commented Apr 8, 2021

Feature metadata column	Time taken (seconds; approximate)
Level 1	4.73 seconds
Level 2	10.83 seconds
Level 3	22.44 seconds
Level 4	34 seconds
Level 5	42 seconds
Level 6	1 minute and 38 seconds
Level 7	52 seconds (???)

Feature metadata column	Time taken (seconds; approximate)	Extra time compared with master branch
Level 1	5 seconds	0.27 seconds
Level 2	14 seconds	3.17 seconds
Level 3	50 seconds	27.56 seconds
Level 4	1 minute and 40 seconds	1 minute and 6 seconds
Level 5	2 minutes and 6 seconds	1 minute and 24 seconds
Level 6	4 minutes and 37 seconds	2 minutes and 59 seconds
Level 7	5 minutes and 58 seconds	5 minutes and 6 seconds

Use ancestor information in taxonomy feature metadata (only computing ancestor info in the JS interface) #487

Use ancestor information in taxonomy feature metadata (only computing ancestor info in the JS interface) #487

Conversation

fedarko commented Feb 17, 2021

fedarko commented Feb 17, 2021

kwcantrell left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

fedarko Feb 18, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Using the master branch of EMPress (as of writing)

Feature metadata coloring

Level 7 screenshot

Using this PR's version of EMPress (up to date with the master branch as of writing)

Feature metadata coloring

Level 7 screenshot

Discussion

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ElDeveloper left a comment

Choose a reason for hiding this comment

ElDeveloper commented Mar 30, 2021

fedarko commented Mar 30, 2021

emperor-helper commented Mar 30, 2021

ElDeveloper commented Mar 31, 2021 via email

ElDeveloper commented Apr 8, 2021

kwcantrell commented Apr 8, 2021

ElDeveloper commented Apr 8, 2021

fedarko Feb 18, 2021 •

edited

Loading