# use alternative clustering package in sourmash plot, to support larger data sets #274
I believe @taylorreiter has run into this with large data sets. There's no
good, simple solution AFAIK; the scipy cluster hierarchy code just doesn't
like so many samples! A solution might be to output the comparison matrix with
the `--csv` output for `compare` in the latest sourmash master and import it
into a program that handles large clusters better.
(This problem is serious and is part of the motivation behind issues:
#256
#225
but we don't have a solution yet!)
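To make the CSV route concrete, here is a stdlib-only Python sketch of parsing a compare-style similarity CSV and converting it to distances for an external clustering tool. This is an illustration only: the exact layout of `sourmash compare --csv` output may differ from the assumed header-row-of-names format, and the function names are invented here.

```python
# Hypothetical sketch: parse a compare-style CSV (header row of sample
# names, then a square matrix of Jaccard similarities) and convert the
# similarities to distances for use with another clustering library.
import csv
import io

def load_compare_csv(text):
    """Parse a compare-style CSV into (names, similarity matrix)."""
    rows = list(csv.reader(io.StringIO(text)))
    names = rows[0]
    matrix = [[float(x) for x in row] for row in rows[1:]]
    return names, matrix

def to_distance(matrix):
    """Convert similarities (1.0 = identical) to distances (0.0 = identical)."""
    return [[1.0 - v for v in row] for row in matrix]

example = "a,b\n1.0,0.25\n0.25,1.0\n"
names, sim = load_compare_csv(example)
dist = to_distance(sim)
print(dist)  # [[0.0, 0.75], [0.75, 0.0]]
```

From here the distance matrix can be fed into whatever hierarchical clustering tool handles the sample count best.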
---
from @luizirber, when I ran into this problem: I did what @ctb suggested, output the matrix as a CSV, and used R to make a dendrogram without the heatmap. Quick and dirty R code:
---
Maybe it's time to add a dependency on [fastcluster][0] and change the plot
code to use it? (It's the same library @taylorreiter suggested in her R
solution)
[0]: https://pypi.python.org/pypi/fastcluster
---
sounds like something worth exploring for sure! concerned about adding
more dependencies tho.
---
Thanks for the suggestions. I will try the CSV option, and also try running compare on a reduced subset of good bins. Should it run OK on 200 bins? 500? Is there a good reason to use recursion in the code rather than iteration? That seems to be the root of the problem.

---
We're using the scipy.cluster.hierarchy package as a black box, so you'd
have to ask them why recursion :). I've personally plotted 300x300 on my
laptop without any trouble.
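As a stopgap that is sometimes suggested for recursive scipy code (not an official fix), you can raise Python's recursion limit before calling `sch.dendrogram`. The stdlib-only demo below shows the mechanism with a deliberately deep recursive call standing in for the dendrogram tree walk.

```python
# Quick workaround demo: a recursive walk that exceeds Python's default
# recursion limit (usually 1000) succeeds once the limit is raised.
# With sourmash plot, the analogous move would be calling
# sys.setrecursionlimit(...) before the dendrogram is drawn.
import sys

def depth(n):
    """Recurse n levels deep, mimicking a recursive tree traversal."""
    if n == 0:
        return 0
    return 1 + depth(n - 1)

old_limit = sys.getrecursionlimit()
try:
    sys.setrecursionlimit(20000)   # default is usually 1000
    result = depth(5000)           # raises RecursionError at the default limit
finally:
    sys.setrecursionlimit(old_limit)

print(result)  # 5000
```

Raising the limit trades the `RecursionError` for more stack usage, so it only buys headroom; an iterative implementation (or a different clustering library) is the real fix.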
The question of which approach to include in the sourmash package itself is
an interesting one - so far we've chosen something that is straightforward to
install and community supported, but we haven't put a lot of thought into it
(or at least I haven't). Your experience is a valuable argument for
choosing a more scalable default.
But hey, at least we finally let you export CSV!
|
This PR adds a new command, `cluster`, that can be used to cluster the output from `pairwise` and `multisearch`. `cluster` uses `rustworkx-core` (which internally uses `petgraph`) to build a graph, adding edges between nodes when the similarity exceeds the user-defined threshold. It can work on any of the similarity columns output by `pairwise` or `multisearch`, and will add all nodes to the graph to preserve singleton 'clusters' in the output.

`cluster` outputs two files:

1. cluster identities file: `Component_X, name1;name2;name3...`
2. cluster size histogram: `cluster_size, count`

Context for some things I tried:

- try using `petgraph` directly and removing the `rustworkx` dependency
  > nope, `rustworkx-core` adds `connected_components`, which returns the connected components themselves rather than just the number of connected components. Could reimplement if `rustworkx-core` brings in a lot of deps
- try using `extend_with_edges` instead of the `add_edge` logic
  > nope, only in `petgraph`

**Punted issues:**

- develop clustering visualizations (ref @mr-eyes kSpider/dbretina work). Optionally output a dot file of the graph? (#248)
- enable updating clusters, rather than always regenerating from scratch (#249)
- benchmark `cluster` (#247)
  > `pairwise` files can be millions of lines long. Would it be faster to read them in parallel, store them in an `edges` vector, and then add nodes/edges sequentially? Note that we would probably need to either 1. store all edges (including those that do not pass the threshold) or 2. after building the graph from edges, add nodes from `names_to_node` that are not already in the graph, to preserve singletons.

Related issues:

* #219
* sourmash-bio/sourmash#2271
* sourmash-bio/sourmash#700
* sourmash-bio/sourmash#225
* sourmash-bio/sourmash#274

---------

Co-authored-by: C. Titus Brown <[email protected]>
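The threshold-then-connected-components logic the PR describes can be sketched in a few lines of Python using a union-find structure. This is an illustration of the algorithm, not the actual Rust/`rustworkx-core` implementation, and the `records` input format here is hypothetical.

```python
# Sketch of the clustering approach: register every sample as a node so
# singletons survive, union nodes whose similarity passes the threshold,
# then read off the connected components.

def cluster(records, threshold):
    """records: iterable of (name1, name2, similarity). Returns sorted components."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    for a, b, sim in records:
        find(a)
        find(b)                 # register both nodes so singletons are preserved
        if sim >= threshold:
            union(a, b)

    components = {}
    for node in parent:
        components.setdefault(find(node), set()).add(node)
    return sorted(sorted(c) for c in components.values())

records = [("s1", "s2", 0.9), ("s2", "s3", 0.2), ("s3", "s4", 0.95)]
print(cluster(records, 0.8))  # [['s1', 's2'], ['s3', 's4']]
```

Note how the `s2`-`s3` pair below the threshold contributes nodes but no edge, which is exactly the singleton-preserving behavior the PR calls out.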
I have run `sourmash compare` on 2683 signature files, each corresponding to a single bin from a large metagenomic dataset. When I then try to plot the output using `sourmash plot --labels cmp`, I get the error below. Any suggestions on fixing this?
```
Traceback (most recent call last):
  File "/exports/cmvm/eddie/eb/groups/watson_grp/software/myanaconda/sourmash_env/bin/sourmash", line 11, in <module>
    load_entry_point('sourmash==2.0.0a1', 'console_scripts', 'sourmash')()
  File "/exports/cmvm/eddie/eb/groups/watson_grp/software/myanaconda/sourmash_env/lib/python3.5/site-packages/sourmash_lib/__main__.py", line 60, in main
    cmd(sys.argv[2:])
  File "/exports/cmvm/eddie/eb/groups/watson_grp/software/myanaconda/sourmash_env/lib/python3.5/site-packages/sourmash_lib/commands.py", line 395, in plot
    Z1 = sch.dendrogram(Y, orientation='right', labels=labeltext)
  File "/exports/cmvm/eddie/eb/groups/watson_grp/software/myanaconda/sourmash_env/lib/python3.5/site-packages/scipy/cluster/hierarchy.py", line 2365, in dendrogram
    above_threshold_color=above_threshold_color)
  File "/exports/cmvm/eddie/eb/groups/watson_grp/software/myanaconda/sourmash_env/lib/python3.5/site-packages/scipy/cluster/hierarchy.py", line 2651, in _dendrogram_calculate_info
    above_threshold_color=above_threshold_color)
  <Line 2651 error repeated many times>
  File "/exports/cmvm/eddie/eb/groups/watson_grp/software/myanaconda/sourmash_env/lib/python3.5/site-packages/scipy/cluster/hierarchy.py", line 2618, in _dendrogram_calculate_info
    above_threshold_color=above_threshold_color)
  File "/exports/cmvm/eddie/eb/groups/watson_grp/software/myanaconda/sourmash_env/lib/python3.5/site-packages/scipy/cluster/hierarchy.py", line 2530, in _dendrogram_calculate_info
    leaf_label_func, i, labels)
  File "/exports/cmvm/eddie/eb/groups/watson_grp/software/myanaconda/sourmash_env/lib/python3.5/site-packages/scipy/cluster/hierarchy.py", line 2387, in _append_singleton_leaf_node
    lvs.append(int(i))
RecursionError: maximum recursion depth exceeded while calling a Python object
```