# use alternative clustering package in sourmash plot, to support larger data sets #274
I believe @taylorreiter has run into this with large data sets. There's no
good, simple solution AFAIK; the scipy cluster hierarchy code just doesn't
like so many samples! A solution might be to output the comparison matrix with
the `--csv` output for `compare` in the latest sourmash master and import it
into a program that handles large clusters better.
(This problem is serious and is part of the motivation behind issues:
#256
#225
but we don't have a solution yet!)
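To make the CSV route concrete, here is a stdlib-only Python sketch of parsing a compare-style similarity CSV and converting it to distances for an external clustering tool. This is an illustration only: the exact layout of `sourmash compare --csv` output may differ from the assumed header-row-of-names format, and the function names are invented here.

```python
# Hypothetical sketch: parse a compare-style CSV (header row of sample
# names, then a square matrix of Jaccard similarities) and convert the
# similarities to distances for use with another clustering library.
import csv
import io

def load_compare_csv(text):
    """Parse a compare-style CSV into (names, similarity matrix)."""
    rows = list(csv.reader(io.StringIO(text)))
    names = rows[0]
    matrix = [[float(x) for x in row] for row in rows[1:]]
    return names, matrix

def to_distance(matrix):
    """Convert similarities (1.0 = identical) to distances (0.0 = identical)."""
    return [[1.0 - v for v in row] for row in matrix]

example = "a,b\n1.0,0.25\n0.25,1.0\n"
names, sim = load_compare_csv(example)
dist = to_distance(sim)
print(dist)  # [[0.0, 0.75], [0.75, 0.0]]
```

From here the distance matrix can be fed into whatever hierarchical clustering tool handles the sample count best.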
---
from @luizirber, when I ran into this problem: I did what @ctb suggested, output the matrix as a CSV, and used R to make a dendrogram without the heatmap. Quick and dirty R code:
---
Maybe it's time to add a dependency on [fastcluster][0] and change the plot
code to use it? (It's the same library @taylorreiter suggested in her R
solution)
[0]: https://pypi.python.org/pypi/fastcluster
---
sounds like something worth exploring for sure! concerned about adding
more dependencies tho.
---
Thanks for the suggestions. I will try the CSV option, and also try running compare on a reduced subset of good bins. Should it run OK on 200 bins? 500? Is there a good reason to use recursion in the code rather than iteration? That seems to be the root of the problem.

---
We're using the scipy.cluster.hierarchy package as a black box, so you'd
have to ask them why recursion :). I've personally plotted 300x300 on my
laptop without any trouble.
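As a stopgap that is sometimes suggested for recursive scipy code (not an official fix), you can raise Python's recursion limit before calling `sch.dendrogram`. The stdlib-only demo below shows the mechanism with a deliberately deep recursive call standing in for the dendrogram tree walk.

```python
# Quick workaround demo: a recursive walk that exceeds Python's default
# recursion limit (usually 1000) succeeds once the limit is raised.
# With sourmash plot, the analogous move would be calling
# sys.setrecursionlimit(...) before the dendrogram is drawn.
import sys

def depth(n):
    """Recurse n levels deep, mimicking a recursive tree traversal."""
    if n == 0:
        return 0
    return 1 + depth(n - 1)

old_limit = sys.getrecursionlimit()
try:
    sys.setrecursionlimit(20000)   # default is usually 1000
    result = depth(5000)           # raises RecursionError at the default limit
finally:
    sys.setrecursionlimit(old_limit)

print(result)  # 5000
```

Raising the limit trades the `RecursionError` for more stack usage, so it only buys headroom; an iterative implementation (or a different clustering library) is the real fix.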
The question of which approach to include in the sourmash package itself is
an interesting one - so far we've chosen something that is straightforward to
install and community supported, but we haven't put a lot of thought into it
(or at least I haven't). Your experience is a valuable argument for
choosing a more scalable default.
But hey, at least we finally let you export CSV!
|
This PR adds a new command, `cluster`, that can be used to cluster the output from `pairwise` and `multisearch`. `cluster` uses `rustworkx-core` (which internally uses `petgraph`) to build a graph, adding edges between nodes when the similarity exceeds the user-defined threshold. It can work on any of the similarity columns output by `pairwise` or `multisearch`, and will add all nodes to the graph to preserve singleton 'clusters' in the output.

`cluster` outputs two files:

1. cluster identities file: `Component_X, name1;name2;name3...`
2. cluster size histogram: `cluster_size, count`

Context for some things I tried:

- try using `petgraph` directly and removing the `rustworkx` dependency
  > nope, `rustworkx-core` adds `connected_components`, which returns the connected components themselves rather than just the number of connected components. Could reimplement if `rustworkx-core` brings in a lot of deps
- try using `extend_with_edges` instead of the `add_edge` logic
  > nope, only in `petgraph`

**Punted issues:**

- develop clustering visualizations (ref @mr-eyes kSpider/dbretina work). Optionally output a dot file of the graph? (#248)
- enable updating clusters, rather than always regenerating from scratch (#249)
- benchmark `cluster` (#247)
  > `pairwise` files can be millions of lines long. Would it be faster to read them in parallel, store them in an `edges` vector, and then add nodes/edges sequentially? Note that we would probably need to either 1. store all edges (including those that do not pass the threshold) or 2. after building the graph from edges, add nodes from `names_to_node` that are not already in the graph, to preserve singletons.

Related issues:

* #219
* sourmash-bio/sourmash#2271
* sourmash-bio/sourmash#700
* sourmash-bio/sourmash#225
* sourmash-bio/sourmash#274

---------

Co-authored-by: C. Titus Brown <[email protected]>
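The threshold-then-connected-components logic the PR describes can be sketched in a few lines of Python using a union-find structure. This is an illustration of the algorithm, not the actual Rust/`rustworkx-core` implementation, and the `records` input format here is hypothetical.

```python
# Sketch of the clustering approach: register every sample as a node so
# singletons survive, union nodes whose similarity passes the threshold,
# then read off the connected components.

def cluster(records, threshold):
    """records: iterable of (name1, name2, similarity). Returns sorted components."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    for a, b, sim in records:
        find(a)
        find(b)                 # register both nodes so singletons are preserved
        if sim >= threshold:
            union(a, b)

    components = {}
    for node in parent:
        components.setdefault(find(node), set()).add(node)
    return sorted(sorted(c) for c in components.values())

records = [("s1", "s2", 0.9), ("s2", "s3", 0.2), ("s3", "s4", 0.95)]
print(cluster(records, 0.8))  # [['s1', 's2'], ['s3', 's4']]
```

Note how the `s2`-`s3` pair below the threshold contributes nodes but no edge, which is exactly the singleton-preserving behavior the PR calls out.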
I have run `sourmash compare` on 2683 signature files, each corresponding to a single bin from a large metagenomic dataset. When I then try to plot the output using `sourmash plot --labels cmp`, I get the error below. Any suggestions on fixing this?
```
Traceback (most recent call last):
  File "/exports/cmvm/eddie/eb/groups/watson_grp/software/myanaconda/sourmash_env/bin/sourmash", line 11, in <module>
    load_entry_point('sourmash==2.0.0a1', 'console_scripts', 'sourmash')()
  File "/exports/cmvm/eddie/eb/groups/watson_grp/software/myanaconda/sourmash_env/lib/python3.5/site-packages/sourmash_lib/__main__.py", line 60, in main
    cmd(sys.argv[2:])
  File "/exports/cmvm/eddie/eb/groups/watson_grp/software/myanaconda/sourmash_env/lib/python3.5/site-packages/sourmash_lib/commands.py", line 395, in plot
    Z1 = sch.dendrogram(Y, orientation='right', labels=labeltext)
  File "/exports/cmvm/eddie/eb/groups/watson_grp/software/myanaconda/sourmash_env/lib/python3.5/site-packages/scipy/cluster/hierarchy.py", line 2365, in dendrogram
    above_threshold_color=above_threshold_color)
  File "/exports/cmvm/eddie/eb/groups/watson_grp/software/myanaconda/sourmash_env/lib/python3.5/site-packages/scipy/cluster/hierarchy.py", line 2651, in _dendrogram_calculate_info
    above_threshold_color=above_threshold_color)
  <Line 2651 error repeated many times>
  File "/exports/cmvm/eddie/eb/groups/watson_grp/software/myanaconda/sourmash_env/lib/python3.5/site-packages/scipy/cluster/hierarchy.py", line 2618, in _dendrogram_calculate_info
    above_threshold_color=above_threshold_color)
  File "/exports/cmvm/eddie/eb/groups/watson_grp/software/myanaconda/sourmash_env/lib/python3.5/site-packages/scipy/cluster/hierarchy.py", line 2530, in _dendrogram_calculate_info
    leaf_label_func, i, labels)
  File "/exports/cmvm/eddie/eb/groups/watson_grp/software/myanaconda/sourmash_env/lib/python3.5/site-packages/scipy/cluster/hierarchy.py", line 2387, in _append_singleton_leaf_node
    lvs.append(int(i))
RecursionError: maximum recursion depth exceeded while calling a Python object
```