use alternative clustering package in sourmash plot, to support larger data sets #274

Open
Quicken-up opened this issue Jun 5, 2017 · 6 comments

Comments

@Quicken-up

I have run sourmash compare on 2683 signature files each corresponding to a single bin from a large metagenomic dataset. When I then try to plot the output using sourmash plot --labels cmp, I get the error below. Any suggestions on fixing this?

Traceback (most recent call last):
  File "/exports/cmvm/eddie/eb/groups/watson_grp/software/myanaconda/sourmash_env/bin/sourmash", line 11, in <module>
    load_entry_point('sourmash==2.0.0a1', 'console_scripts', 'sourmash')()
  File "/exports/cmvm/eddie/eb/groups/watson_grp/software/myanaconda/sourmash_env/lib/python3.5/site-packages/sourmash_lib/main.py", line 60, in main
    cmd(sys.argv[2:])
  File "/exports/cmvm/eddie/eb/groups/watson_grp/software/myanaconda/sourmash_env/lib/python3.5/site-packages/sourmash_lib/commands.py", line 395, in plot
    Z1 = sch.dendrogram(Y, orientation='right', labels=labeltext)
  File "/exports/cmvm/eddie/eb/groups/watson_grp/software/myanaconda/sourmash_env/lib/python3.5/site-packages/scipy/cluster/hierarchy.py", line 2365, in dendrogram
    above_threshold_color=above_threshold_color)
  File "/exports/cmvm/eddie/eb/groups/watson_grp/software/myanaconda/sourmash_env/lib/python3.5/site-packages/scipy/cluster/hierarchy.py", line 2651, in _dendrogram_calculate_info
    above_threshold_color=above_threshold_color)

<Line 2651 error repeated many times>

  File "/exports/cmvm/eddie/eb/groups/watson_grp/software/myanaconda/sourmash_env/lib/python3.5/site-packages/scipy/cluster/hierarchy.py", line 2618, in _dendrogram_calculate_info
    above_threshold_color=above_threshold_color)
  File "/exports/cmvm/eddie/eb/groups/watson_grp/software/myanaconda/sourmash_env/lib/python3.5/site-packages/scipy/cluster/hierarchy.py", line 2530, in _dendrogram_calculate_info
    leaf_label_func, i, labels)
  File "/exports/cmvm/eddie/eb/groups/watson_grp/software/myanaconda/sourmash_env/lib/python3.5/site-packages/scipy/cluster/hierarchy.py", line 2387, in _append_singleton_leaf_node
    lvs.append(int(i))
RecursionError: maximum recursion depth exceeded while calling a Python object

@ctb
Contributor

ctb commented Jun 5, 2017 via email

@taylorreiter
Contributor

From @luizirber, when I ran into this problem:
you can change the recursion depth limit with sys.setrecursionlimit: https://docs.python.org/3/library/sys.html#sys.setrecursionlimit
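
A minimal sketch of that workaround, for anyone landing here. The `depth` function is a hypothetical stand-in for the recursive dendrogram walk in the traceback, and the limit of 20000 is an arbitrary assumption, not a value recommended by sourmash or SciPy. Raising the limit only moves the ceiling; a very deep recursion can still exhaust the C stack.

```python
import sys
from contextlib import contextmanager


@contextmanager
def recursion_limit(limit):
    """Temporarily raise (or lower) Python's recursion limit."""
    old = sys.getrecursionlimit()
    sys.setrecursionlimit(limit)
    try:
        yield
    finally:
        # Always restore the previous limit, even if the body raised.
        sys.setrecursionlimit(old)


def depth(n):
    """Recurse n levels deep (hypothetical stand-in for the dendrogram walk)."""
    return 0 if n == 0 else 1 + depth(n - 1)


# With the default limit (usually 1000) depth(5000) raises RecursionError;
# inside the context manager it completes.
with recursion_limit(20_000):
    print(depth(5_000))
```

In practice the plotting call would go inside the `with` block in place of `depth(...)`.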

I did what @ctb suggested and output the matrix as a csv and used R to make a dendrogram without the heatmap.

Quick and dirty R code:

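install.packages("fastcluster")  # only needed once
library(fastcluster)             # provides a faster drop-in hclust()
# comparison matrix exported from sourmash as CSV
compk4 <- read.csv("Oe6_scaffolds_k4.comp.csv")
rownames(compk4) <- colnames(compk4)
# centroid linkage, kept for comparison (not plotted below)
cluster_compk4 <- hclust(dist(compk4), "centroid")
# default (complete) linkage on distances between matrix rows
compk4_clusters <- hclust(dist(compk4))
dend <- as.dendrogram(compk4_clusters)
plot(dend)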

@luizirber
Member

luizirber commented Jun 5, 2017 via email

@ctb
Contributor

ctb commented Jun 5, 2017 via email

@Quicken-up
Author

Thanks for the suggestions. I will try the csv option, and also try running compare on a reduced subset of good bins. Should it run OK on 200 bins? 500? Is there a good reason to use recursion in the code rather than iteration? That seems to be the root of the problem.
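
To illustrate the recursion-versus-iteration question, here is a rough sketch of walking a clustering tree with an explicit stack. It is not SciPy's code. It only assumes the SciPy linkage convention, where row `i` merges two cluster ids into a new cluster with id `n_leaves + i`. A badly unbalanced tree over a few thousand leaves is about as deep as it is wide, which is where a recursive walk runs into the limit.

```python
def leaf_order(linkage, n_leaves):
    """Return leaf ids left to right, using a stack instead of recursion.

    `linkage` is a sequence of rows whose first two entries are the ids of
    the merged clusters (SciPy convention: ids < n_leaves are leaves,
    id n_leaves + i is the cluster created by row i).
    """
    order = []
    stack = [n_leaves + len(linkage) - 1]  # root is the last merge
    while stack:
        node = stack.pop()
        if node < n_leaves:
            order.append(node)
        else:
            left, right = linkage[node - n_leaves][:2]
            # Push right first so the left subtree is visited first.
            stack.append(int(right))
            stack.append(int(left))
    return order


# Worst case: a "chain" tree where every merge adds one leaf.
n = 5000
chain = [(0, 1)] + [(n + i - 1, i + 1) for i in range(1, n - 1)]
order = leaf_order(chain, n)
```

The stack lives on the heap, so the depth of the tree no longer matters to the interpreter.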

@ctb
Contributor

ctb commented Jun 6, 2017 via email

@ctb changed the title from "sourmash plot RecursionError" to "use alternative clustering package in sourmash plot, to support larger data sets" on Jul 3, 2020
bluegenes added a commit to sourmash-bio/sourmash_plugin_branchwater that referenced this issue Feb 27, 2024
This PR adds a new command, `cluster`, that can be used to cluster the output from `pairwise` and `multisearch`.

`cluster` uses `rustworkx-core` (which internally uses `petgraph`) to build a graph, adding an edge between two nodes when their similarity exceeds the user-defined threshold. It can work on any of the similarity columns output by `pairwise` or `multisearch`, and adds all nodes to the graph to preserve singleton 'clusters' in the output.

`cluster` outputs two files: 
1. cluster identities file: `Component_X, name1;name2;name3...`
2. cluster size histogram `cluster_size, count`
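
The thresholded connected-components idea can be sketched in a few lines of Python. This is an illustration only, not the plugin's Rust implementation: it swaps in a union-find structure for the `rustworkx-core` graph, the function names are hypothetical, the strict `>` comparison mirrors the word "exceeds" above, and numbering components from 1 is an assumption.

```python
from collections import Counter, defaultdict


def cluster(pairs, names, threshold):
    """Group names into connected components.

    pairs: iterable of (name_a, name_b, similarity) rows.
    names: every name, so unconnected ones survive as singletons.
    """
    parent = {name: name for name in names}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b, sim in pairs:
        if sim > threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    groups = defaultdict(list)
    for name in names:
        groups[find(name)].append(name)
    # Largest components first; ties keep input order.
    return sorted(groups.values(), key=len, reverse=True)


def component_lines(components):
    """Rows shaped like the cluster identities file."""
    return [f"Component_{i},{';'.join(c)}" for i, c in enumerate(components, 1)]


def size_histogram(components):
    """(cluster_size, count) rows, smallest size first."""
    return sorted(Counter(len(c) for c in components).items())


pairs = [("a", "b", 0.9), ("b", "c", 0.8), ("c", "d", 0.1)]
components = cluster(pairs, ["a", "b", "c", "d", "e"], threshold=0.5)
```

Here `d` and `e` end up as singleton components because no edge above the threshold touches them.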

context for some things I tried:
- try using petgraph directly and removing rustworkx dependency
> nope, `rustworkx-core` adds `connected_components`, which returns the connected components themselves rather than just their number. Could reimplement if `rustworkx-core` brings in a lot of deps
- try using 'extend_with_edges' instead of add_edge logic.
> nope, only in `petgraph`

**Punted Issues:**
- develop clustering visualizations (ref @mr-eyes kSpider/dbretina work). Optionally output dot file of graph? (#248)
- enable updating clusters, rather than always regenerating from scratch (#249)
- benchmark `cluster` (#247)
> `pairwise` files can be millions of lines long. Would it be faster to read them in parallel, store them in an `edges` vector, and then add nodes/edges sequentially? Note that we would probably need to either (1) store all edges, including those that do not pass the threshold, or (2) after building the graph from edges, add nodes from `names_to_node` that are not already in the graph to preserve singletons.
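
A rough Python sketch of option 2 from the benchmarking note, to make the shape of the idea concrete. Everything here is an assumption for illustration: the three-column `name_a,name_b,similarity` row layout, the function names, and the chunk size. Threads in CPython will not speed up CPU-bound parsing, so this shows the structure (parse chunks independently, merge, then add singletons), not an expected speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def parse_chunk(lines, threshold):
    """Return (all names seen, edges passing the threshold) for one chunk."""
    names, edges = set(), []
    for line in lines:
        a, b, sim = line.rstrip("\n").split(",")
        names.update((a, b))
        if float(sim) > threshold:
            edges.append((a, b))
    return names, edges


def load_edges(lines, threshold, chunk_size=100_000, workers=4):
    """Parse chunks in a thread pool, then merge the results in order."""
    chunks = [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]
    all_names, all_edges = set(), []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda chunk: parse_chunk(chunk, threshold), chunks)
        for names, edges in results:
            all_names |= names
            all_edges.extend(edges)
    return all_names, all_edges


def build_graph(names, edges):
    """Add edges sequentially, then any names not yet present (singletons)."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    for name in names:
        graph.setdefault(name, set())
    return graph


rows = ["a,b,0.9", "b,c,0.2", "d,e,0.7"]
names, edges = load_edges(rows, threshold=0.5, chunk_size=2)
graph = build_graph(names, edges)
```

Only edges that pass the threshold are kept in memory; names are tracked separately so `c` still appears in the graph with no neighbors.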


Related issues:

* #219
* sourmash-bio/sourmash#2271
* sourmash-bio/sourmash#700
* sourmash-bio/sourmash#225
* sourmash-bio/sourmash#274


---------

Co-authored-by: C. Titus Brown <[email protected]>