
community_detection for millions of sentences really slow #1654

Closed
sirCamp opened this issue Jul 28, 2022 · 2 comments · Fixed by #2381

Comments

@sirCamp

sirCamp commented Jul 28, 2022

Hi everyone!

I'm trying to create clusters via community_detection; in total I have more than 10M sentences.
The embedding/encoding generation is pretty fast and works like a charm.

The problem is the community_detection clustering step.
Here, the process (even with a larger batch_size) is really slow.
I'm trying to figure out a way to speed the process up and to make it fit in RAM.

Do you have any suggestions, guidelines, or solutions?

@nreimers
Member

Clustering 10M embeddings is hard, as it computes 10M * 10M (on the order of 10^14) pairwise scores.

I would reduce the vector space: e.g. first run k-means (e.g. with faiss) to break the vector space into 10-100 smaller spaces, then cluster each space individually. A rough sketch of this approach is below.
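For illustration, here is a minimal sketch of this two-stage approach (not a built-in API; the function name `partitioned_community_detection`, the default partition count of 50, and the faiss dependency are all assumptions). Note it is an approximation: two sentences that land in different k-means regions can never end up in the same community.

```python
import faiss
import numpy as np
import torch
from sentence_transformers import util


def partitioned_community_detection(embeddings, n_partitions=50,
                                    threshold=0.75, min_community_size=10):
    """Coarse k-means split, then community_detection inside each partition."""
    embeddings = np.ascontiguousarray(embeddings, dtype=np.float32)
    # Normalize so that L2-based k-means behaves like a cosine partitioning.
    embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

    # Step 1: break the vector space into n_partitions smaller regions.
    kmeans = faiss.Kmeans(embeddings.shape[1], n_partitions, niter=20)
    kmeans.train(embeddings)
    _, assignments = kmeans.index.search(embeddings, 1)
    assignments = assignments.ravel()

    # Step 2: cluster each region individually, so each pairwise score
    # matrix is roughly (n / n_partitions)^2 instead of n^2.
    communities = []
    for p in range(n_partitions):
        local_idx = np.where(assignments == p)[0]
        if len(local_idx) < min_community_size:
            continue
        part = torch.from_numpy(embeddings[local_idx])
        for community in util.community_detection(
                part, threshold=threshold,
                min_community_size=min_community_size):
            # Map partition-local indices back to global sentence indices.
            communities.append([int(local_idx[i]) for i in community])
    return communities
```

The partition count trades speed against recall: more partitions mean smaller per-region score matrices, but a higher chance of splitting a true community across a boundary.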

@tomaarsen
Collaborator

Hello!

#2381 should improve the efficiency of calling community_detection on GPU. It will be included in the next release; hopefully that helps with your issue. Beyond that, Nils' suggestion is a good one as well: even with the faster implementation, community_detection still has to compute 10M * 10M scores, which will always be somewhat slow.
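For reference, a minimal usage sketch (the model name and parameter values here are only examples): encoding with convert_to_tensor=True keeps the embeddings on the GPU, and community_detection then computes its similarity scores on that same device.

```python
import torch
from sentence_transformers import SentenceTransformer, util

# Example model; any SentenceTransformer model works.
# Falls back to CPU if no CUDA device is available.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = SentenceTransformer("all-MiniLM-L6-v2", device=device)

sentences = ["A first sentence.", "A second sentence."]  # ... your 10M sentences
embeddings = model.encode(sentences, convert_to_tensor=True,
                          show_progress_bar=True)  # tensor stays on `device`
clusters = util.community_detection(embeddings, threshold=0.75,
                                    min_community_size=10, batch_size=2048)
```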

  • Tom Aarsen
