
community_detection for millions of sentences really slow #1654

Closed
sirCamp opened this issue Jul 28, 2022 · 2 comments · Fixed by #2381

Comments

@sirCamp

sirCamp commented Jul 28, 2022

Hi everyone!

I'm trying to create clusters via community_detection; in total I have more than 10M sentences.
The embedding/encoding generation is pretty fast and works like a charm.

The problem is the community_detection clustering step.
Here, the process (even with a larger batch_size) is really slow.
I'm trying to figure out a way to speed the process up and to make it fit in RAM.

Do you have any suggestions, guidelines, or solutions?

@nreimers
Member

Clustering 10M embeddings is hard, as it computes 10M * 10M (on the order of 10^14) pairwise scores.

I would reduce the vector space: e.g. first run k-means (e.g. with faiss) to break the vector space into 10-100 smaller spaces, then cluster each space individually. A rough sketch of this approach is below.
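For illustration, here is a minimal sketch of this two-stage approach (not a built-in API; the function name `partitioned_community_detection`, the default partition count of 50, and the faiss dependency are all assumptions). Note it is an approximation: two sentences that land in different k-means regions can never end up in the same community.

```python
import faiss
import numpy as np
import torch
from sentence_transformers import util


def partitioned_community_detection(embeddings, n_partitions=50,
                                    threshold=0.75, min_community_size=10):
    """Coarse k-means split, then community_detection inside each partition."""
    embeddings = np.ascontiguousarray(embeddings, dtype=np.float32)
    # Normalize so that L2-based k-means behaves like a cosine partitioning.
    embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

    # Step 1: break the vector space into n_partitions smaller regions.
    kmeans = faiss.Kmeans(embeddings.shape[1], n_partitions, niter=20)
    kmeans.train(embeddings)
    _, assignments = kmeans.index.search(embeddings, 1)
    assignments = assignments.ravel()

    # Step 2: cluster each region individually, so each pairwise score
    # matrix is roughly (n / n_partitions)^2 instead of n^2.
    communities = []
    for p in range(n_partitions):
        local_idx = np.where(assignments == p)[0]
        if len(local_idx) < min_community_size:
            continue
        part = torch.from_numpy(embeddings[local_idx])
        for community in util.community_detection(
                part, threshold=threshold,
                min_community_size=min_community_size):
            # Map partition-local indices back to global sentence indices.
            communities.append([int(local_idx[i]) for i in community])
    return communities
```

The partition count trades speed against recall: more partitions mean smaller per-region score matrices, but a higher chance of splitting a true community across a boundary.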

@tomaarsen
Collaborator

Hello!

#2381 should improve the efficiency of calling community_detection on GPU. It will be included in the next release; hopefully that helps with your issue. Beyond that, Nils' suggestion is a good one as well: even with the faster implementation, community_detection still has to compute 10M * 10M scores, which will always be somewhat slow.
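For reference, a minimal usage sketch (the model name and parameter values here are only examples): encoding with convert_to_tensor=True keeps the embeddings on the GPU, and community_detection then computes its similarity scores on that same device.

```python
import torch
from sentence_transformers import SentenceTransformer, util

# Example model; any SentenceTransformer model works.
# Falls back to CPU if no CUDA device is available.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = SentenceTransformer("all-MiniLM-L6-v2", device=device)

sentences = ["A first sentence.", "A second sentence."]  # ... your 10M sentences
embeddings = model.encode(sentences, convert_to_tensor=True,
                          show_progress_bar=True)  # tensor stays on `device`
clusters = util.community_detection(embeddings, threshold=0.75,
                                    min_community_size=10, batch_size=2048)
```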

  • Tom Aarsen
