Finding similar projects in and across orgs by analyzing their descriptions

nelabdiel/ghost_clusters

Latest commit

47b1fbb · Nov 9, 2024

Project: Ghost Clusters (Work In Progress)

This project clusters AI use cases by their titles and descriptions and supports visual analysis of the resulting clusters. Embeddings are generated with SBERT (Sentence-BERT) and clustered with DBSCAN. A Jupyter Notebook is included for further exploration and visualization of the clustered data.

Overview

The project includes:

  • bclustering.py: A Python script that reads a CSV file of project data, generates embeddings using SBERT, clusters the projects using DBSCAN, and saves the clustered data to a new CSV file.
  • analysis.ipynb: A Jupyter Notebook that provides visualizations and analyses of the clustered data, including heatmaps, network graphs, and other insights.

Requirements

Ensure the following Python libraries are installed:

  • pandas
  • sentence-transformers
  • scikit-learn
  • matplotlib
  • seaborn
  • networkx
  • nltk

To install these libraries, run:

pip install pandas sentence-transformers scikit-learn matplotlib seaborn networkx nltk

Usage

Running bclustering.py

bclustering.py reads a CSV file of project data, generates SBERT embeddings, and clusters them with DBSCAN. It takes three positional arguments: the path to the CSV file, the name of the column containing the project title, and the name of the column containing the project description.

Command-Line Usage

python bclustering.py <path_to_csv> <title_column_name> <description_column_name>
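
For illustration, three positional arguments like these could be parsed with the standard library's argparse. This is a minimal sketch, not the script's actual code: the function name build_parser and the argument names csv_path, title_column, and description_column are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for three positional arguments, mirroring the usage line."""
    parser = argparse.ArgumentParser(
        description="Cluster projects by title and description."
    )
    # Argument names are illustrative; bclustering.py may name them differently.
    parser.add_argument("csv_path", help="Path to the input CSV file")
    parser.add_argument("title_column", help="Name of the title column")
    parser.add_argument("description_column", help="Name of the description column")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["projects.csv", "title", "description"])
print(args.csv_path, args.title_column, args.description_column)
```

Positional arguments keep the command short, at the cost of having to remember their order.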

Script Features

  • Embedding Generation: Uses Sentence-BERT (all-MiniLM-L12-v2) to create embeddings from the combined title and description.
  • Clustering: Applies DBSCAN with the cosine distance metric to cluster the embeddings.
  • Output: Saves the clustered data to a new CSV file with an additional column indicating the cluster label.
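
The embedding and clustering steps listed above can be sketched roughly as follows. This is a hedged illustration rather than the contents of bclustering.py: the helper names (combine_text, embed_texts, cluster_embeddings), the way title and description are joined, and the eps and min_samples values are all assumptions. The toy vectors stand in for real embeddings so the clustering step can be tried without downloading the model.

```python
import numpy as np
from sklearn.cluster import DBSCAN


def combine_text(title, description):
    """Join title and description into one string (joining format is an assumption)."""
    return f"{title}. {description}"


def embed_texts(texts, model_name="all-MiniLM-L12-v2"):
    """Encode texts with Sentence-BERT; the model is downloaded on first use."""
    # Imported lazily so the clustering helper works without sentence-transformers.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    return model.encode(list(texts))


def cluster_embeddings(embeddings, eps=0.3, min_samples=2):
    """Run DBSCAN on cosine distance; the label -1 marks noise points."""
    # eps and min_samples are illustrative values, not the script's settings.
    model = DBSCAN(eps=eps, min_samples=min_samples, metric="cosine")
    return model.fit_predict(np.asarray(embeddings))


# Toy 2-D vectors standing in for embeddings: two tight groups and one outlier.
toy = np.array([[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.05, 0.99], [-1.0, -1.0]])
labels = cluster_embeddings(toy, eps=0.05, min_samples=2)
print(labels)
```

Because DBSCAN assigns -1 to points that fit no cluster, the cluster label column in the output can contain -1 for projects that have no close neighbors.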

Example

python bclustering.py projects.csv title description

Visual Analysis

The analysis.ipynb notebook provides a detailed analysis of the clustered data, including:

  • Cluster Distribution: Visualize the number of projects per cluster using bar plots.
  • Heatmap: Generate a heatmap that shows the number of projects per department and cluster.
  • Network Graphs: Visualize department interactions based on shared clusters. Node size is proportional to the number of projects, and edge thickness represents the number of shared clusters.
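
The quantities behind the heatmap and the network graph can be computed with a few lines of plain Python. The sketch below is an assumption about how such counts might be derived, not the notebook's code: the department names, the (department, cluster) row layout, and the choice to ignore DBSCAN's noise label -1 when counting shared clusters are all hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def department_cluster_counts(rows):
    """Count projects per (department, cluster) pair: the data behind a heatmap."""
    return Counter((dept, cluster) for dept, cluster in rows)


def shared_cluster_edges(rows, noise_label=-1):
    """Weight each department pair by the number of clusters both appear in."""
    departments_by_cluster = defaultdict(set)
    for dept, cluster in rows:
        if cluster != noise_label:  # assumption: noise is not a shared cluster
            departments_by_cluster[cluster].add(dept)
    edges = Counter()
    for departments in departments_by_cluster.values():
        for pair in combinations(sorted(departments), 2):
            edges[pair] += 1
    return edges


# Hypothetical (department, cluster label) rows, one per project.
rows = [("Health", 0), ("Energy", 0), ("Health", 1), ("Energy", 1),
        ("Transport", 1), ("Transport", -1)]
counts = department_cluster_counts(rows)                   # heatmap cells
edges = shared_cluster_edges(rows)                         # edge thickness
projects_per_department = Counter(d for d, _ in rows)      # node size
print(edges)
```

The edge weights and node sizes can then be handed to networkx for drawing, and the pair counts can be pivoted into a department-by-cluster table for seaborn.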

Notes

  • The analysis.ipynb notebook expects the output CSV file produced by bclustering.py, so run the script first.
  • Custom stopwords, including <title> and <description> tags, are added to improve cluster insights.
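
As a rough illustration of why the tags are treated as stopwords: if the combined text carries <title> and <description> markers, they would appear in every project and crowd out informative terms. The sketch below uses only the standard library as a stand-in for nltk; the stopword list, the tokenizer, and the sample texts are assumptions, not the notebook's actual code.

```python
import re
from collections import Counter

# Illustrative stopword list; the notebook's actual list is not shown in this README.
CUSTOM_STOPWORDS = {"<title>", "<description>", "the", "a", "of", "for", "and", "to"}


def top_terms(texts, stopwords=CUSTOM_STOPWORDS, n=3):
    """Return the n most frequent tokens across texts, skipping stopwords."""
    counts = Counter()
    for text in texts:
        # Keep tags such as <title> as single tokens so they can be filtered out.
        tokens = re.findall(r"<\w+>|\w+", text.lower())
        counts.update(t for t in tokens if t not in stopwords)
    return counts.most_common(n)


# Hypothetical combined texts for two projects in the same cluster.
texts = [
    "<title> Fraud detection <description> Detection of fraud for claims",
    "<title> Claims triage <description> Triage of claims",
]
print(top_terms(texts, n=1))
```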

Future Enhancements

  • Add more clustering algorithms for comparison.
  • Implement interactive visualizations using plotly or bokeh for a more dynamic analysis experience.
