The cluster module groups text documents into clusters of similar documents. This example program clusters documents with the Jaseci cluster module: it takes a list of raw text documents as input and produces a cluster label for each document.
- To start Jaseci, open a terminal and run the following command.
  jsctl -m
- Load the cluster module in the jac shell session.
  actions load module jac_misc.cluster
- Load the use_enc module in the jac shell session.
  actions load module jac_nlp.use_enc
In this section, we take raw text as input, encode it, and output a list of features with reduced dimensionality. These features are the input for clustering in the next section.
Save the text data in JSON format. The walker below loads it from text_data.json.
walker features{
    can file.load_json;
    has text = file.load_json("text_data.json");
}
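If the documents start out in a script rather than a file, the JSON input can be produced programmatically. The following is a minimal Python sketch, not part of Jaseci: the helper name save_texts is hypothetical, and the only detail taken from this guide is that the walker expects a JSON array of strings in text_data.json.

```python
import json
import tempfile
from pathlib import Path


def save_texts(texts, path):
    """Hypothetical helper: write a list of raw text documents as a JSON array."""
    if not all(isinstance(t, str) for t in texts):
        raise TypeError("every document must be a string")
    Path(path).write_text(json.dumps(texts, indent=2), encoding="utf-8")
    return Path(path)


sample = ["still waiting card", "countries supporting", "send new card"]

# A temporary directory keeps this sketch side-effect free; in practice the
# file would be written next to the .jac file that loads it.
with tempfile.TemporaryDirectory() as workdir:
    saved = save_texts(sample, Path(workdir) / "text_data.json")
    loaded = json.loads(saved.read_text(encoding="utf-8"))

print(loaded == sample)
```

The round trip through json.loads confirms that the file holds a plain list of strings, the shape this guide passes to file.load_json.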
In this section we use the use.encode action to encode the raw text. use.encode returns a 512-dimensional vector for each text document. We then reduce the dimensionality of these vectors with the cluster.get_umap action.
UMAP (Uniform Manifold Approximation and Projection) is a dimensionality reduction algorithm used for visualizing and exploring high-dimensional data. It aims to preserve both global and local structure of the data by representing it as a low-dimensional embedding while minimizing distortion. It has been shown to be highly effective in preserving non-linear structure and identifying clusters in high-dimensional datasets. UMAP has applications in various fields including machine learning, image processing, and bioinformatics.
Parameters of cluster.get_umap
- text_embeddings: list - Mandatory. The list of text embeddings to reduce.
- n_neighbors: int - Optional; the default is 15. For better results, set this based on your input data. This parameter balances local versus global structure in the data. Low values focus on local structure (at the cost of the big picture); higher values focus on the overall structure of the data (at the cost of fine detail).
- min_dist: float - Optional; the default is 0.1. This parameter controls how tightly cluster.get_umap is allowed to pack points together. Set it to a low value when the output is used for clustering.
- n_components: int - Optional; the default is 2. This is the dimensionality of the reduced data. It is not limited to 2 or 3; higher values can be tried, as with PCA.
- random_state: int - Optional; the default is 42. This seed controls the reproducibility of the algorithm.
node feature_embedd{
    can use.encode;
    can cluster.get_umap;
    has final_features;
    can set_features with features entry{
        encode = use.encode(visitor.text);
        final_features = cluster.get_umap(encode,2);
    }
}
This section obtains a cluster label for each text document. The output of the previous section is the input here. To get the cluster labels we use the cluster.get_cluster_labels action.
Jaseci provides two clustering algorithms: HDBSCAN and K-means.
HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm that seeks to identify clusters of varying densities in a dataset. It constructs a hierarchy of clusters by recursively partitioning data points based on their local density and connectivity. The algorithm automatically determines the number of clusters and identifies noise points as well. HDBSCAN has been shown to be effective in identifying clusters of varying shapes and sizes in high-dimensional datasets. It has applications in various fields including image processing, social network analysis, and bioinformatics.
K-means is a popular clustering algorithm used for partitioning a dataset into k clusters, where k is a pre-defined number. It works by iteratively assigning each data point to the nearest centroid (mean) and then re-calculating the centroids based on the new cluster assignments. The algorithm stops when the cluster assignments no longer change significantly or after a maximum number of iterations. K-means is widely used due to its simplicity, scalability, and efficiency in handling large datasets. It has applications in various fields including customer segmentation, image processing, and bioinformatics. However, it assumes that the clusters are spherical and have equal variance, which may not always be the case in real-world scenarios.
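The assign-and-update loop described above can be sketched in a few lines of plain Python. This is an illustration of the general algorithm only, not the code behind cluster.get_cluster_labels; the function name, the toy points, and the choice of initial centroids are all assumptions made for the example.

```python
import math


def kmeans(points, centroids, max_iter=100):
    """Illustrative k-means: alternate assignment and centroid update steps."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments no longer change, so stop iterating
        labels = new_labels
        # Update step: move each centroid to the mean of its assigned points.
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[j] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centroids


# Two well-separated groups of 2-D points (toy data for the sketch).
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centroids = kmeans(points, centroids=[points[0], points[3]])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Here k is fixed by the number of initial centroids, which mirrors the need to choose the number of clusters up front.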
Parameters of cluster.get_cluster_labels
- embeddings: list - Mandatory. The list of embedded text features.
- algorithm: str - The default is "hbdscan". Jaseci currently supports only the hbdscan and kmeans algorithms for clustering.
- min_samples: int - Required only when using the hbdscan algorithm. This controls how conservative the clustering is: with larger values, more data points are treated as noise.
- min_cluster_size: int - Required only when using the hbdscan algorithm. This is the minimum number of data points in a cluster. Increasing it reduces the number of clusters.
- n_clusters: int - Required only when using the kmeans algorithm. This is the number of clusters to produce.
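The rule for which parameters go with which algorithm can be written down as a small check. The following Python sketch is hypothetical: cluster_label_kwargs is not a Jaseci function, and it only encodes the requirements listed above, using the "hbdscan" spelling that this guide's examples pass to the action.

```python
def cluster_label_kwargs(embeddings, algorithm="hbdscan", min_samples=None,
                         min_cluster_size=None, n_clusters=None):
    """Hypothetical helper: collect the arguments each algorithm needs."""
    kwargs = {"embeddings": embeddings, "algorithm": algorithm}
    if algorithm == "hbdscan":
        # The density-based algorithm needs both density parameters.
        if min_samples is None or min_cluster_size is None:
            raise ValueError("hbdscan needs min_samples and min_cluster_size")
        kwargs.update(min_samples=min_samples, min_cluster_size=min_cluster_size)
    elif algorithm == "kmeans":
        # K-means needs the number of clusters up front.
        if n_clusters is None:
            raise ValueError("kmeans needs n_clusters")
        kwargs.update(n_clusters=n_clusters)
    else:
        raise ValueError(f"unsupported algorithm: {algorithm!r}")
    return kwargs


features = [[0.1, 0.2], [0.3, 0.4]]
hdb_args = cluster_label_kwargs(features, min_samples=2, min_cluster_size=2)
km_args = cluster_label_kwargs(features, algorithm="kmeans", n_clusters=2)
print(sorted(hdb_args))
print(sorted(km_args))
```

The helper leaves out the parameters that do not apply to the chosen algorithm, whereas the kmeans example in this guide still passes min_samples=0 and min_cluster_size=0 alongside n_clusters.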
can cluster.get_cluster_labels;
has labels;
has final_features;
can set_lables{
    labels = cluster.get_cluster_labels(embeddings=final_features,algorithm="hbdscan",min_samples=2,min_cluster_size=2);
    report labels;
}
If you use the kmeans algorithm, the set_lables ability should be as follows:
can set_lables{
    labels = cluster.get_cluster_labels(embeddings=final_features,algorithm="kmeans",min_samples=0,min_cluster_size=0,n_clusters=2);
    report labels;
}
The complete code with the graph structure.
graph text_cluster_graph {
    has anchor text_feature;
    spawn {
        text_feature = spawn node::feature_embedd;
        text_cluster = spawn node::cluster_labels;
        text_feature -[cluster_model(model_type="hbdscan")]-> text_cluster;
    }
}

node feature_embedd{
    can use.encode;
    can cluster.get_umap;
    has final_features;
    can set_features with features entry{
        encode = use.encode(visitor.text);
        final_features = cluster.get_umap(encode,2);
    }
}

node cluster_labels{
    can cluster.get_cluster_labels;
    has labels;
}

edge cluster_model{
    has model_type;
}

walker features{
    can file.load_json;
    has text = file.load_json("text_data.json");
}

walker init{
    has final_features;
    can set_lables{
        labels = cluster.get_cluster_labels(embeddings=final_features,algorithm="hbdscan",min_samples=2,min_cluster_size=2);
        report labels;
    }
    root {
        spawn here --> graph::text_cluster_graph;
        take-->;
    }
    feature_embedd{
        spawn here walker::features;
        final_features = here.final_features;
        take-->;
    }
    cluster_labels{
        ::set_lables;
    }
}
Save the above code in a file named cluster.jac, and save the following text data as text_data.json in the same directory.
[
"still waiting card",
"countries supporting",
"card still arrived weeks",
"countries accounts suppor",
"provide support countries",
"waiting week card still coming",
"track card process delivery",
"countries getting support",
"know get card lost",
"send new card",
"still received new card",
"info card delivery",
"new card still come",
"way track delivery card",
"countries currently support"
]
Run the jac code in the terminal with the jac run cluster.jac command. You will see output like the following:
{
"success": true,
"report": [
[
0,
2,
0,
2,
2,
0,
3,
2,
0,
1,
1,
3,
1,
3,
2
]
],
"final_node": "urn:uuid:8828d927-044d-4dec-85b4-65ba34e4a93c",
"yielded": false
}
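The report is a flat list with one label per input document, in input order, so it becomes easier to read once the labels are paired with the texts. Below is a minimal Python sketch of that post-processing step; it is not part of the Jaseci module, the helper name group_by_label is hypothetical, and the texts and labels are copied from the example above.

```python
from collections import defaultdict

texts = [
    "still waiting card", "countries supporting", "card still arrived weeks",
    "countries accounts suppor", "provide support countries",
    "waiting week card still coming", "track card process delivery",
    "countries getting support", "know get card lost", "send new card",
    "still received new card", "info card delivery", "new card still come",
    "way track delivery card", "countries currently support",
]
labels = [0, 2, 0, 2, 2, 0, 3, 2, 0, 1, 1, 3, 1, 3, 2]


def group_by_label(texts, labels):
    """Hypothetical helper: collect the documents that share each cluster label."""
    if len(texts) != len(labels):
        raise ValueError("expected exactly one label per document")
    groups = defaultdict(list)
    for text, label in zip(texts, labels):
        groups[label].append(text)
    return dict(groups)


groups = group_by_label(texts, labels)
for label in sorted(groups):
    print(label, groups[label])
```

With this output, the documents about supported countries share label 2 and the requests about a new card share label 1. The HDBSCAN library conventionally marks noise points with the label -1; if the action passes that through, those documents would simply appear under the -1 key.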