In this article, we’ll demonstrate how to integrate custom clustering models into the Elastic Stack by leveraging a sample text dataset, streamlining the workflow within Elastic’s ecosystem. You can follow along to create a simple clustering pipeline with this Jupyter notebook.
Prologue
The Machine Learning App in Kibana provides a comprehensive suite of advanced capabilities, including anomaly and outlier detection, as well as classification and regression models. It supports the integration of custom models from the scikit-learn library via the eland Python client. While Kibana offers robust machine learning capabilities, it currently does not support clustering analysis, whether with prebuilt or custom models. Clustering algorithms are crucial for enhancing search relevance by grouping similar queries and for security, where they help identify patterns in data to detect potential threats and anomalies.
Elastic provides the flexibility to leverage custom scikit-learn models, such as k-means, for tasks like clustering—for example, grouping news articles by similarity. While these algorithms aren’t officially supported, you can use the model’s cluster centers as input for the ingest pipeline to integrate these capabilities seamlessly into your Elastic workflow. In the following sections, we’ll guide you through implementing this approach.
Dataset Overview
For this proof of concept, we utilized the 20 Newsgroups dataset, a popular benchmark for text classification and clustering tasks. This dataset consists of newsgroup posts organized into 20 distinct categories, covering topics such as sports, technology, religion, and science. It is widely available through the `scikit-learn` library.
In our experiments, we focused on a subset of 5 categories:
- rec.sport.baseball
- rec.sport.hockey
- comp.sys.ibm.pc.hardware
- talk.religion.misc
- sci.med
These categories were chosen to ensure a mix of technical, casual, and diverse topics for effective clustering analysis.
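If you'd like to follow along, this subset can be loaded directly from scikit-learn. The sketch below is minimal and the variable names are our own; the `remove` option strips headers, footers, and quoted replies so only the post bodies are used downstream.

```python
from sklearn.datasets import fetch_20newsgroups

categories = [
    "rec.sport.baseball",
    "rec.sport.hockey",
    "comp.sys.ibm.pc.hardware",
    "talk.religion.misc",
    "sci.med",
]

# Keep only the post body; headers, footers, and quoted replies are noise for clustering.
newsgroups = fetch_20newsgroups(
    subset="train",
    categories=categories,
    remove=("headers", "footers", "quotes"),
)
documents = newsgroups.data   # raw text of each post
labels = newsgroups.target    # integer category labels, used later only for evaluation
```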
Feature Extraction and Generating Text Embeddings
The text documents were cleaned by removing stop words, punctuation, and irrelevant tokens using scikit-learn’s `feature_extraction` utility, ensuring that the text vectors captured meaningful patterns. These features were then used to generate text embeddings using OpenAI’s language model “text-embedding-ada-002”.
text-embedding-ada-002 is among the most advanced models for generating dense vector representations of text, capturing the nuanced semantic meaning inherent in textual data. We used the Azure OpenAI endpoint to generate the embeddings for our analysis. Instructions for using this endpoint with Elasticsearch can be found at Elasticsearch open inference API adds Azure AI Studio support.
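As a rough sketch, the embeddings can be generated with the Azure OpenAI Python client as shown below. The endpoint, API version, and deployment name are placeholders for your own Azure resource; the exact code for this post lives in the accompanying notebook.

```python
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],  # e.g. https://<resource>.openai.azure.com
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",
)

def embed(texts, deployment="text-embedding-ada-002"):
    # Azure expects the deployment name in the `model` argument.
    response = client.embeddings.create(model=deployment, input=texts)
    return [item.embedding for item in response.data]

embeddings = embed(documents)  # in practice, send the documents in batches
```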
The embeddings were normalized before training the k-means clustering model to standardize vector magnitudes. Normalization is a critical preprocessing step for k-means since it calculates clusters based on Euclidean distances. Standardized embeddings eliminate magnitude discrepancies, ensuring that clustering decisions rely purely on semantic proximity, thereby enhancing the accuracy of the clustering results.
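A minimal sketch of this step, assuming `embeddings` is the list of vectors from the previous snippet:

```python
import numpy as np
from sklearn.preprocessing import normalize

# L2-normalize each embedding so every vector has unit length.
X = normalize(np.asarray(embeddings))
```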
We trained the k-means model using k=5 to match the dataset’s categories and extracted the cluster centers. These centers served as inputs for Kibana’s ingest pipeline, facilitating real-time clustering of incoming documents. We’ll discuss this further in the next section.
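The training step is a few lines of scikit-learn; the sketch below assumes the normalized matrix `X` from above. The `cluster_centers` nested list is what gets passed to the ingest pipeline later.

```python
from sklearn.cluster import KMeans

# k=5 to match the five newsgroup categories in the dataset.
kmeans = KMeans(n_clusters=5, random_state=42).fit(X)

# Nested list of floats, one inner list per cluster center -- this becomes the
# clusterCenters parameter of the ingest pipeline.
cluster_centers = kmeans.cluster_centers_.tolist()
```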
Dynamic clustering with the ingest pipeline's script processor
After the model is trained in scikit-learn, an ingest pipeline is used to assign a cluster number to each record. This ingest pipeline takes three configurable parameters:
- clusterCenters – a nested list with one list for each cluster center vector. For this blog, they were generated with scikit-learn.
- analysisField – the field that contains the dense vector data.
- normalize – whether to L2-normalize the analysisField vectors before computing distances.
Once the ingest pipeline is added to an index or data stream, all newly ingested documents will be assigned the closest cluster number.
The image below illustrates the end-to-end workflow of importing clustering in Kibana.
The full ingest pipeline script can be generated using Python; an example is in the “Add clustering ingest pipeline” section of the notebook. We’ll dive into the specifics of the ingest pipeline below.
The cluster_centers are then loaded as a nested list of floats, with one list for each cluster center.
In the first part of the Painless script, two functions are defined. The first is euclideanDistance, which returns the distance between two ArrayLists as a float. The second, l2NormalizeArray, scales an ArrayList so that the sum of its squared elements is equal to one.
Then the inference step of k-means is performed. For every cluster center, the distance is computed to the incoming document vector, which is read from the ingest pipeline context (ctx) via the analysisField parameter, the field containing the OpenAI text-ada-002 vector. The closestCluster number is then assigned to the document based on the closest cluster center, that is, the center with the shortest distance to the document. Additionally, if the normalize parameter is set to true, the incoming document vector is L2-normalized before the distance calculation.
The closestCluster number and the minDistance to that cluster are then passed back to the document through the ingest pipeline context.
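Putting those pieces together, the sketch below shows what the Painless source can look like, written as a Python string so it can be embedded in the pipeline definition. It mirrors the logic described above but is not copied verbatim from the notebook, so treat the details as an approximation.

```python
painless_source = """
// Distance between two lists of floats.
double euclideanDistance(def a, def b) {
  double sum = 0.0;
  for (int i = 0; i < a.size(); i++) {
    double diff = a[i] - b[i];
    sum += diff * diff;
  }
  return Math.sqrt(sum);
}

// Scale a list so the sum of its squared elements equals one.
def l2NormalizeArray(def v) {
  double norm = 0.0;
  for (int i = 0; i < v.size(); i++) { norm += v[i] * v[i]; }
  norm = Math.sqrt(norm);
  def out = new ArrayList();
  for (int i = 0; i < v.size(); i++) { out.add(v[i] / norm); }
  return out;
}

def docVector = ctx[params.analysisField];
if (params.normalize == true) {
  docVector = l2NormalizeArray(docVector);
}

int closestCluster = -1;
double minDistance = Double.MAX_VALUE;
for (int c = 0; c < params.clusterCenters.size(); c++) {
  double d = euclideanDistance(docVector, params.clusterCenters[c]);
  if (d < minDistance) {
    minDistance = d;
    closestCluster = c;
  }
}

ctx['closestCluster'] = closestCluster;
ctx['minDistance'] = minDistance;
"""
```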
There are a few configurable parameters, described above but included here for reference. The first is clusterCenters, a nested array of floats with one array for each cluster center. The second is analysisField, the field which contains the text-ada-002 vectors. Lastly, normalize will L2-normalize the document vector. Note that the normalize parameter should only be set to true if the vectors were also normalized before training the k-means model.
Finally, once the pipeline is configured, assign it an ID and PUT it to the cluster.
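Here is a sketch of that final step using the Elasticsearch Python client. The pipeline ID ("kmeans-clustering"), the cluster endpoint, and the embedding field name ("text_embedding") are placeholders; `painless_source` and `cluster_centers` come from the earlier snippets.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # replace with your cluster URL and auth

es.ingest.put_pipeline(
    id="kmeans-clustering",
    description="Assign the closest k-means cluster to each incoming document",
    processors=[
        {
            "script": {
                "lang": "painless",
                "source": painless_source,
                "params": {
                    "clusterCenters": cluster_centers,  # nested list, one list per center
                    "analysisField": "text_embedding",  # field holding the text-ada-002 vector
                    "normalize": True,                  # must match the preprocessing used at training time
                },
            }
        }
    ],
)
```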
Clustering Results
We expect the clustering results to show each category forming a distinct cluster. While baseball and hockey might overlap due to their shared sports context, the technical, religious, and medical categories should form separate and clearly defined clusters. When the OpenAI text-ada-002 vectors are viewed with the t-SNE dimensionality reduction algorithm, they show clear separation between these clusters, with the sports topics close together:
Actual newsgroup labels; 2D t-SNE trained on OpenAI text-ada-002 vectors
The location of the points shows clear separation between the groupings, indicating that the vectorization is capturing the semantic meaning of each article.
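For reference, the 2D projection in these plots can be reproduced with a few lines of scikit-learn and matplotlib; the sketch below assumes `X` (the normalized embeddings) and `labels` from earlier.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Project the high-dimensional embeddings down to 2D for plotting.
coords = TSNE(n_components=2, random_state=42).fit_transform(X)

plt.scatter(coords[:, 0], coords[:, 1], c=labels, s=5, cmap="tab10")
plt.title("Actual newsgroup labels; t-SNE of text-ada-002 vectors")
plt.show()
```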
As a result, the zero-shot classification results are excellent. Even though no labels were provided to the model during training, and only the number of clusters was specified, the k-means model assigns cluster numbers with greater than 94% accuracy on in-sample data:
Predicted cluster labels; 2D t-SNE trained on OpenAI text-ada-002 vectors
Comparing the actual newsgroup labels to the in-sample predicted labels, there is very little difference between the two. This is represented by the confusion matrix:
Zero-shot Classification Confusion Matrix on OpenAI text-ada-002 vectors
The diagonal of the confusion matrix represents the in-sample accuracy for each category: the model predicts the correct label more than 94% of the time for every category.
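One common way to score cluster assignments against known labels, sketched below, is to map each cluster to the majority true label among its members and then compute accuracy and the confusion matrix. The exact evaluation code for this post is in the notebook; this is an approximation of the idea.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

pred_clusters = kmeans.predict(X)

# Map each cluster id to the most frequent newsgroup label among its members.
mapping = {
    c: np.bincount(labels[pred_clusters == c]).argmax()
    for c in np.unique(pred_clusters)
}
pred_labels = np.array([mapping[c] for c in pred_clusters])

print("in-sample accuracy:", accuracy_score(labels, pred_labels))
print(confusion_matrix(labels, pred_labels))
```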
Detecting outliers in the Clusters
The k-means model can be viewed as an approximation of a Gaussian Mixture Model (GMM) that does not capture covariance, where the quantiles of the distances to the nearest cluster center approximate the quantiles of the data distribution. This means that a k-means model can capture an approximation of the overall data distribution. With this approach, a large number of clusters can be chosen, in this case 100, and a new model trained. The higher the number of clusters, the more flexible the fit of the distribution. So in this case, the goal is not to learn the internal groupings of the data, but rather to capture the distribution of the data overall.
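A sketch of this idea in Python, assuming the normalized embeddings `X` from earlier and using the 75th-percentile cutoff described in the next paragraph; in production, the same percentile can be pulled from indexed documents with a `percentiles` aggregation on the minDistance field written by the ingest pipeline.

```python
import numpy as np
from sklearn.cluster import KMeans

# Many small clusters approximate the shape of the data distribution.
kmeans_100 = KMeans(n_clusters=100, random_state=42).fit(X)

# Distance from each training document to its nearest cluster center.
nearest_distances = kmeans_100.transform(X).min(axis=1)

# In-sample 75th-percentile distance used as the outlier cutoff.
cutoff = np.percentile(nearest_distances, 75)

def is_outlier(vector):
    # A new document is an outlier if it is farther than the cutoff from every center.
    return np.linalg.norm(kmeans_100.cluster_centers_ - vector, axis=1).min() > cutoff
```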
The distance quantiles can be computed with a query. In this case, a model was trained with 100 clusters and the 75th-percentile distance was chosen as the cutoff for outliers. Starting with the same graph above showing the t-SNE representation of the actual newsgroup labels:
Actual newsgroup labels; 2D t-SNE trained on OpenAI text-ada-002 vectors
When adding data from newsgroups that were not in the training set, the 2D t-SNE representation shows a good fit for the data. Here, orange data points are not considered outliers, while those in dark grey are labelled as outliers:
Outlier results for k=100; 2D t-SNE trained on OpenAI text-ada-002 vectors
Bringing it all together
In this blog, we demonstrated how to integrate custom clustering models into the Elastic Stack. We developed a workflow that imports scikit-learn clustering models, such as k-means, into the Elastic Stack, enabling clustering analysis directly within Kibana. By using the 20 Newsgroups dataset, we demonstrated how to apply this workflow to group similar documents, while also discussing the use of advanced text embedding models such as OpenAI's “text-embedding-ada-002” to create semantic representations essential for efficient clustering.
The results section showcased clear cluster separation, indicating that the “text-embedding-ada-002” model captures semantic meaning effectively. The k-means model achieved over 94% accuracy in zero-shot classification, with the confusion matrix showing minimal discrepancies between predicted and actual labels, confirming its strong performance.
With this workflow, Elastic users can apply clustering techniques to their own datasets, whether for grouping similar queries in search or detecting unusual patterns for security applications. The solution presented here provides an easy way to integrate advanced clustering functionality into Elastic. We hope this inspires you to explore these capabilities and apply them to your own use cases.
What’s next?
The clustering results above show that the Painless implementation accurately clusters similar topics, achieving over 94% accuracy. Moving forward, our goal is to test the pipeline on a less structured dataset with significantly more noise and a larger number of clusters. This will help evaluate its performance in more challenging scenarios. While k-means has shown decent clustering results, exploring alternatives like Gaussian Mixture Models or Mean Shift for outlier detection might yield better outcomes. These methods could also be implemented using a Painless script in an ingest pipeline.
In the future, we think this workflow can be enhanced with ELSER, as we could use ELSER to first retrieve relevant features from the dataset, which would then be used for clustering, further improving the model’s performance and relevance in the analysis. Additionally, we would like to address how to properly set the correct number of clusters, and how to effectively deal with model drift.
In the meantime, if you have similar experiments or use cases to share, we’d love to hear about them! Feel free to provide feedback or connect with us through our community Slack channel or discussion forums.
Elasticsearch is packed with new features to help you build the best search solutions for your use case. Dive into our sample notebooks to learn more, start a free cloud trial, or try Elastic on your local machine now.