In this article, we'll cover the CLIP multimodal model, explore alternatives, and analyze their pros and cons through a practical example of a mock real estate website that allows users to search for properties using pictures as references.
What is CLIP?
CLIP (Contrastive Language–Image Pre-training) is a neural network created by OpenAI, trained on pairs of images and texts to find similarities between text and images. It can also classify images "zero-shot": instead of being trained on a fixed set of labels, the model receives class names it has never seen at inference time and picks the one that best matches the image we provide.
CLIP has been the state-of-the-art model for this kind of task for a while, and you can read more articles about it here:
However, over time, more alternatives have emerged.
In this article, we'll go through the pros and cons of two alternatives to CLIP using a real estate example. Here’s a summary of the steps we’ll follow in this article:

Basic configuration: CLIP and Elasticsearch
For our example, we will create a small project with an interactive UI using Python. We will install some dependencies, like the Hugging Face transformers library, which gives us access to some of the models we'll use.
Create a folder /clip_comparison and follow the installation instructions located here. Once you're done, install the Elasticsearch Python client, the Cohere SDK, and Streamlit:
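```bash
pip install elasticsearch cohere streamlit
```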
NOTE: Optionally, I recommend using a Python virtual environment (venv). This is useful if you don't want to install the dependencies globally on your machine.
Streamlit is an open-source Python framework that lets you easily build a UI with very little code.
We'll also create some files to hold the code we'll use later:
- app.py: UI logic.
- /services/elasticsearch.py: Elasticsearch client initialization, queries, and bulk API call to index documents.
- /services/models.py: Model instances and methods to generate embeddings.
- index_data.py: Script to index images from a local source.
- /data: our dataset directory.
Our App structure should look something like this:
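```
clip_comparison/
├── app.py
├── index_data.py
├── data/
└── services/
    ├── elasticsearch.py
    └── models.py
```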
Configuring Elasticsearch
Follow these steps to store the images for the example. We'll then search for them using kNN vector queries.
Note: We could also store text documents, but for this example, we will only search the images.
Index Mappings
Access Kibana Dev Tools (from Kibana: Management > Dev Tools) to build the data structure using these mappings:
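A minimal sketch of such a mapping, assuming one index per model, could look like the request below; the index and field names (image_name, image_embedding, image_data) are illustrative, and dims has to match each model's output size (512 for CLIP ViT-B/16, 768 for jina-clip-v1, and 1024 for Cohere Embed-3):

```
PUT images-clip
{
  "mappings": {
    "properties": {
      "image_name": { "type": "keyword" },
      "image_embedding": {
        "type": "dense_vector",
        "dims": 512,
        "index": true,
        "similarity": "cosine"
      },
      "image_data": { "type": "binary" }
    }
  }
}
```

Repeat the request for the other two indices, adjusting dims accordingly.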
The field type dense_vector will store the embeddings generated by the models. The field binary will store the images as base64.
Note: It's not a good practice to store images in Elasticsearch as binary. We're only doing it for the practical purpose of this example. The recommendation is to use a static files repository.
Now to the code. The first thing we need to do is initialize the Elasticsearch client using the Cloud ID and API key. Write the following code at the beginning of the file /services/elasticsearch.py:
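As a rough sketch, assuming an Elastic Cloud deployment and the 8.x Python client (the bulk_index and knn_search helpers are illustrative names that the later snippets reuse):

```python
# services/elasticsearch.py
import os

from elasticsearch import Elasticsearch, helpers

# Initialize the client with the Cloud ID and API key of your deployment
es = Elasticsearch(
    cloud_id=os.environ["ELASTIC_CLOUD_ID"],
    api_key=os.environ["ELASTIC_API_KEY"],
)


def bulk_index(index_name: str, docs: list[dict]) -> int:
    """Index a batch of documents with the Bulk API and return how many succeeded."""
    actions = ({"_index": index_name, "_source": doc} for doc in docs)
    success, _ = helpers.bulk(es, actions)
    return success


def knn_search(index_name: str, embedding: list[float], k: int = 5) -> list[dict]:
    """Return the k most similar images to the given embedding."""
    response = es.search(
        index=index_name,
        knn={
            "field": "image_embedding",
            "query_vector": embedding,
            "k": k,
            "num_candidates": 50,
        },
        source=["image_name", "image_data"],
    )
    return response["hits"]["hits"]
```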
Configuring models
To configure the models, put the model instances and their methods in this file: /services/models.py.
The Cohere Embed-3 model works as a web service, so we need an API key to use it. You can get a free one here. The trial is limited to 5 calls per minute and 1,000 calls per month.
To configure the model and make the images searchable in Elasticsearch, follow these steps:
- Convert images to vectors using CLIP.
- Store the image vectors in Elasticsearch.
- Vectorize the image or text we want to compare against the stored images.
- Run a kNN query to compare the embedding from the previous step to the stored images and get the most similar ones.
Configuring CLIP
To configure CLIP, we need to add the methods that generate the image and text embeddings to the models.py file.
For all the models, you need to declare similar methods: one to generate embeddings from an image (clip_image_embeddings) and another one to generate embeddings from text (clip_text_embeddings).
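As a sketch, assuming the Hugging Face transformers library and the openai/clip-vit-base-patch16 checkpoint (the ViT-B/16 variant that appears in the comparison table below), these methods could look like this:

```python
# services/models.py (CLIP section)
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

CLIP_CHECKPOINT = "openai/clip-vit-base-patch16"
clip_model = CLIPModel.from_pretrained(CLIP_CHECKPOINT)
clip_processor = CLIPProcessor.from_pretrained(CLIP_CHECKPOINT)


def clip_image_embeddings(image_path: str) -> list[float]:
    """Generate an embedding for an image file with CLIP."""
    image = Image.open(image_path)
    inputs = clip_processor(images=image, return_tensors="pt")
    outputs = clip_model.get_image_features(**inputs)
    # Convert the PyTorch tensor into a plain list of floats (see below)
    return outputs.detach().cpu().numpy().flatten().tolist()


def clip_text_embeddings(text: str) -> list[float]:
    """Generate an embedding for a text query with CLIP."""
    inputs = clip_processor(text=[text], return_tensors="pt", padding=True, truncation=True)
    outputs = clip_model.get_text_features(**inputs)
    return outputs.detach().cpu().numpy().flatten().tolist()
```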
The outputs.detach().cpu().numpy().flatten().tolist() chain is a common operation to convert PyTorch tensors into a more usable format:
- .detach(): Removes the tensor from the computation graph, as we no longer need to compute gradients.
- .cpu(): Moves the tensor from GPU to CPU, since NumPy only supports CPU.
- .numpy(): Converts the tensor to a NumPy array.
- .flatten(): Converts it into a 1D array.
- .tolist(): Converts it into a Python list.
This operation converts a multidimensional tensor into a plain list of numbers that can be used for embedding operations.
Let's now take a look at some CLIP alternatives.
Competitor 1: JinaCLIP
JinaCLIP is a CLIP variant developed by Jina AI, designed specifically to improve image and text search in multimodal applications. It improves on CLIP by adding more flexibility in how images and text are represented.
Compared to the original OpenAI CLIP model, JinaCLIP performs better in text-to-text, text-to-image, image-to-text, and image-to-image tasks as we can see in the chart below:

| Model | Text-Text | Text-to-Image | Image-to-Text | Image-Image |
|---|---|---|---|---|
| jina-clip-v1 | 0.429 | 0.899 | 0.803 | 0.916 |
| openai-clip-vit-b16 | 0.162 | 0.881 | 0.756 | 0.816 |
| % increase vs. OpenAI CLIP | 165% | 2% | 6% | 12% |
This improved precision across different types of queries makes it a great tool for tasks that require a more precise and detailed analysis.
You can read more about JinaCLIP here.
To use JinaCLIP in our app and generate embeddings, we need to declare these methods:
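As a sketch, assuming the jinaai/jina-clip-v1 checkpoint from Hugging Face, which exposes encode_image and encode_text helpers when loaded with trust_remote_code:

```python
# services/models.py (JinaCLIP section)
from transformers import AutoModel

# jina-clip-v1 ships its own modeling code, so trust_remote_code is required
jina_model = AutoModel.from_pretrained("jinaai/jina-clip-v1", trust_remote_code=True)


def jina_image_embeddings(image_path: str) -> list[float]:
    """Generate an embedding for an image file (or URL) with JinaCLIP."""
    embeddings = jina_model.encode_image([image_path])
    return embeddings[0].tolist()


def jina_text_embeddings(text: str) -> list[float]:
    """Generate an embedding for a text query with JinaCLIP."""
    embeddings = jina_model.encode_text([text])
    return embeddings[0].tolist()
```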
Competitor 2: Cohere Image Embeddings V3
Cohere has developed an image embedding model called Embed-3, which is a direct CLIP competitor. The main difference is that Cohere focuses on the representation of enterprise data like charts, product images, and design files. Embed-3 uses an architecture that reduces the risk of bias towards text data, currently a weakness of other multimodal models like CLIP, so it can provide more precise results when matching text and images.
Below is a chart by Cohere showing the improved results of Embed-3 versus CLIP on this kind of data:

For more info, go to Embed-3.
Just like we did with the previous models, let's declare the methods to use Embed-3:
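As a sketch, assuming the Cohere Python SDK and the embed-english-v3.0 model; Embed-3 takes images as base64 data URIs, and the exact response shape can vary between SDK versions, so check the embedding access against Cohere's documentation:

```python
# services/models.py (Cohere Embed-3 section)
import base64
import os

import cohere

co = cohere.ClientV2(api_key=os.environ["COHERE_API_KEY"])
COHERE_MODEL = "embed-english-v3.0"


def cohere_image_embeddings(image_path: str) -> list[float]:
    """Generate an embedding for an image file with Cohere Embed-3."""
    with open(image_path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    data_uri = f"data:image/jpeg;base64,{encoded}"
    response = co.embed(
        model=COHERE_MODEL,
        input_type="image",
        embedding_types=["float"],
        images=[data_uri],
    )
    # Depending on the SDK version, this attribute may be named float_ instead
    return response.embeddings.float[0]


def cohere_text_embeddings(text: str) -> list[float]:
    """Generate an embedding for a text query with Cohere Embed-3."""
    response = co.embed(
        model=COHERE_MODEL,
        input_type="search_query",
        embedding_types=["float"],
        texts=[text],
    )
    return response.embeddings.float[0]
```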
With the functions ready, let's index the dataset in Elasticsearch by adding the following code to the file index_data.py:
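A rough sketch, assuming one index per model and reusing the illustrative helpers from the previous snippets:

```python
# index_data.py
import base64
import os

from services.elasticsearch import bulk_index
from services.models import (
    clip_image_embeddings,
    cohere_image_embeddings,
    jina_image_embeddings,
)

DATA_DIR = "data"

# One index per model, paired with that model's image embedding method
MODELS = {
    "images-clip": clip_image_embeddings,
    "images-jinaclip": jina_image_embeddings,
    "images-cohere": cohere_image_embeddings,  # trial keys are limited to 5 calls per minute
}


def load_image_paths() -> list[str]:
    return [
        os.path.join(DATA_DIR, name)
        for name in sorted(os.listdir(DATA_DIR))
        if name.lower().endswith((".jpg", ".jpeg", ".png"))
    ]


def build_doc(image_path: str, embedding: list[float]) -> dict:
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    return {
        "image_name": os.path.basename(image_path),
        "image_embedding": embedding,
        "image_data": image_b64,
    }


if __name__ == "__main__":
    paths = load_image_paths()
    indexed = {
        index_name: bulk_index(index_name, [build_doc(p, embed(p)) for p in paths])
        for index_name, embed in MODELS.items()
    }
    print(indexed)  # number of documents indexed per index
```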
Index the documents using the command:
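```bash
python index_data.py
```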
The response will show us the number of documents indexed in each index:
Once the dataset has been indexed, we can create the UI.
Test UI
Creating the UI
We are going to use Streamlit to build a UI and compare the three alternatives side-by-side.
To build the UI, we'll start by adding the imports and dependencies to the file app.py:
For this example, we'll use two views, one for the image search and another one to see the image dataset:
Let's add the view code for Search Image:
And now, the code for the Images view:
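Pulling these pieces together, a condensed sketch of app.py could look like the following; the index names and the helpers imported from services/ are the same illustrative names used earlier, and only text-to-image search is wired up here (an image uploader would follow the same pattern, using the image embedding methods instead):

```python
# app.py
import base64
import io

import streamlit as st
from PIL import Image

from services.elasticsearch import es, knn_search
from services.models import (
    clip_text_embeddings,
    cohere_text_embeddings,
    jina_text_embeddings,
)

# One index per model, paired with that model's text embedding method
MODELS = {
    "CLIP": ("images-clip", clip_text_embeddings),
    "JinaCLIP": ("images-jinaclip", jina_text_embeddings),
    "Cohere Embed-3": ("images-cohere", cohere_text_embeddings),
}

st.title("Real estate image search")
search_view, images_view = st.tabs(["Search Image", "Images"])


def show_hit(hit: dict) -> None:
    """Decode the base64 image stored in Elasticsearch and render it."""
    image_bytes = base64.b64decode(hit["_source"]["image_data"])
    st.image(Image.open(io.BytesIO(image_bytes)), caption=hit["_source"]["image_name"])


with search_view:
    query = st.text_input("Describe the property, e.g. 'rustic home'")
    if query:
        columns = st.columns(len(MODELS))
        for column, (model_name, (index_name, embed_text)) in zip(columns, MODELS.items()):
            with column:
                st.subheader(model_name)
                for hit in knn_search(index_name, embed_text(query)):
                    show_hit(hit)

with images_view:
    # Browse the raw dataset stored in one of the indices
    response = es.search(index="images-clip", size=50, source=["image_name", "image_data"])
    for hit in response["hits"]["hits"]:
        show_hit(hit)
```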
We'll run the app with the command:
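```bash
streamlit run app.py
```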
Thanks to multimodality, we can run searches in our image database based on text (text-to-image similarity) or image (image-to-image similarity).
Searching with the UI
To compare the three models, we'll use a scenario in which a real estate website wants to improve its search experience by allowing users to search using an image or text. We'll discuss the results provided by each model.
We'll upload the image of a "rustic home":

Here we have the search results. As you can see, based on the image we uploaded, each model generated different results:

In addition, you can see results based on text queries for the house features:


If we search for “modern”, all three models show good results, but JinaCLIP and Cohere show the same houses in the first positions.
Features Comparison
Below you have a summary of the main features and prices of the three options we covered in this article:
| Model | Created by | Estimated Price | Features |
|---|---|---|---|
| CLIP | OpenAI | $0.00058 per run on Replicate (https://replicate.com/krthr/clip-embeddings) | General multimodal model for text and image; suitable for a variety of applications with no specific training. |
| JinaCLIP | Jina AI | $0.018 per 1M tokens on Jina (https://jina.ai/embeddings/) | CLIP variant optimized for multimodal applications. Improved precision retrieving texts and images. |
| Embed-3 | Cohere | $0.10 per 1M text tokens, $0.0001 per image at Cohere (https://cohere.com/pricing) | Focused on enterprise data. Improved retrieval of complex visual data like graphs and charts. |
If you plan to search over long image descriptions, or want to do text-to-text as well as image-to-text search, you should discard CLIP, because both JinaCLIP and Embed-3 are optimized for those use cases.
Between those two, JinaCLIP is a general-purpose model, while Cohere's Embed-3 is more focused on enterprise data like product images or charts.
When testing the models on your data, make sure you cover:
- All modalities you are interested in: text-to-image, image-to-text, text-to-text
- Long and short image descriptions
- Similar concept matches (different images of the same type of object)
- Negatives:
  - Hard negatives: similar to the expected output but still wrong
  - Easy negatives: not similar to the expected output and wrong
- Challenging scenarios:
  - Different angles/perspectives
  - Various lighting conditions
  - Abstract concepts ("modern", "cozy", "luxurious")
- Domain-specific cases:
  - Technical diagrams or charts (especially for Embed-3)
  - Product variations (color, size, style)
Conclusion
Though CLIP is the preferred model when doing image similarity search, there are both commercial and non-commercial alternatives that can perform better in some scenarios.
JinaCLIP is a robust all-in-one tool that claims to be more precise than CLIP in text-to-text embeddings.
Embed-3 follows Cohere's line of catering to business clients by training its models on real data from typical business documents.
In our small experiment, we could see that both JinaCLIP and Cohere show interesting image-to-image and text-to-image results and perform very similarly to CLIP on these kinds of tasks.
Elasticsearch allows you to search embeddings, combining vector search with full-text search, enabling you to search both the images and the text stored with them.
Want to get Elastic certified? Find out when the next Elasticsearch Engineer training is running!
Elasticsearch is packed with new features to help you build the best search solutions for your use case. Dive into our sample notebooks to learn more, start a free cloud trial, or try Elastic on your local machine now.