
Using TwelveLabs’ Marengo video embedding model with Amazon Bedrock and Elasticsearch

Creating a small app to search video embeddings from TwelveLabs' Marengo model.


This blog post explores TwelveLabs’ new Bedrock integration for its video embedding model, Marengo, and demonstrates how to use the resulting video embeddings with the Elasticsearch vector database. The walkthrough below details how this combination can be used to search through trailers for recent summer blockbuster movies.

Motivation

Real data isn’t just text. In today’s world of TikTok, work video calls, and live-streamed conferences, content is increasingly video-based. This is also true in the enterprise space. Whether for public safety, compliance audits, or customer satisfaction, multi-modal AI has the potential to unlock audio and video for knowledge applications.

However, when I search through mountains of content, I often get frustrated that I can’t find a video unless the words I am searching for were captured in metadata or spoken aloud in the recording. The expectation of “it just works” in the era of mobile apps has shifted to “it just understands my data” in the era of AI. To achieve this, AI needs to access video natively, without first converting it to text.

Capabilities like spatial reasoning and video understanding have applications in both video search and robotics. Adding video understanding to our toolset will be an important step toward building AI systems that can go beyond text.

Video model superpowers

Before working with dedicated video models, my standard approach was to generate audio transcriptions using a model like Whisper, combined with dense vector embeddings from an image model for still frames extracted from the video. This approach works well for some videos, but fails when the subject changes rapidly or when the meaning is carried by the motion of the filmed subjects rather than by any single frame.

Simply put, relying solely on an image model leaves out much of the information stored in the video content.

I was first introduced to TwelveLabs a few months ago through their SaaS platform, which allows you to upload videos for one-stop-shop asynchronous processing. They have two model families:

Marengo is a multi-modal embedding model that can capture meaning not just from still images but also from moving video clips—similar to how a text embedding model can capture meaning from a whole paragraph chunk and not just single words.

Pegasus is a video understanding model that can be used to generate captions or answer RAG-style questions using clips as context.

While I liked the ease of use and APIs of the SaaS service, uploading data isn’t always permissible. My customers often have terabytes of sensitive data that is not allowed to leave their control. This is where AWS Bedrock comes into play.

TwelveLabs has made their premier models available on the on-demand Bedrock platform, allowing source data to remain in my controlled S3 buckets and be accessed only in a secure-computing pattern, without persisting in a third-party system. This is fantastic news because enterprise video use cases often involve trade secrets, records with PII, or other information subject to strict security and privacy regulations.

I suspect the Bedrock integration unblocks many use cases.

Let’s search some movie trailers

Note: Full code for Python imports and working with environment variables through a .env file is in the Python notebook version of the code.

Dependencies:

  • You will need an S3 bucket that can be written to by your AWS ID
  • You will need the host URL and an API key for your Elasticsearch, either deployed locally or in Elastic Cloud
  • This code assumes Elasticsearch version 8.17+ or 9.0+
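
As a rough stand-in for the notebook's setup cell, here's a minimal sketch that loads the values above from a .env file and creates the clients used in the rest of this walkthrough. The environment variable names here are illustrative, so match them to your own .env:

```python
import os

import boto3
from dotenv import load_dotenv
from elasticsearch import Elasticsearch

load_dotenv()  # pull endpoints and credentials from a local .env file

# Illustrative variable names; adjust to whatever your .env actually contains
ES_URL = os.environ["ELASTICSEARCH_URL"]
ES_API_KEY = os.environ["ELASTICSEARCH_API_KEY"]
S3_BUCKET = os.environ["S3_BUCKET"]
AWS_REGION = os.environ.get("AWS_REGION", "us-east-1")

es = Elasticsearch(ES_URL, api_key=ES_API_KEY)
bedrock = boto3.client("bedrock-runtime", region_name=AWS_REGION)
s3 = boto3.client("s3", region_name=AWS_REGION)
```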

A great data source for quick testing is movie trailers. They have fast edits, are visually stunning, and often contain high-action scenes. Grab your own .mp4 files or use https://github.com/yt-dlp/yt-dlp to access files from YouTube at a small scale.
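
As one option, here's a minimal yt-dlp sketch that downloads trailers into a local trailers/ folder. The URL is a placeholder, so point it at the trailers you want to test with:

```python
import yt_dlp

ydl_opts = {
    "format": "mp4",                          # keep a single .mp4 per trailer
    "outtmpl": "trailers/%(title)s.%(ext)s",  # write files into ./trailers
}

with yt_dlp.YoutubeDL(ydl_opts) as ydl:
    # Placeholder URL: replace with the trailer(s) you want to download
    ydl.download(["https://www.youtube.com/watch?v=<trailer-id>"])
```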

Once the files are in our local file system, we’ll need to upload them to our S3 bucket:
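
Something like this boto3 sketch works, reusing the s3 client and S3_BUCKET from the setup sketch above:

```python
import pathlib

for path in pathlib.Path("trailers").glob("*.mp4"):
    key = f"trailers/{path.name}"
    s3.upload_file(str(path), S3_BUCKET, key)   # s3 client and S3_BUCKET from the setup sketch
    print(f"uploaded s3://{S3_BUCKET}/{key}")
```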

Now we can create our video embeddings using asynchronous Bedrock calls:
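
Here's a hedged sketch of that asynchronous flow, reusing the bedrock and s3 clients above. The model ID, payload shape, and output key layout follow the Marengo-on-Bedrock samples as I understand them, so verify them against the current Bedrock documentation:

```python
import json
import pathlib
import time

MODEL_ID = "twelvelabs.marengo-embed-2-7-v1:0"   # assumed model ID; confirm in the Bedrock model catalog

def embed_video(s3_uri: str) -> list[dict]:
    """Start an async Marengo embedding job for one video and wait for its result."""
    job = bedrock.start_async_invoke(
        modelId=MODEL_ID,
        modelInput={
            "inputType": "video",                            # payload shape is an assumption
            "mediaSource": {"s3Location": {"uri": s3_uri}},
        },
        outputDataConfig={"s3OutputDataConfig": {"s3Uri": f"s3://{S3_BUCKET}/embeddings/"}},
    )
    arn = job["invocationArn"]

    # Poll until the async job leaves the InProgress state
    while True:
        status = bedrock.get_async_invoke(invocationArn=arn)["status"]
        if status != "InProgress":
            break
        time.sleep(10)
    if status != "Completed":
        raise RuntimeError(f"embedding job for {s3_uri} ended with status {status}")

    # The job writes a result JSON under the output prefix; the exact key layout is an
    # assumption, so inspect your bucket if the lookup below comes up empty
    prefix = f"embeddings/{arn.split('/')[-1]}/"
    listing = s3.list_objects_v2(Bucket=S3_BUCKET, Prefix=prefix)
    key = next(o["Key"] for o in listing.get("Contents", []) if o["Key"].endswith("output.json"))
    body = s3.get_object(Bucket=S3_BUCKET, Key=key)["Body"].read()
    return json.loads(body)["data"]                          # list of clip-level segment embeddings

videos = []
for path in pathlib.Path("trailers").glob("*.mp4"):
    videos.append({
        "title": path.stem,
        "segments": embed_video(f"s3://{S3_BUCKET}/trailers/{path.name}"),
    })
```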

Now that we’ve got the embeddings in our local in-memory video objects, a quick test-print confirms what came back: each trailer now carries a list of clip-level segments, each with a 1024-dimensional embedding vector.
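
Here's a sketch of that check against the videos list built above; the startSec/endSec field names are my assumption based on the Marengo output samples:

```python
first = videos[0]
print(first["title"], "->", len(first["segments"]), "clip embeddings")

seg = first["segments"][0]
print("clip window:", seg.get("startSec"), "to", seg.get("endSec"), "seconds")
print("embedding dimensions:", len(seg["embedding"]))
```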

Inserting into Elasticsearch

We’ll upload the objects—in my case, the metadata and embeddings for about 155 video clips—to Elasticsearch. At such a small scale, using a flat float32 index for brute-force nearest neighbor is the most efficient and cost-effective approach. However, the example below demonstrates how to create a different index for each popular quantization level supported by Elasticsearch for large-scale use cases. See this article on Elastic’s Better Binary Quantization (BBQ) feature.
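
Here's a sketch of that setup, reusing the es client and videos list from the earlier steps. It creates one index per quantization level and bulk-loads every clip embedding into each; note that the bbq_hnsw option requires a recent Elasticsearch version:

```python
from elasticsearch.helpers import bulk

QUANTIZATION_TYPES = ["flat", "int8_hnsw", "bbq_hnsw"]   # flat = brute-force over float32

def create_video_index(name: str, index_type: str) -> None:
    es.indices.create(
        index=name,
        mappings={
            "properties": {
                "title": {"type": "keyword"},
                "start_sec": {"type": "float"},
                "end_sec": {"type": "float"},
                "embedding": {
                    "type": "dense_vector",
                    "dims": 1024,
                    "similarity": "cosine",
                    "index_options": {"type": index_type},
                },
            }
        },
    )

for index_type in QUANTIZATION_TYPES:
    index_name = f"trailers-{index_type}"
    create_video_index(index_name, index_type)
    bulk(es, (
        {
            "_index": index_name,
            "_source": {
                "title": video["title"],
                "start_sec": seg.get("startSec"),
                "end_sec": seg.get("endSec"),
                "embedding": seg["embedding"],
            },
        }
        for video in videos
        for seg in video["segments"]
    ))
```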

The Bedrock implementation for TwelveLabs also supports asynchronous calls that write text-to-vector embeddings to S3. Below, however, we’ll use the lower-latency synchronous invoke_model call to get a text embedding directly for our search query. (Text Marengo documentation samples are here.)
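
Here's a sketch of that query path, reusing MODEL_ID and the bedrock and es clients defined above. The text payload shape is my assumption from the Marengo text samples:

```python
def search_trailers(query: str, index_name: str = "trailers-bbq_hnsw", k: int = 5) -> list[dict]:
    # Synchronous call for a single text embedding
    response = bedrock.invoke_model(
        modelId=MODEL_ID,
        body=json.dumps({"inputType": "text", "inputText": query}),   # assumed payload shape
    )
    query_vector = json.loads(response["body"].read())["data"][0]["embedding"]

    # Approximate kNN search against the clip embeddings
    result = es.search(
        index=index_name,
        knn={
            "field": "embedding",
            "query_vector": query_vector,
            "k": k,
            "num_candidates": 10 * k,
        },
        source=["title", "start_sec", "end_sec"],
    )
    return result["hits"]["hits"]

for hit in search_trailers("a spaceship landing in a storm"):
    print(f'{hit["_score"]:.3f}  {hit["_source"]["title"]}')
```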

The returned JSON is our search result! But to create an easier-to-use interface for testing, we can wire up some quick IPython widgets:
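
Here's a minimal sketch with ipywidgets, wired to the search_trailers helper from the previous sketch:

```python
import ipywidgets as widgets
from IPython.display import display

query_box = widgets.Text(description="Query:", placeholder="what should the clip show?")
search_btn = widgets.Button(description="Search")
results = widgets.Output()

def run_search(_button):
    with results:
        results.clear_output()
        for hit in search_trailers(query_box.value):
            src = hit["_source"]
            print(f'{hit["_score"]:.3f}  {src["title"]}  ({src["start_sec"]}-{src["end_sec"]}s)')

search_btn.on_click(run_search)
display(widgets.HBox([query_box, search_btn]), results)
```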

Let’s run a search for something visual in our trailers.

Comparing quantization methods

Newer versions of Elasticsearch default to bbq_hnsw for 1024-dimension dense vectors, which provides the best speed and scalability while preserving accuracy by rescoring against the original float32 vectors within an oversampled candidate window.

For a simple UI to compare the impact of quantization on search results, check out a new project called Relevance Studio.

If we check our index management in Kibana or with a curl to GET /_cat/indices, we’ll see that each option is roughly the same size in storage. At first glance, this can be confusing, but remember that the sizes are about equal because every index retains the float32 representation of the vector for rescoring. In the bbq_hnsw index, only the quantized binary representation of the vector is used in the HNSW graph, leading to cost and performance savings for indexing and search.
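
If you created the indices with the names from the sketch above, the same check from the Python client looks like this:

```python
# Roughly equivalent to `GET /_cat/indices` in Kibana Dev Tools
print(es.cat.indices(index="trailers-*", v=True, h="index,docs.count,store.size"))
```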

Last thoughts

These are impressive results for single 1024-dimensional dense vectors, and I’m excited to try combining the power of the Marengo model with hybrid search approaches that include audio transcriptions as well as Elasticsearch’s geospatial filters and RBAC/ABAC access controls. What videos do you wish AI knew everything about?
