
Exploring Vertex AI with Elasticsearch

Discover how to integrate Vertex AI with Elasticsearch to create a RAG application. Follow this tutorial to configure a Gemini model and use it in Kibana's Playground.

Starting with Elasticsearch version 9.1.0, you can integrate Vertex AI models, including Gemini, and use them with Elasticsearch. This version adds completion and chat_completion capabilities to the existing embedding and reranking functionality, so you can configure these models through an AI connector.

Vertex AI gives you access to models like Gemini 2.5 Pro and Flash, which are well suited to reasoning and text generation in RAG workflows. Additionally, Vertex AI lets you deploy your own models for further customization and fine-tuning.

We picked gemini-2.5-flash-lite because it offers the best balance between price and performance while scoring well on reasoning benchmarks. It ranks among the fastest and cheapest models, which makes it a good choice to start with; if we need more power, we can switch to gemini-2.5-pro. Gemini 2.5 Flash-Lite is ideal for low-latency, high-volume data processing, for example, RAG applications like the one we are going to create.

In this article, you’ll learn how to configure a basic Vertex AI model in Elasticsearch to use it from Kibana’s Playground. We will set up our GCP service account and configure gemini-2.5-flash-lite to create a RAG application with Playground.

Here’s a diagram of our basic configuration:

Setting up the Vertex AI Connector

The first step is creating a service account in GCP to utilize the Vertex AI Platform. If you already have one, just skip this step, but make sure you have the authentication JSON file at hand and that the account has the Vertex AI User and Service Account Token Creator roles assigned.

Creating a GCP service account

To create a GCP service account, you must go to this link, choose the project that will have the account, and click on “+ Create service account.”

Choose a name for the service account and click on “Create and continue.” On the next menu, add the permissions for the following two roles:

  • Vertex AI User.
  • Service Account Token Creator: This role allows the account to generate the necessary access tokens.

Click on “Done.”

Once the service account is created, you must download the JSON access key. At the next link, select the account you’ve just created, go to “Keys,” click on “Add key,” and then “Create new key.”

In the pop-up window, make sure that JSON is marked as the key type and then click on “Create.”

This will download a JSON key that you’ll need for the next steps.
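
Before moving on, you can optionally confirm the key works. The snippet below is a minimal sketch, assuming the google-auth Python package is installed and the key was saved as vertex-sa-key.json (a hypothetical path); it simply requests an access token with the key.

```python
# Optional sanity check: confirm the downloaded key can obtain an access token.
# Assumes `pip install google-auth`; "vertex-sa-key.json" is a hypothetical path.
from google.oauth2 import service_account
from google.auth.transport.requests import Request

credentials = service_account.Credentials.from_service_account_file(
    "vertex-sa-key.json",
    scopes=["https://www.googleapis.com/auth/cloud-platform"],
)

# refresh() requests a token; it raises if the key is malformed or revoked.
# Missing roles only surface later, when you actually call Vertex AI.
credentials.refresh(Request())
print("Token acquired, expires at:", credentials.expiry)
```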

Creating an Elasticsearch cluster

To consume the Vertex model, we’ll create an Elastic Cloud serverless cluster by registering here, but you can choose the deployment type that fits your needs. For this tutorial, we’ll select the search use case.

The form will then ask you to choose a cloud provider and region, and to select an “optimized for vectors” project. This last step is only required on serverless deployments.

Once the cluster is deployed, go to Kibana for the next steps.

Creating an AI Connector

Now that your cluster is up and you have access to Vertex AI, you can create the connector. In Kibana, go to the Connectors menu (Management > Stack Management > Alerts and Insights > Connectors). Then, create a connector and select AI Connector.

Configure the connector with these parameters:

  • Connector name: Vertex AI.
  • Service: Google Vertex AI.
  • JSON Credentials: Here, you need to copy/paste the full content of the access key you created in the previous steps.
  • GCP Project: ID of the project where the service account and the Vertex AI models are.
  • GCP Region: Region where the models are (us-central1 has access to most Gemini models).
  • Model ID: gemini-2.5-flash-lite
  • Task Type: chat_completion.

Your connector should look like this:

Besides this configuration, you have “additional options” that allow you to define key properties both for the model and the inference endpoint that will be available through the connector.

  • Rate limit: Optionally define the maximum number of requests per minute to send.
  • Task type: Task to carry out with the model. This new version adds completion and chat_completion:
    • Completion: The model receives a prompt and generates the most probable continuation. There are no turns, roles, or any conversation structure. It’s useful for simple tasks like completing code, generating continuous text, or replying to direct questions without previous context.
    • Chat Completion: This mode structures the request with roles (system, user, assistant) and allows you to handle multi-turn interactions. Internally, the model not only predicts the next token but does so based on the conversation’s intent.
  • Inference Endpoint: When you create the connector, an inference endpoint is generated that identifies the model with the configured task. We can define an ID and use it in the inference APIs and Kibana (see the sketch after this list).
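
For reference, the connector essentially provisions this inference endpoint for you. The sketch below shows a roughly equivalent way to create it directly with the _inference API; the endpoint ID, URL, API key, and project values are placeholders, and the service_settings field names should be double-checked against the Google Vertex AI inference integration docs for your Elasticsearch version.

```python
# A rough equivalent of what the AI Connector provisions: a chat_completion
# inference endpoint backed by Google Vertex AI, created via the _inference API.
# ES_URL, API_KEY, the project ID, and the endpoint ID "vertex_ai_chat" are
# placeholders; verify the service_settings field names for your ES version.
import requests

ES_URL = "https://my-deployment.es.us-central1.gcp.cloud.es.io"  # placeholder
API_KEY = "..."  # an Elasticsearch API key allowed to manage inference endpoints

with open("vertex-sa-key.json") as f:  # the key downloaded earlier (hypothetical name)
    service_account_json = f.read()

resp = requests.put(
    f"{ES_URL}/_inference/chat_completion/vertex_ai_chat",
    headers={"Authorization": f"ApiKey {API_KEY}"},
    json={
        "service": "googlevertexai",
        "service_settings": {
            "service_account_json": service_account_json,
            "model_id": "gemini-2.5-flash-lite",
            "location": "us-central1",
            "project_id": "my-gcp-project",
        },
    },
)
print(resp.status_code, resp.json())
```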

Using the model in Kibana’s Playground

Uploading data

To test the model, we need some data and confirmation that the _inference API is working. From version 8.17 onwards, machine learning model deployments are dynamic, which means that to download and make the E5 multilingual dense vector model available, you only need to use it.

When you generate the embeddings, the model will be downloaded, and the inference endpoint will run automatically.
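
As a quick sanity check, you can call the preconfigured E5 endpoint through the _inference API before uploading anything; a minimal sketch with placeholder URL and API key is shown below. The first call may take a while because it triggers the model download and deployment.

```python
# Ask the preconfigured E5 endpoint for an embedding to confirm the _inference
# API works. ES_URL and API_KEY are placeholders for your deployment.
import requests

ES_URL = "https://my-deployment.es.us-central1.gcp.cloud.es.io"  # placeholder
API_KEY = "..."

resp = requests.post(
    f"{ES_URL}/_inference/text_embedding/.multilingual-e5-small-elasticsearch",
    headers={"Authorization": f"ApiKey {API_KEY}"},
    json={"input": ["Casa Tinta Bistro opening hours"]},
)
embedding = resp.json()["text_embedding"][0]["embedding"]
print(len(embedding))  # E5 small produces 384-dimensional vectors
```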

Now, let’s upload the text below as RAG context:

Casa Tinta Bistro is a small, family-run restaurant located in the Chapinero neighborhood of Bogotá, Colombia. It was founded in 2019 by siblings Mariana and Lucas Herrera, who combined their love for traditional Colombian flavors with a modern twist. The bistro is best known for its creamy coconut ajiaco, mango-infused arepas, and handcrafted guava lemonade.

The restaurant operates Tuesday through Sunday, from 12:00 PM to 9:30 PM, and closes on Mondays. They offer vegetarian and vegan options, and their menu changes slightly every season to incorporate fresh local ingredients. Casa Tinta also hosts monthly poetry nights, where local writers perform their work in front of a small crowd of regulars and newcomers alike.

Although it remains a hidden gem for most tourists, Casa Tinta has a loyal base of local customers and consistently ranks high on community food blogs and private reviews.

Store the text in a .txt file and go to Elasticsearch > Home > Upload a file.

Click on the button or drag and drop the file over the “Upload data” box. Next, click Import.

Then, select the tab “Advanced” and name the index “bistro_restaurant1.”

Then, click on “Add additional field,” and choose “Add semantic text field.” Change the inference endpoint to “.multilingual-e5-small-elasticsearch.” The configuration should look like this:

To finalize, click on “Add” and then “Import.”
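
If you prefer the API over the file uploader, the sketch below creates a roughly equivalent index with a semantic_text field wired to the E5 endpoint and indexes the restaurant description. The index name matches the one above; the field names (content, content_semantic) are illustrative and may differ from what the uploader generates.

```python
# API alternative to the file uploader: create an index whose text is copied
# into a semantic_text field backed by the E5 endpoint, then index the
# restaurant description. Field names are illustrative.
import requests

ES_URL = "https://my-deployment.es.us-central1.gcp.cloud.es.io"  # placeholder
API_KEY = "..."
HEADERS = {"Authorization": f"ApiKey {API_KEY}"}

requests.put(
    f"{ES_URL}/bistro_restaurant1",
    headers=HEADERS,
    json={
        "mappings": {
            "properties": {
                "content": {"type": "text", "copy_to": "content_semantic"},
                "content_semantic": {
                    "type": "semantic_text",
                    "inference_id": ".multilingual-e5-small-elasticsearch",
                },
            }
        }
    },
)

requests.post(
    f"{ES_URL}/bistro_restaurant1/_doc?refresh=true",
    headers=HEADERS,
    json={"content": "Casa Tinta Bistro is a small, family-run restaurant ..."},  # paste the full text here
)
```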

Once uploaded, we can work with this data in Playground.

Testing RAG in Playground

Go to Elasticsearch > Playground in Kibana.

On the Playground screen, you should see a green checkmark and the message “LLM Connected” to indicate that the Vertex connector we’ve just created exists. You can check this link for a more in-depth Playground guide.

Click on the blue button “Add data sources” and choose the bistro_restaurant1 index we’ve just created.

In Playground, we define the model’s prompt as “You are an assistant for question-answering tasks about the Casa Tinta Bistro restaurant.” Leave the rest of the configuration at its defaults.

Now, we can ask the model any question about the restaurant, and it will consult the index to provide a proper answer.

For example, we can ask about opening hours, and we’ll get the “sources” for the answer. These refer to the IDs of the documents where the information was found.
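
To see which documents back those sources, you can reproduce the retrieval step with a semantic query against the semantic_text field. This is a sketch using the illustrative field name from the earlier mapping; adjust it to whatever field the uploader actually created.

```python
# Reproduce Playground's retrieval step: a semantic query against the
# semantic_text field returns the documents whose IDs show up as "sources".
import requests

ES_URL = "https://my-deployment.es.us-central1.gcp.cloud.es.io"  # placeholder
API_KEY = "..."

resp = requests.post(
    f"{ES_URL}/bistro_restaurant1/_search",
    headers={"Authorization": f"ApiKey {API_KEY}"},
    json={
        "query": {
            "semantic": {
                "field": "content_semantic",  # illustrative field name
                "query": "What time does Casa Tinta Bistro open?",
            }
        }
    },
)
for hit in resp.json()["hits"]["hits"]:
    print(hit["_id"], hit["_score"])
```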

When you ask a question outside the RAG’s context, the model replies with “The provided context does not contain this information” since the answers are grounded in the data.
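
Outside Playground, you can also call the Gemini endpoint directly through the chat completion inference API. chat_completion endpoints are served through the streaming _stream route, so the reply arrives as server-sent events; the endpoint ID below assumes the placeholder ID from the earlier sketch (use the ID shown by your connector instead).

```python
# Call the Gemini chat_completion endpoint directly. chat_completion tasks are
# served through the streaming route, so the answer arrives as server-sent
# events. "vertex_ai_chat" is the placeholder endpoint ID used earlier.
import requests

ES_URL = "https://my-deployment.es.us-central1.gcp.cloud.es.io"  # placeholder
API_KEY = "..."

resp = requests.post(
    f"{ES_URL}/_inference/chat_completion/vertex_ai_chat/_stream",
    headers={"Authorization": f"ApiKey {API_KEY}"},
    json={"messages": [{"role": "user",
                        "content": "What days is Casa Tinta Bistro closed?"}]},
    stream=True,
)
for line in resp.iter_lines():
    if line:
        print(line.decode())  # each event carries a chunk of the model's reply
```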

Conclusion

With the new Vertex AI integration, you can easily use models like Gemini to create a RAG application in Playground that provides answers grounded in your indexed data. Now, take the next step and decide what other sources to index, choose another Vertex AI model, or deploy your own, and put RAG to work for your specific use cases.
