In this article, we'll learn how to connect local models to Elasticsearch using Ollama, and then ask questions about your documents using Playground.
Elasticsearch allows users to connect to LLMs through the Open Inference API, supporting providers such as Amazon Bedrock, Cohere, Google AI, Azure AI Studio, and HuggingFace, among others.
Ollama is a tool that allows you to download and execute LLM models using your own infrastructure (your local machine/server). Here you can find a list of the available models that are compatible with Ollama.
Ollama is a great option if you want to host and test different open source models without having to worry about each model's particular setup, or about how to build an API to access the model's functions: Ollama takes care of all of that.
Since the Ollama API is compatible with the OpenAI API, we can easily integrate the model and create a RAG application using Playground.
Prerequisites
- Elasticsearch 8.17
- Kibana 8.17
- Python
Steps
Setting up Ollama LLM server
We're going to set up an LLM server and connect it to our Playground instance using Ollama. We'll need to:
- Download and run Ollama.
- Use ngrok to expose the local web server hosting Ollama to the internet.
Download and run Ollama
To use Ollama, we first need to download it. Ollama offers support for Linux, Windows, and macOS, so just download the version compatible with your OS here. Once Ollama is installed, we can choose a model from this list of supported LLMs. In this example, we'll use llama3.2, a general-purpose multilingual model. The setup process also installs the Ollama command-line tool. Once that's done, you can run the following line:
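For llama3.2, that's simply:

```bash
# Download the llama3.2 model into the local Ollama library
ollama pull llama3.2
```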
This will download the model, printing the progress of each layer and ending with a success message once the pull is complete.
Once installed, you can test it with this command:
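Here we simply start an interactive session with the model we just pulled (running `ollama list` first is another way to confirm it's installed):

```bash
ollama run llama3.2
```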
Let's ask a question:
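For example, at the interactive prompt (the question itself is just an illustration, ask whatever you like):

```
>>> Why is the sky blue?
```

The model streams its answer back to the terminal. Type /bye to exit the session.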

With the model running, Ollama exposes an API that runs by default on port 11434. Let's make a request to that API, following the official documentation:
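A minimal request against the generate endpoint from the Ollama API docs looks like this:

```bash
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Why is the sky blue?"
}'
```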
This is the response we got:
Note that this endpoint streams the response back as a series of JSON objects.
Expose endpoint to the internet using ngrok
Since our endpoint runs in a local environment, it cannot be reached over the internet from anywhere else, such as our Elastic Cloud instance. ngrok allows us to expose a local port behind a public URL. Create an account in ngrok and follow the official setup guide.
Once the ngrok agent has been installed and configured, we can expose the port Ollama is using:
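Ollama listens on port 11434, so:

```bash
ngrok http 11434 --host-header="localhost:11434"
```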
Note: The --host-header="localhost:11434" flag guarantees that the "Host" header in the requests matches "localhost:11434".
Executing this command will return a public link that will work as long as both ngrok and the Ollama server keep running locally.
In "Forwarding" we can see that ngrok generated a URL. Save it for later.
Let's try making an HTTP request to the endpoint again, now using the ngrok-generated URL:
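Replace the host with the ngrok URL from the previous step (the domain below is just a placeholder):

```bash
curl https://<your-ngrok-url>.ngrok-free.app/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Why is the sky blue?"
}'
```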
The response should be similar to the previous one.
Creating mappings
ELSER endpoint
For this example, we'll create an inference endpoint using the Elasticsearch inference API. Additionally, we'll use ELSER to generate the embeddings.
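A sketch of that endpoint creation could look like the following, assuming we call the endpoint medicines-inference and serve ELSER through the elasticsearch inference service (adjust allocations and threads to your cluster):

```
PUT _inference/sparse_embedding/medicines-inference
{
  "service": "elasticsearch",
  "service_settings": {
    "num_allocations": 1,
    "num_threads": 1,
    "model_id": ".elser_model_2"
  }
}
```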
For this example, let's imagine that you have a pharmacy that sells two types of drugs:
- Drugs that require a prescription.
- Drugs that DO NOT require a prescription.
This information would be included in the description field of each drug.
The LLM must interpret this field, so this is the mapping we'll use:
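A mapping along these lines matches that description (the index name medicines and the inference endpoint from the previous step are assumptions for this example):

```
PUT medicines
{
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "copy_to": "semantic_field"
      },
      "text_description": {
        "type": "text",
        "copy_to": "semantic_field"
      },
      "semantic_field": {
        "type": "semantic_text",
        "inference_id": "medicines-inference"
      }
    }
  }
}
```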
The field text_description will store the plain text of the descriptions, while semantic_field, which is a semantic_text field type, will store the embeddings generated by ELSER. The copy_to property copies the content of the name and text_description fields into the semantic field so that embeddings are generated for them.
Indexing data
Now, let's index the data using the _bulk API.
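The dataset itself is up to you; an illustrative bulk request with a few sample entries (only Clonazepam is needed for the example question later on) could look like this:

```
POST medicines/_bulk
{ "index": {} }
{ "name": "Clonazepam", "text_description": "Clonazepam is a benzodiazepine used to treat seizure and panic disorders. It requires a prescription." }
{ "index": {} }
{ "name": "Ibuprofen", "text_description": "Ibuprofen is a nonsteroidal anti-inflammatory used for pain and fever. It does NOT require a prescription." }
{ "index": {} }
{ "name": "Loratadine", "text_description": "Loratadine is an antihistamine used for allergy symptoms. It does NOT require a prescription." }
```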
Response:
Asking questions using Playground
Playground is a Kibana tool that allows you to quickly create a RAG system using Elasticsearch indexes and an LLM provider. You can read this article to learn more about it.
Connecting the local LLM to Playground
We first need to create a connector that uses the public URL we generated earlier. In Kibana, go to Search > Playground and then click on "Connect to an LLM".

This action will reveal a menu on the left side of the Kibana interface. There, click on "OpenAI".

We can now start configuring the OpenAI connector.
Go to "Connector settings" and for the OpenAI provider, select "Other (OpenAI Compatible Service)":

Now, let's configure the other fields. For this example, we'll name the connector "medicines-llm". In the URL field, use the URL generated by ngrok, appending the path /v1/chat/completions (Ollama's OpenAI-compatible chat endpoint). In the "Default model" field, use "llama3.2". We won't use an API key, so just put any random text there to proceed:

Click on "Save" and add the index medicines by clicking on "Add data sources":


Great! We now have access to Playground using the LLM we're running locally as the RAG engine.

Before testing it, let's give the agent more specific instructions and increase the number of documents sent to the model to 10, so that the answer draws on as many documents as possible. The context field will be semantic_field, which includes the name and description of the drugs thanks to the copy_to property.

Now let's ask the question "Can I buy Clonazepam without a prescription?" and see what happens:
As expected, we got the correct answer.
Next steps
The next step is to create your own application! Playground provides a Python script that you can run on your machine and customize to meet your needs, for example, by putting it behind a FastAPI server to create a medicines Q&A chatbot consumed by your UI.
You can find this code by clicking the View code button in the top right section of Playground:

And use "Endpoints & API keys" to generate the ES_API_KEY environment variable required by the code.
For this particular example the code is the following:
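The exact script Playground generates can differ between versions; a simplified sketch of its structure (the function names and query shape here are approximations, not the verbatim generated code) looks like this:

```python
import os

from elasticsearch import Elasticsearch
from openai import OpenAI

# Connect to Elasticsearch using the endpoint and API key from "Endpoints & API keys"
es_client = Elasticsearch(
    "<your-elasticsearch-endpoint>",
    api_key=os.environ["ES_API_KEY"],
)

# By default, the generated code talks to OpenAI
openai_client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])


def get_elasticsearch_results(query):
    # Semantic search against the medicines index, returning up to 10 documents
    # as configured in Playground
    result = es_client.search(
        index="medicines",
        query={"semantic": {"field": "semantic_field", "query": query}},
        size=10,
    )
    return result["hits"]["hits"]


def create_prompt(results):
    # Build the context block from the retrieved documents
    context = ""
    for hit in results:
        context += f"{hit['_source']['name']}: {hit['_source']['text_description']}\n"
    return (
        "You are a pharmacy assistant. Answer using only the context below.\n"
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}"
    )


def generate_completion(system_prompt, question):
    # Send the context plus the user question to the LLM
    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    question = "my question"
    results = get_elasticsearch_results(question)
    prompt = create_prompt(results)
    print(generate_completion(prompt, question))
```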
To make it work with Ollama, you have to change the OpenAI client to connect to the Ollama server instead of the OpenAI server. You can find the full list of OpenAI examples and compatible endpoints here.
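With the OpenAI Python client, that just means passing a base_url that points at Ollama (use the ngrok URL instead of localhost if the script runs on a different machine; the API key is required by the client but ignored by Ollama):

```python
openai_client = OpenAI(
    base_url="http://localhost:11434/v1/",  # Ollama's OpenAI-compatible endpoint
    api_key="ollama",  # any placeholder value works
)
```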
And also change the model to llama3.2 when calling the completion method:
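In the sketch above, that's the generate_completion function:

```python
response = openai_client.chat.completions.create(
    model="llama3.2",  # the model we pulled with Ollama
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": question},
    ],
)
```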
Let's add our question, "Can I buy Clonazepam without a prescription?", to the Elasticsearch query:
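Following the structure of the sketch, this means using the question as the search input:

```python
question = "Can I buy Clonazepam without a prescription?"
results = get_elasticsearch_results(question)
```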
And also to the completion call with a couple of prints, so we can confirm we are sending the Elasticsearch results as part of the question context:
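And then we pass the retrieved context to the completion call, printing both along the way (again following the sketch above):

```python
prompt = create_prompt(results)
print("Context sent to the model:\n", prompt)

answer = generate_completion(prompt, question)
print("Answer:\n", answer)
```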
Now, let's install the dependencies and run the script:
pip install -qU elasticsearch openai
python main.py
You should see the retrieved context printed first, followed by the model's answer.
Conclusion
In this article, we saw the power and versatility of tools like Ollama when used together with the Elasticsearch inference API and Playground.
After a few simple steps, we had a working RAG application with a chat that used an LLM running on our own infrastructure at zero cost. This also gives us more control over resources and sensitive information, in addition to access to a variety of models for different tasks.
Want to get Elastic certified? Find out when the next Elasticsearch Engineer training is running!
Elasticsearch is packed with new features to help you build the best search solutions for your use case. Dive into our sample notebooks to learn more, start a free cloud trial, or try Elastic on your local machine now.