Everyone is talking about DeepSeek R1, the new large language model from DeepSeek, the Chinese AI lab backed by the hedge fund High-Flyer. The news is full of speculation about what it means for the industry now that a capable, chain-of-thought reasoning LLM is available with open weights. For those curious to try this new model with RAG and all the vector database smarts of Elasticsearch, here’s a quick tutorial to get you started with DeepSeek R1 using local inference. Along the way we’ll use Elastic’s Playground feature and discover some good and bad properties of DeepSeek R1 for RAG.
Here’s a diagram of what we’ll configure in this tutorial:
Setting up local inference with Ollama
Ollama is a great way to quickly test a curated set of open source models for local inference, and it’s a popular tool among AI developers.
Running Ollama bare metal
A local install on Mac, Linux, or Windows is the easiest way to make use of any local GPU capability you might have, especially for those with M series Apple chips. Once you have Ollama installed, you can download and run DeepSeek R1 with the following command.
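For example, to pull and run the 7B distillation (the same tag we’ll use later in the connector; swap it for whatever size fits your hardware):

```bash
# Downloads the model on first run, then opens an interactive chat
ollama run deepseek-r1:7b
```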
You might want to adjust the parameter size to something that suits your hardware. Available sizes are listed at https://ollama.com/library/deepseek-r1.
You can chat with the model in the terminal, but the model keeps running in the background after you exit the prompt with Ctrl+D or by typing “/bye”. To see that the model is still running, enter:
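```bash
# List models currently loaded in memory
ollama ps
```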
Running Ollama in a container
Alternatively, the quickest way to get Ollama running is with a container engine like Docker. Using your local machine’s GPU from a container isn’t always as simple, depending on your environment, but getting a quick test setup isn’t difficult as long as your container has the RAM and storage to hold the multi-GB models.
Getting Ollama up and running in Docker is as easy as executing:
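```bash
# Start Ollama in the background, exposing its API on port 11434 and
# bind-mounting ./ollama from the current directory for config and models
docker run -d -v "$(pwd)/ollama:/root/.ollama" -p 11434:11434 --name ollama ollama/ollama
```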
This will create a directory called “ollama” in the current directory and mount it inside the container to store the Ollama config as well as the models. Depending on the parameter count, models range from a few GB to tens of GB, so make sure you choose a volume with enough free space.
Note: If you have an NVIDIA GPU in your machine, make sure to install the NVIDIA Container Toolkit and add “--gpus=all” to the docker run command above.
Once the Ollama container is up and running on your machine, you can pull a model like deepseek-r1 with:
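```bash
# Pull and chat with DeepSeek R1 inside the running Ollama container
docker exec -it ollama ollama run deepseek-r1:7b
```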
Similar to the bare metal approach, you might want to adjust the parameter size to something that suits your hardware. Available sizes can be found at https://ollama.com/library/deepseek-r1.
Once the model finishes pulling, you can type “/bye” to quit the prompt. To verify the model is still running:
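```bash
# Check which models the containerized Ollama has loaded
docker exec -it ollama ollama ps
```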
Testing our local inference with curl
To test the local inference with curl, you can run the following command. We set "stream": false so that we can read the JSON response as a single document:
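```bash
# Non-streaming generation request; the prompt text is just a placeholder
curl http://localhost:11434/api/generate -d '{
  "model": "deepseek-r1:7b",
  "prompt": "Why is the sky blue?",
  "stream": false
}'
```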
Testing ‘OpenAI Compatible’ Ollama and a RAG prompt
Conveniently, Ollama also serves a REST endpoint that mimics the behavior of OpenAI for compatibility with a wide range of tools including Kibana.
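To see this, we can send a RAG-style chat completion to Ollama’s OpenAI-compatible endpoint. The system and user messages below are just an illustrative stand-in for the kind of prompt Playground will later build for us:

```bash
# Ollama's OpenAI-compatible chat endpoint; the messages are illustrative only
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-r1:7b",
    "messages": [
      {
        "role": "system",
        "content": "You are an assistant for question-answering tasks. Answer using only the provided context passages and cite them."
      },
      {
        "role": "user",
        "content": "Context: The March Hare and the Hatter were having tea; a Dormouse sat between them, fast asleep.\n\nQuestion: Who was at the tea party?"
      }
    ]
  }'
```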
Testing this more complex prompt results in content that has a <think> section where the model has been trained to reason through the problem.
Connecting Ollama to Kibana
A great way to use Elasticsearch is the “start-local” dev script.
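If you don’t already have a cluster, start-local stands up Elasticsearch and Kibana in Docker with a single command:

```bash
# Spin up a local Elasticsearch + Kibana dev environment
curl -fsSL https://elastic.co/start-local | sh
```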
Make sure your Kibana and Elasticsearch can reach your Ollama instance over the network. If you are using a local container setup of the Elastic Stack, that might mean replacing “localhost” with “host.docker.internal” or “host.containers.internal” to get a network path to the host machine.
In Kibana, navigate to Stack Management > Alerts and Insights > Connectors.
What to do if you see this common setup warning
You’ll need to make sure the xpack.encryptedSavedObjects.encryptionKey is set correctly. This is a commonly missed step when running a local Docker install of Kibana, so I’ll list the steps to fix it using Docker syntax.
Make sure you are persisting your kibana/config directory so changes are saved when the container shuts down. My Kibana container volumes look like this in docker-compose.yml:
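```yaml
# Excerpt from docker-compose.yml (other Kibana settings omitted);
# the host-side path is an example -- point it at your own kibana/config directory
services:
  kibana:
    volumes:
      - ./kibana/config:/usr/share/kibana/config
```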
Now you can create the keystore and put a value in so that Connector keys are not stored in plaintext.
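Assuming your Kibana container is named “kibana” (adjust the name to match your setup), something like this works:

```bash
# Create the Kibana keystore inside the container
docker exec -it kibana bin/kibana-keystore create

# Add the encryption key; you will be prompted for the value.
# Use a random string of at least 32 characters.
docker exec -it kibana bin/kibana-keystore add xpack.encryptedSavedObjects.encryptionKey
```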
Fully reboot your entire cluster to make sure the changes take effect.
Creating the Connector
From the Connector configuration screen (In Kibana, navigate to Stack Management > Alerts and Insights > Connectors), create a connector and select the “OpenAI” type.
Configure the connector with the following settings:
- Connector name: Deepseek (Ollama)
- Select an OpenAI provider: other (OpenAI Compatible Service)
- URL: http://localhost:11434/v1/chat/completions
- Adjust for the correct path to your Ollama instance. Remember to substitute host.docker.internal or equivalent if you are calling from within a container
- Default model: deepseek-r1:7b
- API key: enter any value; the field is required, but the value doesn’t matter
Note that testing a custom connector to Ollama in the connector setup is currently broken in 8.17, but has been fixed in the upcoming 8.18 build of Kibana.
Our connector looks like this:
Getting vector-embedded data into Elasticsearch
If you are already familiar with Playground and have data set up, you can skip to the Playground step below. If you need some quick test data, we’ll first need to make sure our _inference APIs are set up. Starting in 8.17, machine learning allocations are dynamic, so to download and turn on the E5 multilingual dense vector model we’ll just need to run the following in Kibana Dev Tools.
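One way to do this is to call the preconfigured E5 endpoint directly; the first inference request triggers the model download and starts an allocation (the input text is just a throwaway):

```
POST _inference/text_embedding/.multilingual-e5-small-elasticsearch
{
  "input": "warm up the multilingual E5 model"
}
```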
If you haven’t already downloaded it, this will trigger the download of the E5 model from Elastic’s model repositories.
Next, let’s load a public domain book as our RAG context. Here’s a place to download “Alice’s Adventures in Wonderland” from Project Gutenberg: link. Save this as a .txt file.
Navigate to Elasticsearch > Home > Upload a file
Select or drag and drop your text file and then hit the “Import” button.
On the “Import data” screen select the “Advanced” tab and then set the index name to “book_alice”.
Select the “Add additional field” option, which appears in small text right below “Automatically created fields”. Select “Add semantic text field” and change the inference endpoint to “.multilingual-e5-small-elasticsearch”. Select “Add” and then “Import”.
When the load and inferencing is done, we are ready to head to Playground.
Testing RAG in Playground
Navigate to Elasticsearch > Playground in Kibana.
On the Playground screen, you should see a green checkmark and “LLM Connected”, indicating that a connector exists. This is the Ollama connector we just created above. A longer guide to Playground can be found here.
Click the blue “Add data sources” button and select the book_alice index we created previously, or another index you’ve already configured that uses the inference APIs for embeddings.
DeepSeek is a chain-of-thought model with strong alignment characteristics. This is both good and bad from a RAG perspective. The chain-of-thought training may help DeepSeek rationalize seemingly contradictory statements in citations, but the strong alignment to its training knowledge may make it prefer its own version of world facts over our context grounding. While well intentioned, this strong alignment is known to make LLMs difficult to instruct when discussing topics where our private knowledge contradicts, or isn’t well represented in, the training data set.
In our Playground setup we entered the following system prompt, “You are an assistant for question-answering tasks using relevant text passages from the book Alice in Wonderland”, and accepted the other defaults.
To the question “Who was at the tea party?” we get the answer: “Answer: The March Hare, the Hatter, and the Dormouse were at the tea party. [Citation: position 1 and 2]” which is correct.
We can see in the <think> tags that DeepSeek definitely pondered the contents of the citations to answer the question.
Testing alignment limitations
Let’s create an intellectually challenging scenario for DeepSeek as a test. We’ll create an index of conspiracy theories that DeepSeek’s training data knows are not true.
In Kibana Dev Tools, let’s create the following index and data:
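(The index below is a sketch: the index name, field name, and theory wording are placeholders of my own, with a “birds aren’t real” entry to match the test question that follows.)

```
PUT conspiracy_theories
{
  "mappings": {
    "properties": {
      "description": {
        "type": "semantic_text",
        "inference_id": ".multilingual-e5-small-elasticsearch"
      }
    }
  }
}

POST conspiracy_theories/_bulk
{ "index": { "_id": "1" } }
{ "description": "Birds are not real. Every bird you see is a government surveillance drone." }
{ "index": { "_id": "2" } }
{ "description": "The moon is a hologram projected into the night sky." }
```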
These conspiracy theories will be our grounding for the LLM. Despite an aggressive system prompt, DeepSeek won’t accept our version of the facts. If we were in a situation where we knew our private data was more trustworthy, better grounded, or more aligned to our organization’s needs, this would not be acceptable:
To the test question “are birds real?” (see Know Your Meme for the backstory) we get the answer “In the provided context, birds are not considered real, but in reality, they are real animals. [Context: position 1]”. This test shows DeepSeek R1 is powerful, even at the 7B parameter level … however, it might not be the best choice for RAG, depending on our data set.
So what did we learn?
In summary:
- Running models locally in tools like Ollama is a great option for taking a peek at model behavior.
- DeepSeek R1 is a reasoning model, which means it has advantages and disadvantages for use cases like RAG.
- Playground is able to connect to inference hosting frameworks like Ollama through an OpenAI-like REST API, which is becoming a de facto standard in this early era of AI hosting.
Overall, we are impressed with how far local, “air gapped” RAG has come. The tools in Elasticsearch, Kibana, and the available open weights models have advanced significantly since we first wrote about Privacy-first AI Search in 2023.