
Turbocharge your troubleshooting: Building a RAG application for appliance manuals with Elasticsearch

Learn how to build a RAG app using Elasticsearch to effortlessly troubleshoot your home appliance issues.

RAG (Retrieval Augmented Generation) enhances LLMs’ capabilities by using external knowledge bases to enrich generated answers. This blog details the implementation of a simple RAG application built on Elasticsearch. This app aims to assist users with home appliance troubleshooting, answering common questions like "How do I reset my dishwasher to its default factory settings?"

We’ll guide you step-by-step, covering:

  1. Uploading an embedding model using the Eland library, and setting up an inference API in Elastic to use the uploaded model for text embedding
  2. Creating an Elasticsearch index that uses the semantic_text type to store PDF body content
  3. Setting up an inference endpoint for completion to interact with an LLM of your choice
  4. Retrieving documents relevant to the user's query with the semantic query type
  5. Bringing it all together in one amazing prototype

You will need:

  • An Elastic Cloud deployment running at least Elastic 8.18.x
  • An LLM API service (I used Azure OpenAI)

The proposed app is built on a simple Flask application that integrates seamlessly with Elasticsearch APIs to execute the RAG workflow. A simple frontend complements it, allowing users to upload a user manual for ingestion. The full code is available here.

This RAG app receives the PDF manual for the home appliance of your choice. Upon upload, it breaks the PDF document down page by page and sends the extracted text content of each page to Elasticsearch. There, text embeddings for the content are created by an inference model imported into Elasticsearch and stored as dense vectors. Once the manual is processed, users can input their queries, leveraging the provided knowledge to quickly find solutions to their appliance issues.

Data collection

As mentioned above, we deployed an Elasticsearch instance on an Elastic Cloud deployment. Elasticsearch cluster requests are governed by configuration settings, including http.max_content_length. This network HTTP setting, which limits the maximum size of an HTTP request body to a default of 100MB, is not currently configurable in Elastic Cloud.

To bypass this limitation for larger documents, I've implemented a simple Python function that splits the provided PDF into individual pages and stores them in a separate folder.

An extract of this function is provided below:
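A minimal sketch of such a page-splitting function, assuming the pypdf library (the library choice, function name, and paths are illustrative, not the exact code from the app):

```python
import os
from pypdf import PdfReader, PdfWriter

def split_pdf_into_pages(pdf_path: str, output_dir: str) -> list[str]:
    """Split a PDF into one single-page PDF per page and return the file paths."""
    os.makedirs(output_dir, exist_ok=True)
    reader = PdfReader(pdf_path)
    page_files = []
    for page_number, page in enumerate(reader.pages, start=1):
        writer = PdfWriter()
        writer.add_page(page)  # each output PDF contains exactly one page
        page_file = os.path.join(output_dir, f"page_{page_number}.pdf")
        with open(page_file, "wb") as f:
            writer.write(f)
        page_files.append(page_file)
    return page_files
```

Each per-page file can then be parsed and indexed individually, keeping every request comfortably below the http.max_content_length limit.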

Create embeddings for your PDF content

Text embedding refers to the process of converting text (or other data types) into a numerical vector representation—specifically, a dense vector. This is a key technique in modern search and machine learning applications, especially for semantic search or similarity search and generative AI.

Elastic offers different ways to rely on text embedding models:

  • Built-in models: These models are readily available within Elasticsearch; they are pre-trained and work out of the box, which means you don’t need to fine-tune them on your own data, making them adaptable for various use cases. For example, you can rely on:
    1. ELSER, which creates a sparse vector representation of text and supports only English documents.
    2. E5, which enables you to perform multi-lingual semantic search by using dense vector representations.
  • Importing HuggingFace models: Users can import pre-trained models from platforms like HuggingFace. This process leverages Eland, a Python client that provides a Pandas-compatible API for data exploration and analysis within Elasticsearch.

In this project, we are uploading a model from HuggingFace. You can install Eland as follows:
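Installing Eland with its PyTorch extras (required for importing NLP models) is a single pip command:

```bash
python -m pip install 'eland[pytorch]'
```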

Now, open a Bash editor and create a .sh script, filling out each parameter appropriately:
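A sketch of such a script, assuming API-key authentication against an Elastic Cloud deployment (all placeholder values must be replaced with your own):

```bash
#!/bin/bash

# Credentials for your Elastic Cloud deployment (placeholders).
CLOUD_ID="<your-cloud-id>"
ES_API_KEY="<your-api-key>"

# HuggingFace model to import.
MODEL_ID="sentence-transformers/all-MiniLM-L6-v2"

eland_import_hub_model \
  --cloud-id "$CLOUD_ID" \
  --es-api-key "$ES_API_KEY" \
  --hub-model-id "$MODEL_ID" \
  --task-type text_embedding \
  --start
```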

MODEL_ID refers to a model taken from HuggingFace. We will rely on all-MiniLM-L6-v2 because it is small, easily runnable on a CPU, and delivers great performance, especially for the purpose of this demo.

Run the Bash script, and once done, your model should appear in your Elastic deployment under Machine Learning -> Model Management -> Trained Models.

Create an inference endpoint

Starting from version 8.11, Elastic has made its inference API service generally available. This feature lets users create inference endpoints to perform various inference tasks using the Elastic service.

For this demo, we’ll create an endpoint called minilm-l6 to use the model uploaded via Eland. This endpoint will pass the correct model_id corresponding to the model we just uploaded.
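A request along these lines creates the endpoint. The model_id shown is the name Eland typically assigns to the imported model; check Trained Models in your deployment for the exact ID:

```
PUT _inference/text_embedding/minilm-l6
{
  "service": "elasticsearch",
  "service_settings": {
    "model_id": "sentence-transformers__all-minilm-l6-v2",
    "num_allocations": 1,
    "num_threads": 1
  }
}
```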

Create the index to store the PDF content

Next, we will create an index with a default mapping that utilizes the semantic_text field. This setup requires you to specify the previously created endpoint ID. This endpoint defines the inference service responsible for generating embeddings for the body field's content during ingestion.
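For example, assuming an index called manuals (the index name is illustrative; the body field matches the one used throughout this post):

```
PUT manuals
{
  "mappings": {
    "properties": {
      "body": {
        "type": "semantic_text",
        "inference_id": "minilm-l6"
      }
    }
  }
}
```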

For context, semantic_text was introduced in Elasticsearch 8.15, and it simplifies semantic search. For an in-depth understanding of our approach, we recommend reviewing our original blog post and the Elasticsearch documentation.

Set up a completion endpoint

After generating the main components, a crucial step involves creating an inference completion endpoint within elasticsearch. This endpoint facilitates interaction with your chosen Large Language Model (LLM) by generating prompts and providing the necessary authentication keys.

Specifically, you will use the completion task type (refer to the Elasticsearch documentation for more details).
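A sketch of the endpoint creation for Azure OpenAI (the endpoint name and all service settings values are placeholders to adapt to your own Azure resource):

```
PUT _inference/completion/openai-completion
{
  "service": "azureopenai",
  "service_settings": {
    "api_key": "<your-azure-openai-api-key>",
    "resource_name": "<your-resource-name>",
    "deployment_id": "<your-deployment-id>",
    "api_version": "<api-version>"
  }
}
```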

Once created, you can test the LLM integration by posting a simple question, as in the following example.
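For example, a simple test request against the endpoint created above might look like this:

```
POST _inference/completion/openai-completion
{
  "input": "What is a dishwasher ECO program?"
}
```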

The response contains the LLM's generated answer in the completion result.

Search through your data

When a user submits a query through the interface, it initiates a semantic search across existing content to identify relevant PDF passages. Passages that are semantically similar to the user's input are then integrated into the prompt for the LLM to generate the final answer. Semantic queries are streamlined using the previously introduced semantic_text field.

While our example provides a simplified approach, further details on querying semantic_text fields can be found here. A hybrid search approach, which combines classic lexical search (BM25) with semantic search, can be implemented in advanced applications that require enhanced relevance and precision.
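A minimal semantic query against the body field might look like this (index name as in the earlier mapping example):

```
GET manuals/_search
{
  "query": {
    "semantic": {
      "field": "body",
      "query": "How do I reset my dishwasher to its default factory settings?"
    }
  }
}
```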

Bringing it all together: A RAG application for appliance manuals on Elasticsearch

The final step in building our troubleshooting application involves seamlessly integrating the document retrieval process with the LLM to provide intelligent, context-aware responses.

The following system overview illustrates the interaction of each component, providing a high-level recap of the steps involved.

1. PDF ingestion, embedding & storage:

  • The process starts with raw PDF appliance manuals, which contain a wealth of text, images, and intricate layouts. Each PDF manual is broken down into individual pages, and the extracted text from each page is then ingested into Elasticsearch. Parsing libraries are used to extract text content and metadata (such as model number, manufacturer, and document type), and to identify key sections or figures.
  • The extracted text from each PDF page is divided into smaller, semantically coherent "chunks." These chunks are stored as dense vectors within an Elasticsearch index, leveraging the semantic_text field type and relying on the inference endpoint created for text_embedding tasks.

2. User query and context retrieval:

  • When a user poses a troubleshooting question (e.g., "How long does the ECO Program last?"), it serves as the starting point of the retrieval flow. The user's query is converted into a dense vector embedding using the same embedding model applied to the ingested document chunks. This ensures that the query vector and document chunk vectors exist in the same semantic space, allowing for meaningful similarity comparisons. The embedded user query is then used to perform a similarity search within our Elasticsearch document store, which returns the top-matching chunks. These are the chunks most likely to contain the relevant information needed to answer the question.

3. Prompt construction and LLM input:

  • Contextualization: The retrieved document chunks (raw text) are then dynamically inserted into a crafted prompt that will be sent to the LLM. The prompt typically includes:
    • Instructions for the LLM: Clear guidelines on how to use the provided information, the desired output format, and any constraints.
    • The user's original query: The actual question the user asked.
    • The retrieved context: The text content of the most relevant chunks, often explicitly labeled as "Context" or "Relevant Documents."
  • Example prompt structure:
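A sketch of what such a prompt template could look like in the Flask backend (the exact wording and helper names are illustrative):

```python
PROMPT_TEMPLATE = """You are a home appliance troubleshooting assistant.
Answer the user's question using ONLY the context below, taken from the
appliance manual. If the answer is not in the context, say you don't know.

Context:
{context}

Question:
{question}

Answer:"""

def build_prompt(question: str, retrieved_chunks: list[str]) -> str:
    # Join the text of the top retrieved passages into a single context block.
    context = "\n\n".join(retrieved_chunks)
    return PROMPT_TEMPLATE.format(context=context, question=question)
```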

4. LLM generation:

  • Answer synthesis: The fully constructed prompt, containing both the user's query and the relevant retrieved context, is then sent to the LLM. The LLM processes this input and generates a coherent, contextually accurate, and human-like response based solely on the provided information. It synthesizes the information from the retrieved chunks to formulate a direct answer to the user's question.
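Below is a condensed sketch of how the backend might tie retrieval and generation together using the Python Elasticsearch client. The index and endpoint names, and the build_prompt helper, follow the illustrative examples above, and the exact client method names may vary slightly between client versions:

```python
from elasticsearch import Elasticsearch

# Assumed names: adjust to your own deployment, index, and inference endpoints.
client = Elasticsearch(cloud_id="<your-cloud-id>", api_key="<your-api-key>")
INDEX_NAME = "manuals"
COMPLETION_ENDPOINT = "openai-completion"

def answer_question(question: str) -> str:
    # 1. Retrieve the most relevant manual pages with a semantic query.
    response = client.search(
        index=INDEX_NAME,
        size=3,
        query={"semantic": {"field": "body", "query": question}},
    )
    # On 8.18+, the original page text of a semantic_text field is returned in _source.
    chunks = [hit["_source"]["body"] for hit in response["hits"]["hits"]]

    # 2. Build the prompt from the retrieved context (see the sketch above).
    prompt = build_prompt(question, chunks)

    # 3. Ask the LLM through the completion inference endpoint.
    completion = client.inference.inference(
        inference_id=COMPLETION_ENDPOINT, input=prompt
    )
    # The generated text is returned under the "completion" key of the response.
    return completion["completion"][0]["result"]
```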

Here is an example of the output for our previous question: “How long does the ECO program last?”

Conclusion

In this article, we demonstrated how to build a RAG app with Elasticsearch by leveraging functionalities like semantic_text mapping. This simplifies the process of ingesting your data, making it easy to start using semantic search by providing sensible defaults. It also allows you to focus on your search and not on how to index, generate, or query your embeddings.

We also showcased how to leverage the inference API, which simplifies the use of machine learning models on your data. For instance, you can use it to perform inference operations (such as generating embeddings for semantic search) and to integrate with an LLM for answering prompts.

Hopefully, this application will also help you solve technical problems without having to spend some extra bucks calling a technician to come to your place.

