Unstructured data processing with NV‑Ingest, Unstructured, and Elasticsearch - Elasticsearch Labs

Learn how to build a scalable data pipeline for unstructured documents using NV-Ingest, Unstructured Platform, and Elasticsearch for RAG applications.

In this blog, we discuss how to implement a scalable data processing pipeline using NV-Ingest, the Unstructured Platform, and Elasticsearch. This pipeline transforms unstructured data from a data source into structured, searchable content ready for downstream AI applications, such as RAG. Retrieval-augmented generation (RAG) is an AI technique in which large language models (LLMs) are given external knowledge to generate responses to user queries. This allows LLM responses to be tailored to a specific context, making answers more accurate and relevant.

Before we get started, let’s take a look at the key components enabling this pipeline and what each brings to the table.

Pipeline components

NV-Ingest is a set of microservices for transforming unstructured documents into structured content and metadata. It handles document parsing, visual structure identification, and OCR processing at scale.

Unstructured is an ETL+ platform for orchestrating the entirety of unstructured data processing. It ingests unstructured data from multiple data sources, converts raw files into structured data through a configurable workflow engine, enriches the data with additional transformations, and uploads the results into vector stores, databases, and search engines. It provides a visual UI, APIs, and scalable backend infrastructure to orchestrate document parsing, enrichment, and embedding in a single workflow.

Elasticsearch is an industry-leading search and analytics engine that now includes native vector search capabilities. It can function as both a traditional text database and a vector database, enabling semantic search at scale with features like k-NN similarity search.

Now that we’ve introduced the core components, let’s take a look at how they work together in a typical workflow before diving into the implementation.

RAG with NV-Ingest - Unstructured - Elasticsearch

This post covers only the key highlights; the full code is in the accompanying notebook.

This blog can be divided into 3 parts:

  • Setting up the source and destination connectors
  • Setting up the workflow with Unstructured API
  • RAG over the processed data

An Unstructured workflow is represented as a directed acyclic graph (DAG). Two kinds of nodes, called connectors, control where the data is ingested from and where the processed results are uploaded to, and both are required in any workflow. A source connector configures ingestion of raw data from a data source, and a destination connector configures uploading of the processed data into a vector store, search engine, or database.

For this blog, we store research papers in Amazon S3 and want the processed data delivered into Elasticsearch for downstream use. This means that before we can build a data processing workflow, we need to use the Unstructured API to create a source connector for Amazon S3 and a destination connector for Elasticsearch.

Step 1: Setting up the S3 source connector

When creating a source connector, you need to give it a unique name, specify its type (e.g. S3 or Google Drive), and provide the configuration, which typically contains the location of the source you're connecting to (e.g. an S3 bucket URI or a Google Drive folder) and authentication details.
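As a rough sketch of what such a connector definition contains, the snippet below assembles the three pieces named above (a unique name, a type, and a configuration) as a plain Python dictionary. The helper name `build_s3_source_connector`, the field names (`remote_url`, `key`, `secret`, `recursive`), the connector name, and the bucket URI are all illustrative assumptions, not the confirmed request schema of the Unstructured API; check the Unstructured documentation for the actual format.

```python
import os


def build_s3_source_connector(name, remote_url, key, secret, recursive=True):
    """Assemble a source connector definition as a plain dict.

    All field names here are assumptions for illustration only.
    """
    if not remote_url.startswith("s3://"):
        raise ValueError("remote_url must be an S3 URI such as s3://bucket/prefix/")
    return {
        "name": name,            # unique connector name
        "type": "s3",            # connector type
        "config": {              # location plus authentication details
            "remote_url": remote_url,
            "recursive": recursive,
            "key": key,
            "secret": secret,
        },
    }


source_connector = build_s3_source_connector(
    name="research-papers-s3",                  # hypothetical name
    remote_url="s3://example-bucket/papers/",   # hypothetical bucket
    # Read credentials from the environment rather than hard-coding them.
    key=os.environ.get("AWS_KEY", "<aws-access-key-id>"),
    secret=os.environ.get("AWS_SECRET", "<aws-secret-access-key>"),
)
```

Keeping credentials in environment variables, as sketched here, avoids committing secrets alongside the pipeline code.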

Step 2: Setting up the Elasticsearch destination connector

Next, let’s set up the Elasticsearch destination connector. The Elasticsearch index that you use must have a schema that is compatible with the schema of the documents that Unstructured produces. You can find all the details in the documentation.
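To make the schema requirement more concrete, here is a minimal sketch of an index mapping with a text field and a `dense_vector` field for embeddings, followed by a destination connector definition as a plain dictionary. The `dense_vector` options (`dims`, `index`, `similarity`) are standard Elasticsearch mapping parameters, but the field names (`element_id`, `text`, `embeddings`), the connector fields (`hosts`, `index_name`, `es_api_key`), the endpoint, and the dimension 1536 are assumptions for illustration; the authoritative field list is the one in the Unstructured documentation.

```python
import os


def build_index_mapping(embedding_dims):
    """Return an index body with a mapping suited to vector search.

    Field names are assumed; `dims` must match the embedding model's output size.
    """
    if embedding_dims <= 0:
        raise ValueError("embedding_dims must be a positive integer")
    return {
        "mappings": {
            "properties": {
                "element_id": {"type": "keyword"},
                "text": {"type": "text"},
                "embeddings": {
                    "type": "dense_vector",
                    "dims": embedding_dims,
                    "index": True,
                    "similarity": "cosine",
                },
            }
        }
    }


def build_elasticsearch_destination_connector(name, hosts, index_name, es_api_key):
    """Assemble a destination connector definition (assumed field names)."""
    return {
        "name": name,
        "type": "elasticsearch",
        "config": {"hosts": hosts, "index_name": index_name, "es_api_key": es_api_key},
    }


index_body = build_index_mapping(embedding_dims=1536)  # 1536 is an assumed size
destination_connector = build_elasticsearch_destination_connector(
    name="research-papers-es",                       # hypothetical name
    hosts=["https://example-deployment.es.io:443"],  # hypothetical endpoint
    index_name="research-papers",                    # hypothetical index
    es_api_key=os.environ.get("ES_API_KEY", "<elasticsearch-api-key>"),
)
```

A body shaped like `index_body` can be sent when creating the index, so the mapping is in place before the first workflow run uploads documents.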

Step 3: Creating a workflow with Unstructured

Once you have the source and destination connectors, you can create a new data processing workflow. We’ll build the workflow DAG with the following nodes:

  • NV-Ingest for document partitioning
  • Unstructured’s Image Summarizer, Table Summarizer, and Named Entity Recognition nodes for content enrichment
  • Chunker and Embedder nodes for making the content ready for similarity search
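The node list above can be sketched as a small DAG. The example below models the workflow as a linear chain between the two connectors and uses Python's standard `graphlib` module to confirm the execution order. The node identifiers and the assumption that the enrichment nodes run in the listed order are illustrative; this is not the Unstructured API's workflow format.

```python
from graphlib import TopologicalSorter

# Hypothetical node identifiers, in the order the data flows through them.
node_order = [
    "s3_source",
    "nv_ingest_partitioner",
    "image_summarizer",
    "table_summarizer",
    "named_entity_recognizer",
    "chunker",
    "embedder",
    "elasticsearch_destination",
]


def linear_dag(nodes):
    """Map each node to the set of nodes it depends on (its direct predecessor)."""
    return {
        node: ({nodes[i - 1]} if i > 0 else set())
        for i, node in enumerate(nodes)
    }


dag = linear_dag(node_order)

# A topological sort yields every node after all of its dependencies.
execution_order = list(TopologicalSorter(dag).static_order())
print(" -> ".join(execution_order))
```

Because each node depends only on the one before it, the topological order is unique and matches the order in which the nodes were declared.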

Once your job for this workflow completes, the data is uploaded into Elasticsearch and we can proceed with building a basic RAG application.

Step 4: RAG setup

Let's go ahead with a simple retriever that will connect to the data, take in the user query, embed it with the same model that was used to embed the original data, and calculate cosine similarity to retrieve the top 3 documents.
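The retrieval step described above can be illustrated with a toy example in plain Python. In the real pipeline Elasticsearch performs the similarity search and an embedding model produces the vectors; here the two-dimensional vectors, document IDs, and function names are made up purely to show how cosine similarity ranks documents and how the top 3 are selected.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # define similarity with a zero vector as 0
    return dot / (norm_a * norm_b)


def retrieve_top_k(query_vector, documents, k=3):
    """Return the k documents whose embeddings are most similar to the query."""
    scored = [(cosine_similarity(query_vector, d["embeddings"]), d) for d in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:k]]


# Toy corpus with made-up 2-D embeddings.
documents = [
    {"id": "a", "embeddings": [1.0, 0.0]},
    {"id": "b", "embeddings": [0.9, 0.1]},
    {"id": "c", "embeddings": [0.0, 1.0]},
    {"id": "d", "embeddings": [0.5, 0.5]},
]

query_vector = [1.0, 0.0]  # stands in for the embedded user query
top_documents = retrieve_top_k(query_vector, documents, k=3)
print([doc["id"] for doc in top_documents])
```

The query must be embedded with the same model as the documents; otherwise the vectors live in different spaces and the similarity scores are meaningless.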

Then let's set up a workflow to receive a user query, fetch similar documents from Elasticsearch, and use the documents as context to answer the user’s question.
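A minimal sketch of that flow is shown below. The retriever and the LLM are passed in as plain callables so the example runs without Elasticsearch or a model provider; the function names, the prompt wording, and the stand-in implementations are assumptions for illustration, not the notebook's code.

```python
def build_prompt(question, documents):
    """Format retrieved documents as numbered context ahead of the question."""
    context = "\n\n".join(
        f"[{i}] {doc['text']}" for i, doc in enumerate(documents, start=1)
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


def answer_question(question, retrieve, generate):
    """Fetch similar documents, build a prompt from them, and ask the LLM."""
    documents = retrieve(question)
    prompt = build_prompt(question, documents)
    return generate(prompt)


# Stand-ins so the sketch runs on its own.
def fake_retrieve(question):
    return [
        {"text": "NV-Ingest parses documents."},
        {"text": "Elasticsearch stores embeddings."},
    ]


def echo_generate(prompt):
    return prompt  # a real implementation would call an LLM here


result = answer_question("What does NV-Ingest do?", fake_retrieve, echo_generate)
print(result)
```

Injecting `retrieve` and `generate` keeps the orchestration logic independent of any particular vector store or model, which also makes it easy to test.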

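Putting everything together, we get an application that takes a user question, retrieves the most similar chunks from Elasticsearch, and returns an answer grounded in that context.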

Elasticsearch provides various strategies to enhance search, including hybrid search, which combines approximate semantic search with keyword-based search.

This approach can improve the relevance of the top documents used as context in the RAG architecture. To enable it, modify the vector_store initialization so that it uses hybrid retrieval.
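To illustrate the idea behind hybrid search, the sketch below merges a semantic ranking and a keyword ranking with reciprocal rank fusion (RRF), a rank-combination method that Elasticsearch supports. It is a standalone toy in plain Python, not the vector_store configuration itself; the function name, the document IDs, and the constant k = 60 are assumptions for illustration.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one list.

    Each document scores 1 / (k + rank) for every list it appears in,
    so documents ranked well by both retrievers rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score; break ties alphabetically for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


semantic_ranking = ["a", "b", "c"]  # made-up vector search results
keyword_ranking = ["b", "d", "a"]   # made-up keyword search results

fused = reciprocal_rank_fusion([semantic_ranking, keyword_ranking])
print(fused)
```

Documents "a" and "b" appear in both lists and therefore outrank "c" and "d", which each appear in only one.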

Conclusion

Good RAG starts with well-prepared data, and Unstructured simplifies this critical first step. By enabling partitioning with NV-Ingest, metadata enrichment of unstructured data and efficient ingestion into Elasticsearch, it ensures that your RAG pipeline is built on a solid foundation, unlocking its full potential for all your downstream tasks.