LangExtract is an open-source Python library created by Google that helps transform unstructured text into structured information using multiple LLMs and custom instructions. Unlike using an LLM alone, LangExtract provides structured and traceable outputs, links each extraction back to the original text, and offers visual tools for validation, making it a practical solution for information extraction in different contexts.
LangExtract is useful when you want to transform unstructured data, such as contracts, invoices, or books, into a defined structure, making it searchable and filterable. For example, you can classify the expenses on an invoice, extract the parties in a contract, or detect the sentiment of characters in a given paragraph of a book.
LangExtract also offers features such as long context handling, remote file loading, multiple passes to improve recall, and multiple workers to parallelize the work.
Use case
To demonstrate how LangExtract and Elasticsearch work together, we will use a dataset of 10 contracts of different types. These contracts contain standard data such as costs, amounts, dates, duration, and contractor. We will use LangExtract to extract structured data from the contracts and store it as fields in Elasticsearch, allowing us to run queries and filters against them.
You can find the full notebook here.
Steps
- Installing dependencies and importing packages
- Setting up Elasticsearch
- Extracting data with LangExtract
- Querying data
Installing dependencies and importing packages
We need to install LangExtract to process the contracts and extract structured data from them, as well as the elasticsearch client to handle Elasticsearch requests.
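A minimal install cell for a notebook environment could look like this (package names assumed to be the PyPI names of both libraries):

```python
# Install LangExtract and the official Elasticsearch Python client
%pip install langextract elasticsearch
```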
With the dependencies installed, let’s import the following:
- json – which helps to handle JSON data.
- os – to access local environment variables.
- glob – used to search files in directories based on a pattern.
- google.colab – useful in Google Colab notebooks to load locally stored files.
- helpers – which provides extra Elasticsearch utilities, for example, to insert or update multiple documents in bulk.
- IPython.display.HTML – which allows you to render HTML content directly inside a notebook, making outputs more readable.
- getpass – used to securely input sensitive information, such as passwords or API keys, without displaying them on the screen.
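Putting those together, a minimal sketch of the import cell (the notebook may organize these slightly differently):

```python
import glob
import json
import os
from getpass import getpass

import langextract as lx
from elasticsearch import Elasticsearch, helpers
from google.colab import files
from IPython.display import HTML
```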
Setting up Elasticsearch
Set up keys
We now need to set some variables before developing the app. We will use Gemini as our model. Here you can learn how to obtain an API key from Google AI Studio. Also, make sure you have an Elasticsearch API key available.
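A sketch of this step, reading both keys with getpass so they are never echoed to the output; the variable and environment names below are illustrative assumptions:

```python
# LangExtract reads the Gemini key from an environment variable
os.environ["LANGEXTRACT_API_KEY"] = getpass("Gemini API key: ")

# Elasticsearch connection details (names are assumptions)
ELASTICSEARCH_ENDPOINT = getpass("Elasticsearch endpoint: ")
ELASTICSEARCH_API_KEY = getpass("Elasticsearch API key: ")
```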
Elasticsearch client
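With the endpoint and API key collected above, the client can be created like this (a minimal sketch using the official Python client):

```python
# Create the Elasticsearch client
es_client = Elasticsearch(
    ELASTICSEARCH_ENDPOINT,
    api_key=ELASTICSEARCH_API_KEY,
)

print(es_client.info())  # quick connectivity check
```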
Index mappings
Let’s define the Elasticsearch mappings for the fields we are going to extract with LangExtract. Note that we use keyword for the fields we want to use exclusively for filters, and text + keyword for the ones we plan to search and filter on.
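As a sketch, the index could be created along these lines. The field names, and the date/numeric types used for the fields we later filter by range, are assumptions, so check the notebook for the exact mapping:

```python
INDEX_NAME = "contracts"

# Field names and types below are illustrative assumptions
es_client.indices.create(
    index=INDEX_NAME,
    mappings={
        "properties": {
            "contract_type": {"type": "keyword"},  # filter only
            "contractor": {  # search + filter
                "type": "text",
                "fields": {"keyword": {"type": "keyword"}},
            },
            "client_name": {
                "type": "text",
                "fields": {"keyword": {"type": "keyword"}},
            },
            "contract_date": {"type": "date", "format": "dd/MM/yyyy"},
            "end_date": {"type": "date", "format": "dd/MM/yyyy"},
            "duration_days": {"type": "integer"},
            "payment_amount": {"type": "float"},
        }
    },
)
```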
Extracting data with LangExtract
Providing examples
The extraction code defines a few-shot example for LangExtract that shows how to extract specific information from contracts. The contract_examples variable contains an ExampleData object that includes:
- Example text: A sample contract with typical information like dates, parties, services, payments, etc.
- Expected extractions: A list of extraction objects that map each piece of information from the text to a specific class (extraction_class) and its normalized value (extraction_text). The extraction_class will be the field name, and the extraction_text will be the value of that field.
For example, the date "March 10, 2024" from the text is extracted as class contract_date (field name) with normalized value "03/10/2024" (field value). The model learns from these patterns to extract similar information from new contracts.
The contract_prompt_description provides additional context about what to extract and in what order, complementing what the examples alone cannot express.
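A condensed sketch of what this setup could look like; the exact prompt wording, sample text, and extraction classes in the notebook may differ, so treat the names below as assumptions:

```python
contract_prompt_description = (
    "Extract the contract type, parties, dates, duration, and payment "
    "amounts in order of appearance. Normalize dates and convert "
    "durations in months to days."
)

contract_examples = [
    lx.data.ExampleData(
        # Shortened sample contract; the notebook uses a longer text
        text=(
            "This agreement, signed on March 10, 2024, between Acme Corp "
            "and John Doe, covers consulting services for 6 months for a "
            "total payment of $20,000."
        ),
        extractions=[
            # Class names besides contract_date are illustrative assumptions
            lx.data.Extraction(extraction_class="contract_date", extraction_text="03/10/2024"),
            lx.data.Extraction(extraction_class="contractor", extraction_text="Acme Corp"),
            lx.data.Extraction(extraction_class="client_name", extraction_text="John Doe"),
            lx.data.Extraction(extraction_class="duration_days", extraction_text="180"),
            lx.data.Extraction(extraction_class="payment_amount", extraction_text="20000"),
        ],
    )
]
```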
Dataset
You can find the entire dataset here. Below is an example of what the contracts look like:
Some data is explicitly written in the document, but other values can be inferred and converted by the model. For example, dates will be formatted as dd/MM/yyyy, and duration in months will be converted to days.
Running extraction
In a Colab notebook, you can load the files with:
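A sketch of that step, assuming the contracts are plain-text files uploaded to the Colab session:

```python
# Upload the contract files to the Colab session
uploaded = files.upload()

# Collect the uploaded contracts (the *.txt pattern is an assumption)
contract_files = glob.glob("*.txt")
print(f"Loaded {len(contract_files)} contracts")
```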
LangExtract extracts fields and values with the lx.extract function. It must be called for each contract, passing the content, prompt, examples, and model ID.
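A minimal sketch of that loop; the model ID and variable names are assumptions:

```python
extraction_results = []

for file_path in contract_files:
    with open(file_path, "r") as f:
        contract_text = f.read()

    # Run LangExtract on each contract with the prompt and few-shot examples
    result = lx.extract(
        text_or_documents=contract_text,
        prompt_description=contract_prompt_description,
        examples=contract_examples,
        model_id="gemini-2.5-flash",  # assumed Gemini model ID
    )
    extraction_results.append(result)
```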
To better understand the extraction process, we can save the extraction results as an NDJSON file:
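A sketch using LangExtract's I/O and visualization helpers; the file name is arbitrary:

```python
NDJSON_FILE = "extraction_results.jsonl"

# Persist the annotated documents so they can be inspected and visualized
lx.io.save_annotated_documents(extraction_results, output_name=NDJSON_FILE, output_dir=".")

# Render an interactive HTML view of the extractions
html_content = lx.visualize(NDJSON_FILE)
HTML(html_content.data if hasattr(html_content, "data") else html_content)
```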
The line lx.visualize(NDJSON_FILE) generates an HTML visualization with the references from a single document, where you can see the specific lines where the data was extracted.

The extracted data from one contract will look like this:
Based on this result, we will index the data into Elasticsearch and query it.
Querying data
Indexing data to Elasticsearch
We use the _bulk API to ingest the data into the contracts index. We are going to store each of the extraction_class results as new fields, and the extraction_text as the values of those fields.
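A sketch of the bulk ingestion, assuming each result exposes an extractions list of Extraction objects:

```python
actions = []

for result in extraction_results:
    # Flatten each document's extractions into a field -> value dictionary
    doc = {e.extraction_class: e.extraction_text for e in result.extractions}
    actions.append({"_index": INDEX_NAME, "_source": doc})

# Index all contracts in a single bulk request
helpers.bulk(es_client, actions)
```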
With that, we are ready to start writing queries:
Querying data
Now, let’s query contracts that have expired and have a payment amount greater than or equal to 15,000.
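A sketch of that query, using the assumed end_date and payment_amount fields from the mapping sketch above; "expired" here means the end date is before today:

```python
response = es_client.search(
    index=INDEX_NAME,
    query={
        "bool": {
            "filter": [
                {"range": {"end_date": {"lt": "now/d"}}},
                {"range": {"payment_amount": {"gte": 15000}}},
            ]
        }
    },
)

for hit in response["hits"]["hits"]:
    print(hit["_source"])
```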
And here are the results:
Conclusion
LangExtract makes it easier to extract structured information from unstructured documents, with clear mappings and traceability back to the source text. Combined with Elasticsearch, this data can be indexed and queried, enabling filters and searches over contract fields like dates, payment amounts, and parties.
In our example, we kept the dataset simple, but the same flow can scale to larger collections of documents or different domains such as legal, financial, or medical text. You can also experiment with more extraction examples, custom prompts, or additional post-processing to refine the results for your specific use case.