LangExtract is an open-source Python library created by Google that helps transform unstructured text into structured information using multiple LLMs and custom instructions. Unlike using an LLM alone, LangExtract provides structured and traceable outputs, links each extraction back to the original text, and offers visual tools for validation, making it a practical solution for information extraction in different contexts.
LangExtract is useful when you want to transform unstructured data, such as contracts, invoices, or books, into a defined structure, making it searchable and filterable. For example, you can classify the expenses on an invoice, extract the parties in a contract, or detect the sentiment of characters in a given paragraph of a book.
LangExtract also offers features such as long context handling, remote file loading, multiple passes to improve recall, and multiple workers to parallelize the work.
Use case
To demonstrate how LangExtract and Elasticsearch work together, we will use a dataset of 10 contracts of different types. These contracts contain standard data such as costs, amounts, dates, duration, and contractor. We will use LangExtract to extract structured data from the contracts and store it as fields in Elasticsearch, allowing us to run queries and filters against them.
You can find the full notebook here.
Steps
- Installing dependencies and importing packages
- Setting up Elasticsearch
- Extracting data with LangExtract
- Querying data
Installing dependencies and importing packages
We need to install LangExtract to process the contracts and extract structured data from them, as well as the elasticsearch client to handle Elasticsearch requests.
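A minimal install cell for a notebook environment could look like this (package names assumed to be the PyPI names of both libraries):

```python
# Install LangExtract and the official Elasticsearch Python client
%pip install langextract elasticsearch
```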
With the dependencies installed, let’s import the following:
- json – which helps to handle JSON data.
- os – to access local environment variables.
- glob – used to search files in directories based on a pattern.
- google.colab – useful in Google Colab notebooks to load locally stored files.
- helpers – which provides extra Elasticsearch utilities, for example, to insert or update multiple documents in bulk.
- IPython.display.HTML – which allows you to render HTML content directly inside a notebook, making outputs more readable.
- getpass – used to securely input sensitive information, such as passwords or API keys, without displaying them on the screen.
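Putting those together, a minimal sketch of the import cell (the notebook may organize these slightly differently):

```python
import glob
import json
import os
from getpass import getpass

import langextract as lx
from elasticsearch import Elasticsearch, helpers
from google.colab import files
from IPython.display import HTML
```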
Setting up Elasticsearch
Set up keys
We now need to set some variables before developing the app. We will use Gemini as our model. Here you can learn how to obtain an API key from Google AI Studio. Also, make sure you have an Elasticsearch API key available.
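A sketch of this step, reading both keys with getpass so they are never echoed to the output; the variable and environment names below are illustrative assumptions:

```python
# LangExtract reads the Gemini key from an environment variable
os.environ["LANGEXTRACT_API_KEY"] = getpass("Gemini API key: ")

# Elasticsearch connection details (names are assumptions)
ELASTICSEARCH_ENDPOINT = getpass("Elasticsearch endpoint: ")
ELASTICSEARCH_API_KEY = getpass("Elasticsearch API key: ")
```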
Elasticsearch client
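With the endpoint and API key collected above, the client can be created like this (a minimal sketch using the official Python client):

```python
# Create the Elasticsearch client
es_client = Elasticsearch(
    ELASTICSEARCH_ENDPOINT,
    api_key=ELASTICSEARCH_API_KEY,
)

print(es_client.info())  # quick connectivity check
```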
Index mappings
Let’s define the Elasticsearch mappings for the fields we are going to extract with LangExtract. Note that we use keyword for the fields we want to use exclusively for filters, and text + keyword for the ones we plan to search and filter on.
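As a sketch, the index could be created along these lines. The field names, and the date/numeric types used for the fields we later filter by range, are assumptions, so check the notebook for the exact mapping:

```python
INDEX_NAME = "contracts"

# Field names and types below are illustrative assumptions
es_client.indices.create(
    index=INDEX_NAME,
    mappings={
        "properties": {
            "contract_type": {"type": "keyword"},  # filter only
            "contractor": {  # search + filter
                "type": "text",
                "fields": {"keyword": {"type": "keyword"}},
            },
            "client_name": {
                "type": "text",
                "fields": {"keyword": {"type": "keyword"}},
            },
            "contract_date": {"type": "date", "format": "dd/MM/yyyy"},
            "end_date": {"type": "date", "format": "dd/MM/yyyy"},
            "duration_days": {"type": "integer"},
            "payment_amount": {"type": "float"},
        }
    },
)
```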
Extracting data with LangExtract
Providing examples
The extraction code defines a few-shot example for LangExtract that shows how to extract specific information from contracts. The contract_examples variable contains an ExampleData object that includes:
- Example text: A sample contract with typical information like dates, parties, services, payments, etc.
- Expected extractions: A list of extraction objects that map each piece of information from the text to a specific class (extraction_class) and its normalized value (extraction_text). The extraction_class will be the field name, and the extraction_text will be the value of that field.
For example, the date "March 10, 2024" from the text is extracted as class contract_date (field name) with normalized value "03/10/2024" (field value). The model learns from these patterns to extract similar information from new contracts.
The contract_prompt_description provides additional context about what to extract and in what order, complementing what the examples alone cannot express.
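A condensed sketch of what this setup could look like; the exact prompt wording, sample text, and extraction classes in the notebook may differ, so treat the names below as assumptions:

```python
contract_prompt_description = (
    "Extract the contract type, parties, dates, duration, and payment "
    "amounts in order of appearance. Normalize dates and convert "
    "durations in months to days."
)

contract_examples = [
    lx.data.ExampleData(
        # Shortened sample contract; the notebook uses a longer text
        text=(
            "This agreement, signed on March 10, 2024, between Acme Corp "
            "and John Doe, covers consulting services for 6 months for a "
            "total payment of $20,000."
        ),
        extractions=[
            # Class names besides contract_date are illustrative assumptions
            lx.data.Extraction(extraction_class="contract_date", extraction_text="03/10/2024"),
            lx.data.Extraction(extraction_class="contractor", extraction_text="Acme Corp"),
            lx.data.Extraction(extraction_class="client_name", extraction_text="John Doe"),
            lx.data.Extraction(extraction_class="duration_days", extraction_text="180"),
            lx.data.Extraction(extraction_class="payment_amount", extraction_text="20000"),
        ],
    )
]
```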
Dataset
You can find the entire dataset here. Below is an example of what the contracts look like:
Some data is explicitly written in the document, but other values can be inferred and converted by the model. For example, dates will be formatted as dd/MM/yyyy, and duration in months will be converted to days.
Running extraction
In a Colab notebook, you can load the files with:
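A sketch of that step, assuming the contracts are plain-text files uploaded to the Colab session:

```python
# Upload the contract files to the Colab session
uploaded = files.upload()

# Collect the uploaded contracts (the *.txt pattern is an assumption)
contract_files = glob.glob("*.txt")
print(f"Loaded {len(contract_files)} contracts")
```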
LangExtract extracts fields and values with the lx.extract function. It must be called for each contract, passing the content, prompt, examples, and model ID.
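A minimal sketch of that loop; the model ID and variable names are assumptions:

```python
extraction_results = []

for file_path in contract_files:
    with open(file_path, "r") as f:
        contract_text = f.read()

    # Run LangExtract on each contract with the prompt and few-shot examples
    result = lx.extract(
        text_or_documents=contract_text,
        prompt_description=contract_prompt_description,
        examples=contract_examples,
        model_id="gemini-2.5-flash",  # assumed Gemini model ID
    )
    extraction_results.append(result)
```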
To better understand the extraction process, we can save the extraction results as an NDJSON file:
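A sketch using LangExtract's I/O and visualization helpers; the file name is arbitrary:

```python
NDJSON_FILE = "extraction_results.jsonl"

# Persist the annotated documents so they can be inspected and visualized
lx.io.save_annotated_documents(extraction_results, output_name=NDJSON_FILE, output_dir=".")

# Render an interactive HTML view of the extractions
html_content = lx.visualize(NDJSON_FILE)
HTML(html_content.data if hasattr(html_content, "data") else html_content)
```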
The line lx.visualize(NDJSON_FILE) generates an HTML visualization with the references from a single document, where you can see the specific lines where the data was extracted.

The extracted data from one contract will look like this:
Based on this result, we will index the data into Elasticsearch and query it.
Querying data
Indexing data to Elasticsearch
We use the _bulk API to ingest the data into the contracts index. We are going to store each of the extraction_class results as new fields, and the extraction_text as the values of those fields.
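A sketch of the bulk ingestion, assuming each result exposes an extractions list of Extraction objects:

```python
actions = []

for result in extraction_results:
    # Flatten each document's extractions into a field -> value dictionary
    doc = {e.extraction_class: e.extraction_text for e in result.extractions}
    actions.append({"_index": INDEX_NAME, "_source": doc})

# Index all contracts in a single bulk request
helpers.bulk(es_client, actions)
```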
With that, we are ready to start writing queries:
Querying data
Now, let’s query contracts that have expired and have a payment amount greater than or equal to 15,000.
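A sketch of that query, using the assumed end_date and payment_amount fields from the mapping sketch above; "expired" here means the end date is before today:

```python
response = es_client.search(
    index=INDEX_NAME,
    query={
        "bool": {
            "filter": [
                {"range": {"end_date": {"lt": "now/d"}}},
                {"range": {"payment_amount": {"gte": 15000}}},
            ]
        }
    },
)

for hit in response["hits"]["hits"]:
    print(hit["_source"])
```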
And here are the results:
Conclusion
LangExtract makes it easier to extract structured information from unstructured documents, with clear mappings and traceability back to the source text. Combined with Elasticsearch, this data can be indexed and queried, enabling filters and searches over contract fields like dates, payment amounts, and parties.
In our example, we kept the dataset simple, but the same flow can scale to larger collections of documents or different domains such as legal, financial, or medical text. You can also experiment with more extraction examples, custom prompts, or additional post-processing to refine the results for your specific use case.