Imagine this scenario: A customer applies for a $50,000 loan under "Katherine Johnson" on Monday, then submits another application as "Kate Johnson" on Wednesday. Same person, same address, but your system sees them as different applicants. The result? Double approval, regulatory violation, and significant financial loss.
This challenge goes beyond simple typos. It encompasses sophisticated fraud schemes, genuine user errors, and the complex reality of how people actually fill out forms. When your duplicate detection fails, you're not just losing money—you're violating regulations, damaging customer trust, and creating operational chaos.
Let’s build a solution that can catch these duplicates while processing thousands of applications on the backend. Using Elasticsearch's phonetic search capabilities and an LLM, we'll create a system that's both powerful and practical.
The hidden complexity of duplicate detection
The challenge of duplicate detection runs deeper than most organizations realize. Consider these real-world scenarios that traditional systems struggle with:
Name variations that fool systems:
- Classic misspellings: "John Smith" vs "Jon Smith" vs "Jonathon Smith" – Is it a typo or a different person?
- Phonetic confusion: "Shawn" vs "Shaun" vs "Sean" vs "Shon" – Same pronunciation, different spellings
- Nicknames and variations: "Alexander" vs "Alex" vs "Xander" – One person, multiple identities
Address challenges that create blind spots:
- Street abbreviations: "123 Maple Street" vs "123 Maple St" vs "123 Maple Avenue" – Same location, different formats
- Apartment variations: "Unit 5, 42 Elm Road" vs "42 Elm Rd, Apt 5" – Same address, different structure
- City name confusion: "Los Angeles" vs "LA" – Geographic variations that systems miss
Family relationships that complicate detection:
- Similar names, same address: "Bob Brown" and "Rob Brown" at "789 Pine Rd" – They're twins, not duplicates
- Generational suffixes: "James Carter Sr." vs "James Carter Jr." – A single letter difference that changes everything
Without a reliable way to flag such duplicate records, organizations might inadvertently approve multiple loans or insurance policies for the same customer, violating eligibility criteria, increasing the risk of defaults, and causing revenue loss.
The solution
By combining Elasticsearch with modern AI models, we can build a smart, scalable, and cost-effective solution to identify and remove duplicate records.
Phonetic search for names:
Elasticsearch supports phonetic algorithms that help find names that sound alike, even if they’re spelled differently. For example, names like “Smith” and “Smyth” are treated the same because they’re pronounced the same way. This allows the system to catch duplicates that a basic text match would miss. You can think of it as teaching the search engine to “listen” to names the way people do—so “John” and “Jon” are understood as the same.
Handling address variations in searches:
User-provided addresses often vary in format. We use an AI model to generate different forms or synonyms of an address—like “Syd.” and “Sydney”, “Bengaluru” and “Bangalore”—and use these variations to make our Elasticsearch queries more effective. This helps match addresses even when the user input doesn’t exactly match what’s stored in the system.
Using AI for deduplication checks:
Once we retrieve possible matches from Elasticsearch, we pass them to an AI model that checks for duplicates. While we could use algorithms like Levenshtein or Jaro-Winkler instead of an AI model, things get complicated when you add more fields like date of birth, national ID, or phone numbers. The model also brings flexibility and simplifies this logic by looking at the data holistically, making it easier to identify true duplicates across multiple fields.
Architecture overview
Here's how our solution works at a high level:

Try the experience yourself!
Pre-requisites and setup
Before we dive into implementation, let's make sure we have everything we need.
Required infrastructure
- Elasticsearch cluster – You will need access to an Elasticsearch cluster. For this setup, I used Elasticsearch Cloud Hosted version 9.0.0. If you do not have a cluster ready, you have a couple of options:
- Elastic Cloud - You can create a new cluster here and choose between an Elastic Cloud Hosted or an Elasticsearch Serverless option.
- Local Setup - If you prefer running it locally, you can spin up a cluster using the provided script here.
- Phonetic analysis plugin – To support phonetic name matching, make sure the Phonetic Analysis Plugin is enabled in your Elasticsearch setup.
- Ollama LLM server – Since we are dealing with sensitive details like names, addresses, and dates of birth, we recommend running a local LLM. We can run something lightweight like Llama 3 8B using Ollama. It’s fast, runs locally, and works well for this kind of data handling.
- To get started, download and install the Ollama version compatible with your operating system from here.
- Once installed, run `ollama run llama3:8b` to pull and run the model.
- Sample dataset – To test the setup and mimic real-world cases, I’ve prepared a small dataset with variations in names, addresses, and other subtle differences. You can download the 101-record sample dataset from this link.
A screenshot of the sample dataset is shown below:

Development environment:
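The article installs its dependencies up front. A typical set of install commands for this stack would look like the following; the exact package versions are not pinned here, so treat this as an assumption:

```bash
# Python libraries used throughout the walkthrough
pip install elasticsearch pandas langchain-community streamlit

# localtunnel is a Node.js CLI used to expose the local Streamlit app
npm install -g localtunnel
```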
These installations provide:
- elasticsearch: Client library for Elasticsearch connectivity
- pandas: Data processing and CSV handling
- langchain-community: LLM integration for AI analysis (used here with Ollama)
- streamlit: Interactive web interface
- localtunnel: Local development exposure
Step 1: Connecting to Elasticsearch
We need the Elasticsearch endpoint and API Key for authentication.
Getting your Elasticsearch endpoint:
- Log into Elastic Cloud
- Navigate to your deployment
- Copy the Elasticsearch endpoint from the deployment overview
Creating an API key:
- Open Kibana from your deployment
- Navigate to Stack Management → API Keys
- Create a new API key and store it securely
Once you have your credentials, establish the connection.
Save and execute the code below in a file named “es_connect.py”. Remember to include values for ES_URL and API_KEY.
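Here is a minimal sketch of such a connection script, assuming the standard elasticsearch Python client; replace the ES_URL and API_KEY placeholder values with your own endpoint and key.

```python
# es_connect.py -- minimal sketch of the Elasticsearch connection.
from elasticsearch import Elasticsearch

ES_URL = "https://your-deployment.es.example.cloud:443"  # your deployment endpoint
API_KEY = "your-api-key"                                  # the API key created in Kibana

# Create the client using API-key authentication
es = Elasticsearch(ES_URL, api_key=API_KEY)

if __name__ == "__main__":
    # Quick sanity check: print basic cluster info
    print(es.info())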
Step 2: Creating the index template
The heart of our duplicate detection system lies in the index configuration, especially in how we tell Elasticsearch to generate phonetic codes from names. We'll create an index template that supports both phonetic name matching and best-match queries on the address.
Understanding phonetic analyzers:
Our template uses two phonetic algorithms:
- Double Metaphone: Handles complex phonetic variations and works well with diverse names
- Soundex: Provides consistent coding for similar-sounding names
Here's our complete index template.
Save and execute the code below into a file named “create_index_template.py”
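Below is a sketch of what such a template could look like. The template name, index pattern, and field names (name, address, dob) are illustrative assumptions rather than the article's exact schema; the phonetic token filters require the analysis-phonetic plugin mentioned earlier.

```python
# create_index_template.py -- sketch of an index template with phonetic analyzers.
from es_connect import es  # reuse the client from Step 1

es.indices.put_index_template(
    name="applications-template",
    index_patterns=["applications*"],
    template={
        "settings": {
            "analysis": {
                "filter": {
                    # Both filters require the analysis-phonetic plugin on the cluster
                    "dm_filter": {
                        "type": "phonetic",
                        "encoder": "double_metaphone",
                        "replace": False,
                    },
                    "soundex_filter": {
                        "type": "phonetic",
                        "encoder": "soundex",
                        "replace": False,
                    },
                },
                "analyzer": {
                    "dm_analyzer": {
                        "tokenizer": "standard",
                        "filter": ["lowercase", "dm_filter"],
                    },
                    "soundex_analyzer": {
                        "tokenizer": "standard",
                        "filter": ["lowercase", "soundex_filter"],
                    },
                },
            }
        },
        "mappings": {
            "properties": {
                # Name is analyzed three ways: plain text, Double Metaphone, Soundex
                "name": {
                    "type": "text",
                    "fields": {
                        "dm": {"type": "text", "analyzer": "dm_analyzer"},
                        "soundex": {"type": "text", "analyzer": "soundex_analyzer"},
                    },
                },
                # Address supports full-text search plus exact matching via keyword
                "address": {
                    "type": "text",
                    "fields": {"keyword": {"type": "keyword"}},
                },
                "dob": {"type": "keyword"},
            }
        },
    },
)
print("Index template created.")
```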
What this template does
- Phonetic processing: Names are automatically converted to phonetic codes during indexing
- Multi-field analysis: Each name is analyzed with both Double Metaphone and Soundex
- Address optimization: Addresses are indexed for both full-text and exact matching
- Flexible matching: The template supports various search strategies for different use cases
Step 3: Loading and indexing data
Now, let's load our sample dataset and index it into Elasticsearch for searching. Save and execute the following code into a file named “index_csv_data.py”
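A sketch of this indexing script is below, using pandas and the bulk helper. The CSV file name, index name, and column layout are assumptions based on the dataset description above; adjust them to match your download.

```python
# index_csv_data.py -- sketch of loading the sample CSV into Elasticsearch.
import pandas as pd
from elasticsearch.helpers import bulk
from es_connect import es

INDEX_NAME = "applications"
CSV_PATH = "sample_applications.csv"  # the 101-record sample dataset

def generate_actions(df):
    """Yield one bulk-index action per CSV row."""
    for _, row in df.iterrows():
        yield {
            "_index": INDEX_NAME,
            "_source": row.dropna().to_dict(),
        }

if __name__ == "__main__":
    df = pd.read_csv(CSV_PATH)
    success, _ = bulk(es, generate_actions(df))
    print(f"Indexed {success} documents into '{INDEX_NAME}'.")
```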
Step 4: Initiate the Llama model and generate address variations
We take the address entered by the end user and make an LLM call to generate variations of it, which helps handle formatting nuances.
For instance, if the user enters "123 Maple St., Syd," the model will generate keywords such as ["123 Maple St., Sydney","Street","Str","Sydnei","Syd"].
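A sketch of this step is below, using ChatOllama from langchain-community with the llama3:8b model pulled earlier. The prompt wording and the JSON-array output handling are assumptions; the real prompt in the article may differ.

```python
# Sketch: generate address variations with a local Llama model via Ollama.
import json
from langchain_community.chat_models import ChatOllama

# Assumes the llama3:8b model is available locally through Ollama
llm = ChatOllama(model="llama3:8b", temperature=0)

def generate_address_variations(address: str) -> list[str]:
    """Ask the model for common variants and abbreviations of the given address."""
    prompt = (
        "Generate common variations, abbreviations, and expansions of this address. "
        "Return only a JSON array of strings, nothing else.\n"
        f"Address: {address}"
    )
    response = llm.invoke(prompt)
    try:
        return json.loads(response.content)
    except json.JSONDecodeError:
        return [address]  # fall back to the original input if parsing fails

# Example: generate_address_variations("123 Maple St., Syd")
```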
Step 5: Build the final search query
The search query is built using the name and the address variations generated above.
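A sketch of such a query builder is below. The field names (name, name.dm, name.soundex, address) follow the template sketch from Step 2 and are assumptions, not the article's exact schema.

```python
# Sketch: build the final Elasticsearch query from the name and address variations.
def build_search_query(name: str, address_variations: list[str]) -> dict:
    return {
        "size": 10,
        "query": {
            "bool": {
                # The name must match on the plain or phonetic fields
                "must": [
                    {
                        "multi_match": {
                            "query": name,
                            "fields": ["name", "name.dm", "name.soundex"],
                        }
                    }
                ],
                # Any matching address variation boosts the score
                "should": [
                    {"match": {"address": variation}}
                    for variation in address_variations
                ],
            }
        },
    }

# Example usage against the index from Step 3:
# q = build_search_query("Katherine Johnson", variations)
# results = es.search(index="applications", query=q["query"], size=q["size"])
```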
Step 6: Check for duplicates
The above search query would find potential matches. Later, these names and addresses will be supplied to the model as context. Using the function below, we will prompt the model to calculate the probability of them being duplicates.
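A sketch of such a function is shown here, reusing the llm client from Step 4. The prompt wording and the free-text output format are assumptions; a production version might ask for structured JSON instead.

```python
# Sketch: ask the LLM to rate how likely each candidate is a duplicate of the new application.
def check_duplicates(new_record: dict, candidates: list[dict]) -> str:
    """Pass the new application and the Elasticsearch candidates to the model for analysis."""
    prompt = (
        "You are a deduplication assistant. Given a new application and a list of existing "
        "records, estimate the probability (0-100%) that each record is the same person, "
        "and briefly explain why. Consider names, addresses, and any other fields holistically.\n\n"
        f"New application: {new_record}\n\n"
        f"Existing records: {candidates}"
    )
    return llm.invoke(prompt).content

# Example:
# candidates = [hit["_source"] for hit in results["hits"]["hits"]]
# analysis = check_duplicates({"name": name, "address": address}, candidates)
```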
Step 7: Creating the Streamlit interface
Now, let's use the Streamlit code below to create a clean, intuitive interface.
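A minimal sketch of such an interface is below. It wires together the helper functions from the earlier sketches (generate_address_variations, build_search_query, check_duplicates) and assumes everything is consolidated into a single app.py, as described in the next step; the article's actual UI includes a sorted results table.

```python
# Sketch of a minimal Streamlit UI tying the pieces together.
import streamlit as st

st.title("Duplicate Application Checker")

name = st.text_input("Applicant name")
address = st.text_input("Applicant address")

if st.button("Check for duplicates") and name and address:
    # Generate address variations, query Elasticsearch, then ask the LLM to analyze the hits
    variations = generate_address_variations(address)
    q = build_search_query(name, variations)
    results = es.search(index="applications", query=q["query"], size=q["size"])
    candidates = [hit["_source"] for hit in results["hits"]["hits"]]

    if candidates:
        st.subheader("Potential duplicates")
        st.write(check_duplicates({"name": name, "address": address}, candidates))
    else:
        st.info("No potential duplicates found.")
```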
Step 8: Execute and test the system
To optimize performance by preventing repeated model reloads and excessive connection openings, the code from Steps 4, 5, 6, and 7 will be consolidated into a single file named app.py. This integrated file will then be used to launch the Streamlit UI.
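Assuming the consolidated file is named app.py, launching the UI and optionally exposing it with localtunnel (Streamlit's default port is 8501) would look something like this:

```bash
# Launch the Streamlit app locally
streamlit run app.py

# Optional: expose the local app to the internet via localtunnel
npx localtunnel --port 8501
```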
After execution, a UI is generated where you can enter a name and an address. The results, displayed in a table, are sorted by match percentage and include an explanation for potential duplicates, as shown in the screenshot below.

Use cases beyond loans and insurance applications
Deduplication has wide-ranging applications across various industries and domains. Here are some key examples:
- Government and public services - Flagging duplicate voter registrations, tax records, social security applications, or public welfare program registrations.
- Customer Relationship Management (CRM) - Identifying duplicate customer records in CRM databases to improve data quality and avoid redundancy.
- Healthcare systems - Detecting duplicate patient records in hospital management systems to ensure accurate medical history and billing.
- E-commerce platforms - Identifying duplicate product listings or seller profiles to maintain a clean catalog and enhance the user experience.
- Real estate and property management - Spotting duplicate listings of properties or tenants in property management systems.
Conclusion: Elasticsearch for deduplication pipeline
What we've built here demonstrates how combining Elasticsearch's phonetic capabilities with local LLM processing creates a robust deduplication pipeline that addresses real-world complexity.
Firstly, we prepared the cluster with the required dataset and hosted a local model. Then, to find a similar match, we queried Elasticsearch using key entities like name, address, and address variations. Later, the Elastic response was passed as context to the model for duplication analysis. Based on the instructions, the model decided which record from Elastic was a possible duplicate.
Remember that duplicate detection is not a one-time project–it's an ongoing process that improves with experience. The AI components learn from feedback, the search algorithms can be refined based on results, and the system becomes more accurate over time.
By implementing Elasticsearch for these use cases, businesses can stay ahead of the curve, ensuring accuracy, compliance, and a competitive edge in a rapidly evolving market.