
Building intelligent duplicate detection with Elasticsearch and AI

Explore how organizations can leverage Elasticsearch to detect and handle duplicates in loan or insurance applications.

Elasticsearch is packed with new features to help you build the best search solutions for your use case. Dive into our sample notebooks to learn more, start a free cloud trial, or try Elastic on your local machine now.

Imagine this scenario: A customer applies for a $50,000 loan under "Katherine Johnson" on Monday, then submits another application as "Kate Johnson" on Wednesday. Same person, same address, but your system sees them as different applicants. The result? Double approval, regulatory violation, and significant financial loss.

This challenge goes beyond simple typos. It encompasses sophisticated fraud schemes, genuine user errors, and the complex reality of how people actually fill out forms. When your duplicate detection fails, you're not just losing money—you're violating regulations, damaging customer trust, and creating operational chaos.

Let’s build a solution that can catch these duplicates while processing thousands of applications in the backend. Using Elasticsearch's phonetic search capabilities and an LLM, we'll create a system that's both powerful and practical.

The hidden complexity of duplicate detection

The challenge of duplicate detection runs deeper than most organizations realize. Consider these real-world scenarios that traditional systems struggle with:

Name variations that fool systems:

  • Classic misspellings: "John Smith" vs "Jon Smith" vs "Jonathon Smith" – Is it a typo or a different person?
  • Phonetic confusion: "Shawn" vs "Shaun" vs "Sean" vs "Shon" – Same pronunciation, different spellings
  • Nicknames and variations: "Alexander" vs "Alex" vs "Xander" – One person, multiple identities

Address challenges that create blind spots:

  • Street abbreviations: "123 Maple Street" vs "123 Maple St" vs "123 Maple Avenue" – Same location, different formats
  • Apartment variations: "Unit 5, 42 Elm Road" vs "42 Elm Rd, Apt 5" – Same address, different structure
  • City name confusion: "Los Angeles" vs "LA" – Geographic variations that systems miss

Family relationships that complicate detection:

  • Similar names, same address: "Bob Brown" and "Rob Brown" at "789 Pine Rd" – They're twins, not duplicates
  • Generational suffixes: "James Carter Sr." vs "James Carter Jr." – A single letter difference that changes everything

Without a reliable way to flag such duplicate records, organizations might inadvertently approve multiple loans or insurance policies for the same customer, violating eligibility criteria, increasing the risk of defaults, and causing revenue loss.

The solution

By combining Elasticsearch with modern AI models, we can build a smart, scalable, and cost-effective solution to identify and remove duplicate records.

Phonetic search for names:

Elasticsearch supports phonetic algorithms that help find names that sound alike, even if they’re spelled differently. For example, names like “Smith” and “Smyth” are treated the same because they’re pronounced the same way. This allows the system to catch duplicates that a basic text match would miss. You can think of it as teaching the search engine to “listen” to names the way people do—so “John” and “Jon” are understood as the same.
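For instance, once the phonetic index from Step 2 below exists, the _analyze API makes this concrete. The snippet here is a minimal illustration, not part of the original walkthrough; the endpoint, index name, and analyzer name are placeholders that match what we define later.

```python
# A minimal illustration, assuming the "applications" index and "name_metaphone" analyzer from Step 2 exist
from elasticsearch import Elasticsearch

es = Elasticsearch("https://your-deployment.es.example.com:443", api_key="your-api-key")  # placeholders

for name in ["Smith", "Smyth", "John", "Jon"]:
    result = es.indices.analyze(index="applications", analyzer="name_metaphone", text=name)
    print(name, [token["token"] for token in result["tokens"]])
# "Smith"/"Smyth" and "John"/"Jon" share the same phonetic token, so they match each other at query time.
```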

Handling address variations in searches:

User-provided addresses often vary in format. We use an AI model to generate different forms or synonyms of an address—like “Syd.” and “Sydney”, “Bengaluru” and “Bangalore”—and use these variations to make our Elasticsearch queries more effective. This helps match addresses even when the user input doesn’t exactly match what’s stored in the system.

Using AI for deduplication checks:

Once we retrieve possible matches from Elasticsearch, we pass them to an AI model that checks for duplicates. While we could use algorithms like Levenshtein or Jaro-Winkler instead of an AI model, things get complicated when you add more fields like date of birth, national ID, or phone numbers. Using an LLM instead adds flexibility and simplifies the logic by looking at the data holistically, making it easier to identify true duplicates across multiple fields.

Architecture overview

Here's how our solution works at a high level:

Try the experience yourself!

Pre-requisites and setup

Before we dive into implementation, let's make sure we have everything we need.

Required infrastructure

  • Elasticsearch cluster – You will need access to an Elasticsearch cluster. For this setup, I used Elastic Cloud Hosted version 9.0.0. If you do not have a cluster ready, you have a couple of options:
    • Elastic Cloud - You can create a new cluster here and choose between an Elastic Cloud Hosted or an Elasticsearch Serverless option.
    • Local Setup - If you prefer running it locally, you can spin up a cluster using the provided script here.
  • Phonetic analysis plugin – To support phonetic name matching, make sure the Phonetic Analysis Plugin is enabled in your Elasticsearch setup.
  • Ollama LLM server – Since we are dealing with sensitive details like names, addresses, and dates of birth, we recommend setting up a local LLM. We can run something lightweight like Llama 3 8B using Ollama. It’s fast, runs locally, and works well for this kind of data handling.
    • To get started, download and install the Ollama version compatible with your operating system from here.
    • Once installed, run ollama run llama3:8b to pull and run the model.
  • Sample dataset – To test the setup and mimic real-world cases, I’ve prepared a small dataset with variations in names, addresses, and other subtle differences. You can download the 101-record sample dataset from this link.

Screenshot of the sample dataset is below:

Development environment:
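Install the Python dependencies with pip; localtunnel is a Node.js utility, so it comes from npm. The exact package list below is inferred from the descriptions that follow, so adjust it to your environment.

```bash
pip install elasticsearch pandas langchain-community streamlit
npm install -g localtunnel   # Node.js utility used to expose the local app
```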

These installations provide:

  • elasticsearch: Client library for Elasticsearch connectivity
  • pandas: Data processing and CSV handling
  • langchain-community: LangChain community integrations used to call the local Ollama model for AI analysis
  • streamlit: Interactive web interface
  • localtunnel: Exposes the locally running Streamlit app through a public URL

Step 1: Connecting to Elasticsearch

We need the Elasticsearch endpoint and API Key for authentication.

Getting your Elasticsearch endpoint:

  1. Log into Elastic Cloud
  2. Navigate to your deployment
  3. Copy the Elasticsearch endpoint from the deployment overview

Creating an API key:

  1. Open Kibana from your deployment
  2. Navigate to Stack Management → API Keys
  3. Create a new API key and store it securely

Once you have your credentials, establish the connection.

Save and execute the code below in a file named “es_connect.py”. Remember to include values for ES_URL and API_KEY.
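Here is a minimal sketch of that script; the endpoint and API key values are placeholders you must replace with your own.

```python
# es_connect.py -- a minimal sketch; replace ES_URL and API_KEY with your own values
from elasticsearch import Elasticsearch

ES_URL = "https://your-deployment.es.us-central1.gcp.cloud.es.io:443"  # Elasticsearch endpoint (placeholder)
API_KEY = "your-api-key"                                               # API key created in Kibana (placeholder)

es = Elasticsearch(ES_URL, api_key=API_KEY)

if __name__ == "__main__":
    # Quick sanity check that the connection works
    print(es.info())
```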

Step 2: Creating the index template

The heart of our duplicate detection system lies in the index configuration, especially where we tell Elasticsearch to generate phonetic codes from names. We'll create an index template that combines phonetic matching on names with full-text matching on addresses.

Understanding phonetic analyzers:

Our template uses two phonetic algorithms:

  • Double Metaphone: Handles complex phonetic variations and works well with diverse names
  • Soundex: Provides consistent coding for similar-sounding names

Here's our complete index template.

Save and execute the code below into a file named “create_index_template.py”
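A minimal sketch of that template is below; the index pattern ("applications*") and the field names ("name", "address") are assumptions, so adjust them to match your dataset's columns.

```python
# create_index_template.py -- a minimal sketch of the phonetic index template
# Assumes the Phonetic Analysis plugin is installed and the sample CSV has "name" and "address" columns.
from es_connect import es

template_body = {
    "settings": {
        "analysis": {
            "filter": {
                # Phonetic token filters provided by the Phonetic Analysis plugin
                "double_metaphone_filter": {"type": "phonetic", "encoder": "double_metaphone", "replace": False},
                "soundex_filter": {"type": "phonetic", "encoder": "soundex", "replace": False},
            },
            "analyzer": {
                "name_metaphone": {"tokenizer": "standard", "filter": ["lowercase", "double_metaphone_filter"]},
                "name_soundex": {"tokenizer": "standard", "filter": ["lowercase", "soundex_filter"]},
            },
        }
    },
    "mappings": {
        "properties": {
            "name": {
                "type": "text",
                "fields": {
                    "metaphone": {"type": "text", "analyzer": "name_metaphone"},
                    "soundex": {"type": "text", "analyzer": "name_soundex"},
                },
            },
            "address": {
                "type": "text",
                "fields": {"keyword": {"type": "keyword"}},
            },
        }
    },
}

es.indices.put_index_template(
    name="applications-template",
    index_patterns=["applications*"],
    template=template_body,
)
print("Index template created")
```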

What this template does

  • Phonetic processing: Names are automatically converted to phonetic codes during indexing
  • Multi-field analysis: Each name is analyzed with both Double Metaphone and Soundex
  • Address optimization: Addresses are indexed for both full-text and exact matching
  • Flexible matching: The template supports various search strategies for different use cases

Step 3: Loading and indexing data

Now, let's load our sample dataset and index it into Elasticsearch for searching. Save and execute the following code into a file named “index_csv_data.py”
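A minimal sketch of that script is shown here; the CSV path and index name are assumptions, and the column names are taken directly from the file.

```python
# index_csv_data.py -- a minimal sketch; the CSV path and index name are assumptions
import pandas as pd
from elasticsearch.helpers import bulk
from es_connect import es

INDEX_NAME = "applications"      # matches the "applications*" pattern from the template
CSV_PATH = "applications.csv"    # path to the downloaded 101-record sample dataset

df = pd.read_csv(CSV_PATH)

def generate_actions():
    # One bulk action per CSV row; column names become document fields
    for _, row in df.iterrows():
        yield {"_index": INDEX_NAME, "_source": row.to_dict()}

success, _ = bulk(es, generate_actions())
print(f"Indexed {success} documents into {INDEX_NAME}")
```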

Step 4: Initiate the Llama model and generate address variations

We take the address entered by the end user and make an LLM call to generate variations of it, which helps smooth over formatting differences and regional nuances.

For instance, if the user enters "123 Maple St., Syd," the model will generate keywords such as ["123 Maple St., Sydney","Street","Str","Sydnei","Syd"].
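A sketch of this step using the Ollama integration from langchain-community is shown below; the prompt wording and helper name are illustrative rather than the exact original code.

```python
# Address-variation helper -- a minimal sketch; the prompt wording is illustrative
from langchain_community.llms import Ollama

llm = Ollama(model="llama3:8b")  # the local model pulled earlier with `ollama run llama3:8b`

def generate_address_variations(address: str) -> list[str]:
    """Ask the local model for common variants of an address (abbreviations, expansions, alternate spellings)."""
    prompt = (
        "Generate common variations of the following address, including abbreviations, "
        "expanded forms, and alternate spellings. Respond with a comma-separated list only.\n"
        f"Address: {address}"
    )
    response = llm.invoke(prompt)
    return [variant.strip() for variant in response.split(",") if variant.strip()]
```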

Step 5: Build the final search query

The search query is built using the name and the address variations generated above.
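A sketch of the candidate search is shown below, assuming the name.metaphone and name.soundex sub-fields defined in the index template from Step 2.

```python
# Candidate search -- a minimal sketch using the phonetic sub-fields from Step 2
from es_connect import es

INDEX_NAME = "applications"

def search_candidates(name: str, address_variations: list[str]) -> list[dict]:
    """Find records whose name sounds similar or whose address matches any generated variation."""
    query = {
        "bool": {
            "should": [
                {"match": {"name.metaphone": name}},
                {"match": {"name.soundex": name}},
                {"match": {"address": " ".join(address_variations)}},
            ],
            "minimum_should_match": 1,
        }
    }
    response = es.search(index=INDEX_NAME, query=query, size=10)
    return [hit["_source"] for hit in response["hits"]["hits"]]
```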

Step 6: Check for duplicates

The search query above finds potential matches. These names and addresses are then supplied to the model as context. Using the function below, we prompt the model to calculate the probability that they are duplicates.
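Here is a sketch of that duplicate-check function, reusing the llm instance from Step 4; the prompt and output format are illustrative.

```python
# Duplicate analysis -- a minimal sketch; the prompt and output format are illustrative
import json

def check_duplicates(name: str, address: str, candidates: list[dict]) -> str:
    """Ask the model to score each candidate record as a potential duplicate of the new application."""
    prompt = (
        "You are a deduplication assistant. For each existing record below, estimate the "
        "probability (0-100%) that it refers to the same person as the new application, "
        "and briefly explain your reasoning.\n"
        f"New application: name={name}, address={address}\n"
        f"Existing records: {json.dumps(candidates)}"
    )
    return llm.invoke(prompt)
```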

Step 7: Creating the Streamlit interface

Now, let's use the Streamlit code below to create a clean, intuitive interface.
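A stripped-down sketch of the UI, wiring together the helper functions from Steps 4-6, looks like this.

```python
# Streamlit UI -- a minimal sketch that wires the helpers above together
import streamlit as st

st.title("Duplicate Application Detector")

name = st.text_input("Applicant name")
address = st.text_input("Applicant address")

if st.button("Check for duplicates") and name and address:
    variations = generate_address_variations(address)
    candidates = search_candidates(name, variations)
    if candidates:
        st.write(check_duplicates(name, address, candidates))
    else:
        st.info("No potential duplicates found.")
```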

Step 8: Execute and test the system

To optimize performance by preventing repeated model reloads and excessive connection openings, the code from Steps 4, 5, 6, and 7 will be consolidated into a single file named app.py. This integrated file will then be used to launch the Streamlit UI.
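With everything consolidated into app.py, launch the UI and, if you want to share it, expose it through localtunnel. Streamlit's default port of 8501 is assumed here.

```bash
streamlit run app.py
# Optional: expose the local app through a public URL
npx localtunnel --port 8501
```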

After execution, a UI is launched where you can enter a name and an address. The results, displayed in a table sorted by match percentage, include an explanation for each potential duplicate, as shown in the screenshot below.

Use cases beyond loans and insurance applications

Deduplication has wide-ranging applications across various industries and domains. Here are some key examples:

  1. Government and public services - Flagging duplicate voter registrations, tax records, social security applications, or public welfare program registrations.
  2. Customer Relationship Management (CRM) - Identifying duplicate customer records in CRM databases to improve data quality and avoid redundancy.
  3. Healthcare systems - Detecting duplicate patient records in hospital management systems to ensure accurate medical history and billing.
  4. E-commerce platforms - Identifying duplicate product listings or seller profiles to maintain a clean catalog and enhance the user experience.
  5. Real estate and property management - Spotting duplicate listings of properties or tenants in property management systems.

Conclusion: Elasticsearch for deduplication pipeline

What we've built here demonstrates how combining Elasticsearch's phonetic capabilities with local LLM processing creates a robust deduplication pipeline that addresses real-world complexity.

Firstly, we prepared the cluster with the required dataset and hosted a local model. Then, to find a similar match, we queried Elasticsearch using key entities like name, address, and address variations. Later, the Elastic response was passed as context to the model for duplication analysis. Based on the instructions, the model decided which record from Elastic was a possible duplicate.

Remember that duplicate detection is not a one-time project–it's an ongoing process that improves with experience. The AI components learn from feedback, the search algorithms can be refined based on results, and the system becomes more accurate over time.

By implementing Elasticsearch for these use cases, businesses can stay ahead of the curve, ensuring accuracy, compliance, and a competitive edge in a rapidly evolving market.

Related content

Ready to build state-of-the-art search experiences?

Sufficiently advanced search isn’t achieved with the efforts of one. Elasticsearch is powered by data scientists, ML ops, engineers, and many more who are just as passionate about search as you are. Let’s connect and work together to build the magical search experience that will get you the results you want.

Try it yourself