Implementing semantic search for the e-commerce use case can be tricky since product formatting is far from standardized. In this blog, we tackle this challenge by using query profiles in Elastic - a method that takes multiple metadata fields and transforms them into a piece of text that resembles a user's preferences and requests. Through a practical example, we demonstrate how query profiles can improve e-commerce search.
Introduction
Elasticsearch is naturally suited for e-Commerce data, by which I mean large quantities of product definitions, like this Amazon product dataset. Let's download the sample file, with 10,000 products, and upload the CSV to an Elastic index (I'm using my Elastic Cloud deployment) called amazon_product_10k.
When we look at the data, we see product descriptions like this one, about a superhero themed bobblehead for a character called Black Lightning:
{
"_index": "amazon_product_10k_plain_embed",
"_id": "F-Qi2JIBnZufN_5vn-sr",
"_version": 1,
"_score": 0,
"_ignored": [
"Image.keyword",
"product_specification.keyword",
"technical_details.keyword"
],
"_source": {
"selling_price": "$17.75",
"Category": "Toys & Games | Collectible Toys | Statues, Bobbleheads & Busts | Statues",
"shipping_weight": "3.7 pounds",
"product_specification": "ProductDimensions:3x3x12.4inches|ItemWeight:2pounds|ShippingWeight:3.7pounds(Viewshippingratesandpolicies)|DomesticShipping:ItemcanbeshippedwithinU.S.|InternationalShipping:ThisitemcanbeshippedtoselectcountriesoutsideoftheU.S.LearnMore|ASIN:B077SCH3B2|Itemmodelnumber:DEC170420|Manufacturerrecommendedage:15yearsandup",
"is_amazon_seller": "Y",
"id": "b4358a38037a7e7fbcd7fe16970e7bff",
"model_number": "DEC170420",
"Image": "https://images-na.ssl-images-amazon.com/images/I/41vJ5amvKeL.jpg|https://images-na.ssl-images-amazon.com/images/I/31tr4qwqZmL.jpg|https://images-na.ssl-images-amazon.com/images/I/31-%2BMGqJAsL.jpg|https://images-na.ssl-images-amazon.com/images/I/31sgD4%2B0HlL.jpg|https://images-na.ssl-images-amazon.com/images/I/51HbaURPW8L.jpg|https://images-na.ssl-images-amazon.com/images/G/01/x-locale/common/transparent-pixel.jpg",
"product_name": "DC Collectibles DCTV: Black Lightning Resin Statue",
"about_product": "Make sure this fits by entering your model number. | From the upcoming DCTV series on The CW | Limited Edition of 5,000 | Measures approximately 12.42\" tall | Sculpted by Alterton",
"url": "https://www.amazon.com/DC-Collectibles-DCTV-Lightning-Statue/dp/B077SCH3B2",
"technical_details": "Show up to 2 reviews by default Jefferson Pierce returns to the superhero fold as Black Lightning, the star of The CW's upcoming TV series Black Lightning! Limited edition of 5,000. Measures approximately 12.42\" tall. Sculpted by Alterton. | 3.7 pounds (View shipping rates and policies)"
}
}
A search use-case here involves the user looking for a product, making a request like:
superhero bobbleheads
For enthusiasts of semantic search, there is an immediate problem. The main source of searchable text is the product description, which looks like this:
Make sure this fits by entering your model number.
| From the upcoming DCTV series on The CW
| Limited Edition of 5,000
| Measures approximately 12.42" tall
| Sculpted by Alterton
This tells us almost nothing about the product. A naive approach might be to choose an embedding model, embed the description, and then do semantic search over it. This particular product would be unlikely to appear for a query like "superhero bobbleheads", regardless of the choice of embedding model. Let's try this and see what happens.
Naive semantic search
Go ahead and deploy elser_v2 with this command (make sure ML node autoscaling is enabled):
PUT _inference/sparse_embedding/elser_v2
{
"service": "elser",
"service_settings": {
"num_allocations": 4,
"num_threads": 8
}
}
And let's define a new index called amazon_product_10k_plain_embed, define the product description as a semantic_text type using our elser_v2 inference endpoint, and run a reindex:
PUT amazon_product_10k_plain_embed
{
"mappings": {
"properties": {
"about_product": {
"type": "semantic_text",
"inference_id": "elser_v2"
}
}
}
}
POST _reindex?slices=auto&wait_for_completion=false
{
"conflicts": "proceed",
"source": {
"index": "amazon_product_10k",
"size": 64
},
"dest": {
"index": "amazon_product_10k_plain_embed"
}
}
And run a semantic search:
GET amazon_product_10k_plain_embed/_search
{
"_source": ["about_product.text", "technical_details", "product_name"],
"retriever": {
"standard": {
"query": {
"nested": {
"path": "about_product.inference.chunks",
"query": {
"sparse_vector": {
"inference_id": "elser_v2",
"field": "about_product.inference.chunks.embeddings",
"query": "superhero bobblehead"
}
}
}
}
}
},
"size": 20
}
Behold. The product_names we get back are pretty bad.
1. Idea Max Peek-A-Pet Bobble Heads Flowers Corgi (Tea Cup)
2. Mezco Toyz Sons Of Anarchy 6" Clay Bobblehead
3. Funko Marvel Captain America Pop Vinyl Figure
A corgi is not a superhero, despite being a bobblehead. The Sons of Anarchy are not superheroes, they are motorcycle enthusiasts, and a Funko Pop is certainly not a bobblehead. So why are the results so awful?
The required information is actually in the Category and technical_details fields:
"Category": "Toys & Games | Collectible Toys | Statues, Bobbleheads & Busts | Statues",
"technical_details": "Show up to 2 reviews by default Jefferson
Pierce returns to the superhero fold as Black Lightning, the
star of The CW's upcoming TV series Black Lightning!
Limited edition of 5,000. Measures approximately 12.42\"
tall. Sculpted by Alterton. | 3.7 pounds
(View shipping rates and policies)"
This brings us to a second problem: only the Category field tells us that this product is a bobblehead, and only the technical_details field tells us that it is superhero related. So the next naive thing to do is embed all three fields, then do a vector search over all three, and hope that the combined score will place this product near the top of the results.
Other than the obvious tripling of compute and storage cost, we're also taking a leap of faith that the resulting trio of embeddings will not be noisy. That is a lot to hope for: the product description is largely irrelevant, and the Category and technical_details fields each contain only one word relevant to the search query.
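To make the point concrete, here is a small Python sketch that checks which fields of the Black Lightning document mention each query term. It is a plain case-insensitive substring check, used only to show where the terms live; it is not how ELSER scores documents. The field values are shortened copies of the document above, and the helper name fields_mentioning is my own.

```python
# Field values shortened from the Black Lightning document shown earlier.
product = {
    "about_product": (
        "Make sure this fits by entering your model number. | "
        "From the upcoming DCTV series on The CW | Limited Edition of 5,000"
    ),
    "Category": "Toys & Games | Collectible Toys | Statues, Bobbleheads & Busts | Statues",
    "technical_details": (
        "Jefferson Pierce returns to the superhero fold as Black Lightning, "
        "the star of The CW's upcoming TV series Black Lightning!"
    ),
}


def fields_mentioning(term, doc):
    """Return the names of the fields whose text contains `term` (case-insensitive)."""
    return [name for name, text in doc.items() if term.lower() in text.lower()]


for term in ("superhero", "bobblehead"):
    print(term, "->", fields_mentioning(term, product))
```

Each term turns up in exactly one field, and neither appears in about_product, which is the only field the first index embeds.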
Naive semantic search even harder
Let's just try it anyway and see what happens. Let's embed the three fields:
PUT amazon_product_10k_triple_embed_3
{
"mappings": {
"properties": {
"about_product": {
"type": "semantic_text",
"inference_id": "elser_v2"
},
"technical_details": {
"type": "semantic_text",
"inference_id": "elser_v2"
},
"Category": {
"type": "semantic_text",
"inference_id": "elser_v2"
}
}
}
}
POST _reindex?slices=auto&wait_for_completion=false
{
"conflicts": "proceed",
"source": {
"index": "amazon_product_10k",
"size": 64
},
"dest": {
"index": "amazon_product_10k_triple_embed_3"
}
}
And run another search using retrievers with Elastic's built-in reciprocal rank fusion.
GET amazon_product_10k_triple_embed_3/_search
{
"retriever": {
"rrf": {
"retrievers": [
{
"standard": {
"query": {
"nested": {
"path": "about_product.inference.chunks",
"query": {
"sparse_vector": {
"inference_id": "elser_v2",
"field": "about_product.inference.chunks.embeddings",
"query": "superhero bobblehead"
}
},
"inner_hits": {
"size": 2,
"name": "amazon_product_10k_triple_embed_3.about_product",
"_source": [
"about_product.inference.chunks.text"
]
}
}
}
}
},
{
"standard": {
"query": {
"nested": {
"path": "Category.inference.chunks",
"query": {
"sparse_vector": {
"inference_id": "elser_v2",
"field": "Category.inference.chunks.embeddings",
"query": "superhero bobblehead"
}
},
"inner_hits": {
"size": 2,
"name": "amazon_product_10k_triple_embed_3.Category",
"_source": [
"Category.inference.chunks.text"
]
}
}
}
}
},
{
"standard": {
"query": {
"nested": {
"path": "technical_details.inference.chunks",
"query": {
"sparse_vector": {
"inference_id": "elser_v2",
"field": "technical_details.inference.chunks.embeddings",
"query": "superhero bobblehead"
}
},
"inner_hits": {
"size": 2,
"name": "amazon_product_10k_triple_embed_3.technical_details",
"_source": [
"technical_details.inference.chunks.text"
]
}
}
}
}
}
]
}
}
}
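For intuition about what the rrf retriever does with the three result lists, here is a minimal Python sketch of reciprocal rank fusion. Each document's fused score is the sum of 1 / (rank_constant + rank) over the lists it appears in. The document IDs are made up, and I'm assuming Elastic's documented default rank_constant of 60; this illustrates the formula, not Elasticsearch's actual implementation.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, rank_constant=60):
    """Fuse several ranked lists of document IDs into one list of (id, score) pairs.

    `rankings` is a list of lists, each ordered from best to worst hit.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (rank_constant + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical top hits from the three per-field retrievers.
about_product_hits = ["a", "b", "c"]
category_hits = ["b", "a", "d"]
technical_details_hits = ["c", "b", "e"]

fused = reciprocal_rank_fusion(
    [about_product_hits, category_hits, technical_details_hits]
)
for doc_id, score in fused:
    print(doc_id, round(score, 4))
```

Document "b" never tops more than one list, yet it wins because it ranks well in all three. The flip side is that RRF only looks at ranks: if each list is noisy, the fused list inherits that noise.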
Behold, the results are actually worse than before:
1. Sunny Days Entertainment Bendems Collectible Posable Figures - Bobs Burgers: Bob
2. Star Wars Childs Boba Fett Costume, Medium
3. DIAMOND SELECT TOYS Batman The Animated Series: Ra's Al Ghul Resin Bust Figure
I have a colleague (@Jeff Vestal) who would probably really enjoy Bob from Bob's Burgers coming out as the top result for superhero merchandise.
So what now?
HyDE
Here's a thought. Let's use an LLM to improve the quality of our data, and make vector search a little bit more effective. There is a technique called HyDE (Hypothetical Document Embeddings). The proposal is pretty intuitive. The query contains key content, namely the keywords "superhero" and "bobblehead". However, it does not capture the form and structure of the documents we are actually searching over. In other words, search queries do not resemble indexed documents in form, though they may share content. So with both keyword and semantic search, we match content to content, but we do not match form to form.
HyDE uses an LLM to transform queries into hypothetical documents, which capture relevance patterns but do not contain real content that would answer the query. The hypothetical document is then embedded and used for vector search. In short, we match form with form, as well as content with content.
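The HyDE flow can be sketched in a few lines of Python. Everything here is a toy stand-in chosen so the example runs on its own: fake_llm returns a canned hypothetical document instead of calling a model, embed is a bag-of-words counter rather than a learned embedding, and the two documents are invented. Only the shape of the flow (query, then hypothetical document, then embedding, then nearest document) is the point.

```python
import math
import re
from collections import Counter


def fake_llm(query):
    """Stand-in for a real LLM call: returns a canned hypothetical document."""
    return ("A collectible resin statue of a superhero character, "
            "limited edition, about 12 inches tall.")


def embed(text):
    """Toy embedding: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two count vectors."""
    dot = sum(a[term] * b[term] for term in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def hyde_search(query, docs):
    """Embed the hypothetical document (not the raw query) and return the closest doc ID."""
    query_vec = embed(fake_llm(query))
    return max(docs, key=lambda doc_id: cosine(query_vec, embed(docs[doc_id])))


docs = {
    "statue": "Limited edition resin statue of a TV superhero, measures 12 inches tall",
    "costume": "Child costume with mask and cape, size medium",
}
print(hyde_search("superhero bobblehead", docs))
```

The hypothetical document looks like a product listing, so it lines up with the listing-shaped text in the index far better than the two-word query would.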
Let's modify the idea a little bit for e-Commerce.
Query profiles
What I call a query profile is really about taking multiple metadata fields and transforming them into a piece of text that resembles a user's preferences and likely requests. The LLM is instructed to create a document that mimics what a user might ask for when searching for a product. This query profile is then embedded, and subsequent vector searches are done on it. The flow looks like this: product metadata fields go into an LLM prompt, the LLM writes a query profile, the query profile is embedded, and user queries are matched against that embedding.
I think that there are two main advantages to this method:
- It consolidates information from multiple fields into a single document.
- It captures the likely forms of a user's request, covering as many bases as possible when doing so.
The resulting text is information rich and might give us better results when searched over. Let's implement it and see what happens.
Implementing query profiles
We will define a pipeline in Elasticsearch with an LLM processor. I'm going to make use of GPT-4o mini from my company's Azure OpenAI deployment, so let's define the inference endpoint like so:
PUT _inference/completion/azure_openai_gpt4omini_completion
{
"service": "azureopenai",
"service_settings": {
"api_key": "<YOUR API KEY>",
"resource_name": "<YOUR RESOURCE NAME>",
"deployment_id": "gpt-4o-mini",
"api_version": "2024-06-01"
}
}
Now let's define an ingest pipeline containing the query profile prompt.
PUT _ingest/pipeline/amazon_10k_query_profile_pipeline
{
"processors": [
{
"script": {
"source": """
ctx.query_profile_prompt = 'Given a {product}, create a detailed query
profile written from a customers perspective describing what they are
looking for when shopping for this exact item. Include key characteristics
like type, features, use cases, quality aspects, materials, and target user.
Focus on aspects a shopper would naturally mention in their search query.
Format: descriptive text without bullet points or sections.
Example: "Looking for a high-end lightweight carbon fiber road bike for
competitive racing with electronic gear shifting and aerodynamic frame
design suitable for experienced cyclists who value performance and speed."
Describe this product in natural language that matches how real customers
would search for it.
Here are the product details:
\\n Product Name:\\n' + ctx.product_name
+ '\\nAbout Product:\\n' + ctx.about_product
+ '\\nCategory:\\n' + ctx.Category
+ '\\nTechnical Details:\\n' + ctx.technical_details
"""
}
},
{
"inference": {
"model_id": "azure_openai_gpt4omini_completion",
"input_output": {
"input_field": "query_profile_prompt",
"output_field": "query_profile"
},
"on_failure": [
{
"set": {
"description": "Index document to 'failed-<index>'",
"field": "_index",
"value": "failed-{{{ _index }}}"
}
}
]
}
},
{
"remove": {
"field": "query_profile_prompt"
}
}
]
}
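Before spending LLM calls on 10,000 documents, it can help to preview the prompt for a single product. Below is a rough Python mirror of the script processor. The function name and the shortened prompt header are my own, and unlike the Painless script (which concatenates every field unconditionally) this sketch skips fields that are missing, which is a design choice rather than something the pipeline above does.

```python
# Shortened stand-in for the full prompt text used in the pipeline.
PROMPT_HEADER = (
    "Given a product, create a detailed query profile written from a customer's "
    "perspective describing what they are looking for when shopping for this "
    "exact item.\nHere are the product details:"
)

# (label shown in the prompt, field name in the source document)
PROMPT_FIELDS = [
    ("Product Name", "product_name"),
    ("About Product", "about_product"),
    ("Category", "Category"),
    ("Technical Details", "technical_details"),
]


def build_query_profile_prompt(doc):
    """Concatenate the header and the labelled fields, skipping missing ones."""
    parts = [PROMPT_HEADER]
    for label, field in PROMPT_FIELDS:
        value = doc.get(field)
        if value:  # skip missing or empty fields instead of printing "None"
            parts.append(f"{label}:\n{value}")
    return "\n".join(parts)


sample = {
    "product_name": "DC Collectibles DCTV: Black Lightning Resin Statue",
    "about_product": "From the upcoming DCTV series on The CW | Limited Edition of 5,000",
    "Category": "Toys & Games | Collectible Toys | Statues, Bobbleheads & Busts | Statues",
    # technical_details deliberately left out to show the skip behaviour
}
prompt = build_query_profile_prompt(sample)
print(prompt)
```

Reading one or two of these prompts by eye is a cheap way to catch a misspelled field name before it quietly turns into the word "null" in every prompt.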
We'll run a reindex to create the new field with our LLM integration:
POST _reindex?slices=auto&wait_for_completion=false
{
"conflicts": "proceed",
"source": {
"index": "amazon_product_10k",
"size": 32
},
"dest": {
"index": "amazon_product_10k_w_query_profiles",
"pipeline": "amazon_10k_query_profile_pipeline",
"op_type": "create"
}
}
And once it's done, we'll define another index with the query profiles set to semantic_text, and run the embedding with ELSER. I like to split the processing into two stages, so I can hang on to the fruits of the LLM's labor separately from the embeddings. Call this insurance against an act of God.
PUT amazon_product_10k_query_embed
{
"mappings": {
"properties": {
"query_profile": {
"type": "semantic_text",
"inference_id": "elser_v2"
}
}
}
}
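The text describes this second stage but doesn't show the reindex call that fills the new index. Here is a sketch of what that request could look like, built as a Python dict so it can be checked and then pasted into Console. The index names come from the requests above; the batch size of 32 is an assumption carried over from the previous reindex, and no ingest pipeline is set because the semantic_text mapping handles the embedding.

```python
import json

SOURCE_INDEX = "amazon_product_10k_w_query_profiles"  # output of the LLM pipeline
DEST_INDEX = "amazon_product_10k_query_embed"  # index with the semantic_text mapping

reindex_body = {
    "conflicts": "proceed",
    "source": {"index": SOURCE_INDEX, "size": 32},  # batch size is an assumption
    "dest": {"index": DEST_INDEX},
}

# Print the request in the same Console style used throughout this post.
print("POST _reindex?slices=auto&wait_for_completion=false")
print(json.dumps(reindex_body, indent=2))
```

Keeping this step separate from the LLM step means the embeddings can be rebuilt, or swapped for a different model, without paying for the query profiles again.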
Let's now run the same query again, this time using semantic search over the query profiles, and see what we get:
GET amazon_product_10k_query_embed/_search
{
"_source": ["about_product", "technical_details", "product_name"],
"retriever": {
"standard": {
"query": {
"nested": {
"path": "query_profile.inference.chunks",
"query": {
"sparse_vector": {
"inference_id": "elser_v2",
"field": "query_profile.inference.chunks.embeddings",
"query": "superhero bobblehead"
}
}
}
}
}
},
"size": 20
}
The results are quite a bit better, actually.
1. FOCO DC Comics Justice League Character Bobble, Superman
2. The Tin Box Company Batman Bobble Head Bank, Black
3. Potato Head MPH Marvel Mashup Hawkeye & Iron Man Toy
Okay, result number 3 is a potato head. Fine. But 1 and 2 are actual superhero bobble heads, so I'm going to take this as a win.
Conclusion
Implementing semantic search for an e-Commerce use-case brings unique challenges. The distributed nature of e-Commerce means that product formatting is far from standardized. As such, embedding specific fields and trying to do semantic search isn't going to work as well as it does for searching articles or other "full" texts.
By using an LLM to create a query profile, we can create a piece of text that merges information from multiple fields and simultaneously captures what a user is likely to search for.
By doing this, we change the e-Commerce search problem so that it resembles a typical knowledge base search, which is a problem statement where semantic search performs quite well.
I think this provides a pathway to bringing traditional search use-cases with structured or tabular data into the semantic/LLM-era, and that's pretty cool.