We’ve gone on quite a journey with the semantic_text field type, and this latest iteration promises to make semantic search simpler than ever. In addition to streamlining the semantic_text representation in _source, benefits include reduced verbosity, more efficient disk utilization, and better integration with other Elasticsearch features. You can now use highlighting to retrieve the chunks most relevant to your query. And perhaps best of all, it is now a generally available (GA) feature!
Semantic search evolution
Our approach to semantic search has evolved over the years, with the goal of making it as simple as possible. Prior to the semantic_text field type, performing semantic search required manually:
- Configuring your mappings to be compatible with your embeddings.
- Configuring an ingest pipeline to use an ML model to generate embeddings.
- Using the pipeline to ingest your docs.
- Using the same ML model at query time to generate embeddings for your query text.
We called this setup “easy” at the time, but we knew we could make it far simpler than this. Enter semantic_text.
In the beginning
We introduced semantic_text in Elasticsearch 8.15 with the goal of simplifying semantic search. If semantic_text is new to you, we suggest reading our original blog post about it first for background about our approach.
We released semantic_text as a beta feature first for a good reason. It’s a well-known truism in software development that making something simple can be quite difficult, and semantic_text is no exception. There are a lot of moving pieces behind the scenes that enable the magical semantic_text experience. We wanted to take the time to make sure we had those pieces right before moving the feature out of beta. That time was well spent: we iterated on our original approach, adding features and streamlining storage to create a simpler, leaner version of semantic_text that is more supportable in the long term.
Our original implementation relied on modifying _source to store inference results. This meant that semantic_text fields had a relatively complex subfield structure:
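As a rough sketch of that structure, the beta-era _source for a single document might have looked like the Python dict below. The field name infer_field and the text and inference subfields come from this post; the endpoint ID, the chunk contents, and the exact keys inside inference are assumptions made for illustration.

```python
import json

# Illustrative beta-era _source for one document. The field name infer_field
# is taken from this post; every other value is a made-up placeholder.
beta_source = {
    "infer_field": {
        # The value originally provided at index time.
        "text": "these are not the droids you're looking for",
        # Inference results that the beta implementation wrote into _source.
        "inference": {
            "inference_id": "my-elser-endpoint",  # hypothetical endpoint ID
            "chunks": [
                {
                    "text": "these are not the droids you're looking for",
                    # Placeholder sparse embedding; real ones are far larger.
                    "embeddings": {"droids": 1.2, "looking": 0.7},
                }
            ],
        },
    }
}

# Reading the original value required reaching into the text subfield.
original_value = beta_source["infer_field"]["text"]
print(original_value)
print(json.dumps(beta_source, indent=2))
```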
This structure created a few problems:
- It was needlessly verbose. In addition to the original value, it contained metadata and chunk information, which made API responses hard to read and larger than necessary.
- It increased index sizes on disk. Embeddings, which can be quite large, were effectively being stored twice: once in the Lucene index for retrieval purposes and again in _source. This significantly impacted the ability to use semantic_text at scale for larger datasets.
- It was unintuitive to manage. The original value provided was under the text subfield, which meant special handling was required to get this value for follow-up actions. This meant that semantic_text field values didn’t act like other field values in the text family, which had numerous knock-on effects that complicated our efforts to integrate it into higher-level workflows.
Semantic text as text
Our revised implementation addresses those friction points by simplifying how we represent semantic_text in _source. Instead of using a complex subfield structure to store metadata and chunk information directly within the semantic_text field, we use a hidden metafield for this purpose. This means we no longer need to modify _source to store inference results. In practical terms, the document _source that you provide to us for indexing is the same _source that you will get back upon document retrieval.
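To make that concrete, here is a minimal sketch that compares a document as provided for indexing with a hand-written stand-in for a Get API response under the GA representation. The index name, document ID, and text are placeholders, and the response is constructed by hand rather than fetched from a cluster.

```python
# The document as provided for indexing (field name and text are placeholders).
doc_to_index = {"infer_field": "these are not the droids you're looking for"}

# Hand-written stand-in for a Get API response under the GA representation:
# _source comes back exactly as provided, with no text or inference subfields.
get_response = {
    "_index": "test-index",  # hypothetical index name
    "_id": "1",
    "found": True,
    "_source": dict(doc_to_index),
}

# The semantic_text field value is now a plain string, like any text field.
round_trips = get_response["_source"] == doc_to_index
print(round_trips)
```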
Notice that there are no longer subfields like text or inference in the _source representation. Instead, the _source is as you provided it. So much simpler!
🚨 Note that if you parse semantic_text field values returned in search results or Get APIs, this is a breaking change. That is to say, if you parse the infer_field.text subfield value, you will need to update your code to instead parse the infer_field value. We try our best to avoid breaking changes, but this was an unavoidable side-effect of removing the subfield structure from _source.
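If client code has to read from both beta and GA indices during a transition, one option is a small helper that accepts either shape. This is a sketch, not part of any Elastic client: the helper name original_text is hypothetical, and it assumes the beta value is an object with a text key while the GA value is the provided value itself.

```python
from typing import Any


def original_text(field_value: Any) -> Any:
    """Return the originally provided value of a semantic_text field.

    Handles both shapes described in this post:
    - beta: an object whose text subfield holds the original value
    - GA:   the original value itself
    """
    if isinstance(field_value, dict) and "text" in field_value:
        return field_value["text"]  # beta representation
    return field_value  # GA representation


# Placeholder values standing in for infer_field in two _source documents.
beta_value = {"text": "hello world", "inference": {"chunks": []}}
ga_value = "hello world"

print(original_text(beta_value) == original_text(ga_value))
```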
There are numerous benefits to this _source representation simplification:
- Simpler to work with. You no longer need to parse a complex subfield structure to get the original text value; you can just take the field value as the original value.
- Less verbose. Metadata and chunk information no longer clutter up API responses.
- More efficient disk utilization. Embeddings are no longer stored in _source.
- Better integration. It allows semantic_text to integrate better with other Elasticsearch features, such as multi-fields, partial document updates, and reindexing.
Let’s expand on that last point a bit because it covers a few areas. With this simplification, semantic_text fields can now be used as the source and target of multi-fields:
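A sketch of what such a mapping could look like, built as a Python dict and printed as a create-index request body. The field names title and body, the sub-field names, and the endpoint ID are all hypothetical, and the choice of a keyword sub-field under the semantic_text field is an assumption rather than something this post specifies.

```python
import json

INFERENCE_ID = "my-elser-endpoint"  # hypothetical inference endpoint ID

create_index_body = {
    "mappings": {
        "properties": {
            # semantic_text as the target of a multi-field:
            # a text field with a semantic_text sub-field.
            "title": {
                "type": "text",
                "fields": {
                    "semantic": {
                        "type": "semantic_text",
                        "inference_id": INFERENCE_ID,
                    }
                },
            },
            # semantic_text as the source of a multi-field:
            # a semantic_text field with its own sub-field (assumed keyword).
            "body": {
                "type": "semantic_text",
                "inference_id": INFERENCE_ID,
                "fields": {"keyword": {"type": "keyword"}},
            },
        }
    }
}

print(json.dumps(create_index_body, indent=2))
```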
semantic_text fields now also support partial document updates through the Bulk API:
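As an illustration, the helper below builds the newline-delimited body of a Bulk API request containing one partial update. The helper name, index name, document ID, and new text are placeholders; the body would be sent to the _bulk endpoint with whichever HTTP client you prefer.

```python
import json


def bulk_partial_update(index: str, doc_id: str, partial_doc: dict) -> str:
    """Build the NDJSON body for a single Bulk API partial update."""
    action = {"update": {"_index": index, "_id": doc_id}}
    payload = {"doc": partial_doc}
    # Bulk bodies are newline-delimited JSON and must end with a newline.
    return json.dumps(action) + "\n" + json.dumps(payload) + "\n"


# Update only the semantic_text field, leaving the rest of the document alone.
body = bulk_partial_update(
    "test-index", "1", {"infer_field": "updated inference field"}
)
print(body, end="")
```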
And you can now reindex into a semantic_text field that uses a different inference_id:
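A minimal sketch of the request bodies involved, assuming two indices whose semantic_text field shares a name but points at different inference endpoints. All index names and endpoint IDs are hypothetical. Because _source now holds only the original text, the expectation is that the destination index generates its own embeddings as documents arrive.

```python
import json


def semantic_index_body(field: str, inference_id: str) -> dict:
    """Create-index body with a single semantic_text field."""
    return {
        "mappings": {
            "properties": {
                field: {"type": "semantic_text", "inference_id": inference_id}
            }
        }
    }


# Source and destination use the same field name but different endpoints.
source_index_body = semantic_index_body("infer_field", "my-elser-endpoint")
dest_index_body = semantic_index_body("infer_field", "my-e5-endpoint")

# Body for the Reindex API; no script is needed because the GA _source
# already holds the plain text value.
reindex_body = {
    "source": {"index": "source-index"},
    "dest": {"index": "dest-index"},
}

print(json.dumps(dest_index_body, indent=2))
print(json.dumps(reindex_body, indent=2))
```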
Semantic highlighting
One of the most requested semantic_text features is the ability to retrieve the most relevant chunks within a semantic_text field. This functionality is critical for RAG use cases. Up until now, we have (unofficially) accommodated this with some hacky workarounds involving inner_hits. However, we are retiring inner_hits in favor of a more streamlined solution: highlighting.
Highlighting is a well-known lexical search technique one can apply to text fields. Since semantic_text is a member of the text field family, it only makes sense to adapt the technique for it. To this end, we have added a semantic highlighter that you can use to retrieve the chunks that are most relevant to your query:
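The sketch below builds a search request that pairs a semantic query with the semantic highlighter, then pulls the highlighted chunks out of a response. The helper names, field name, query text, and fragment count are placeholders, and the sample response is hand-written to mimic the usual highlight section of a search hit rather than taken from a real cluster.

```python
import json


def semantic_highlight_search(field: str, query_text: str, fragments: int = 2) -> dict:
    """Search body: semantic query plus the semantic highlighter on one field."""
    return {
        "query": {"semantic": {"field": field, "query": query_text}},
        "highlight": {
            "fields": {
                field: {"type": "semantic", "number_of_fragments": fragments}
            }
        },
    }


def highlighted_chunks(response: dict, field: str) -> list:
    """Collect the highlighted chunks for a field from each hit."""
    return [
        hit.get("highlight", {}).get(field, [])
        for hit in response["hits"]["hits"]
    ]


search_body = semantic_highlight_search("infer_field", "which droids?")
print(json.dumps(search_body, indent=2))

# Hand-written response fragment, for illustration only.
sample_response = {
    "hits": {
        "hits": [
            {"_id": "1", "highlight": {"infer_field": ["chunk A", "chunk B"]}},
            {"_id": "2"},
        ]
    }
}
print(highlighted_chunks(sample_response, "infer_field"))
```

The highlighted chunks can then be passed to a RAG prompt in place of the full field value.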
See the semantic_text documentation for more information about how to use highlighting.
Ready for primetime
With the _source representation change in place, we are now officially announcing that semantic_text is a generally available (GA) feature 🎉! This means that we are committed to not making any more breaking changes to the feature and supporting it in production environments. As a customer, you should feel comfortable integrating semantic_text into your production workflows knowing that Elastic is committed to supporting you and providing long-term continuity.
Migrating from beta
To enable an orderly migration from the beta implementation, all indices with semantic_text fields created in Elasticsearch 8.15 to 8.17 or created in Serverless prior to January 30th will continue to operate as they do today. That is to say, they will continue to use the beta _source representation. We recommend migrating to the GA _source representation at your earliest convenience. You can do so by reindexing into a new index:
Note the use of the script param to account for the _source representation change. The script is taking the value from the text subfield and assigning it directly to the semantic_text field value.
Try it out yourself
These changes will be available in stack-hosted Elasticsearch 8.18+, but if you want to try them today, they are already available in Serverless. They also pair well with semantic search simplifications we are rolling out at the same time. Use both to take semantic search to the next level!
Try out vector search for yourself using this self-paced hands-on learning for Search AI. You can start a free cloud trial or try Elastic on your local machine now.