Check out the different ways to ingest data into Elasticsearch and dive into practical examples to try something new.
Elasticsearch’s flexibility is great for building customized search solutions. But that same flexibility means there’s plenty of room to make mistakes if you’re not careful. Whether you’re starting fresh or tuning an existing setup, implementing smart strategies early on can save you a lot of time and trouble later.
In this article, we’ll cover 3 key tips: data pre-processing (or "massaging"), data enrichment, and choosing the right field types to help you instantly boost your search system’s performance and avoid common pitfalls.
Practical use case setup
To see these tips in action, let27;s consider a common use case: a social media analytics platform. In this case, each document contains data from a post. The data definition is below.
Index mapping: This mapping contains all the field definitions needed to follow the tips presented in this post.
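A minimal sketch of such a mapping (the posts index name and field names like followers and total_engagements are illustrative choices for this example):

```json
PUT posts
{
  "mappings": {
    "properties": {
      "user_id": { "type": "keyword" },
      "content": { "type": "text" },
      "hashtags": { "type": "keyword" },
      "likes": {
        "type": "integer",
        "fields": {
          "ranking": { "type": "rank_feature" }
        }
      },
      "comments": { "type": "integer" },
      "shares": { "type": "integer" },
      "followers": { "type": "integer" },
      "follower_tier": { "type": "keyword" },
      "total_engagements": { "type": "integer" }
    }
  }
}
```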
Sample document: This is what the document originally looked like.
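Something like this, with illustrative values:

```json
POST posts/_doc/1
{
  "user_id": "u123",
  "content": "Sharing some ingest tips for Elasticsearch! #elastic #ingest_tips",
  "hashtags": "#elastic #ingest_tips",
  "likes": 250,
  "comments": 35,
  "shares": 12,
  "followers": 45000
}
```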
The tips
Massage fields
“Massaging” fields means pre-processing data to prepare it for search. The reasons for doing this are:
- Additional capabilities (e.g., searching arrays)
- Performance improvements (e.g., precomputing fields)
Convert string lists into actual arrays
Converting string lists into actual arrays allows us to filter or aggregate documents correctly; for example, suppose we have a document with this keyword field:
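For instance (illustrative values):

```json
{
  "hashtags": "#elastic #ingest_tips"
}
```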
With this type of field, we can’t filter each hashtag separately; we have to match the whole string exactly. Therefore, this query won’t bring back the document:
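A query along these lines:

```json
GET posts/_search
{
  "query": {
    "term": {
      "hashtags": "#ingest_tips"
    }
  }
}
```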
On the other hand, if we split the hashtags to get the field into this form:
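That is, an array of individual terms:

```json
{
  "hashtags": ["#elastic", "#ingest_tips"]
}
```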
The query actually brings back the document because there is an exact term, #ingest_tips, that matches!
We can define our split processor in an ingest pipeline like this:
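A sketch, splitting on spaces (the pipeline name split_hashtags is our choice):

```json
PUT _ingest/pipeline/split_hashtags
{
  "processors": [
    {
      "split": {
        "field": "hashtags",
        "target_field": "hashtags",
        "separator": " ",
        "ignore_missing": true
      }
    }
  ]
}
```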
Notice that we can define different fields for the source (the field parameter) and the result (the target_field parameter). We can also split the string on any character (or pattern).
To use this pipeline, we can set it as the default pipeline of our index. This way, every document that is indexed goes through the pipeline, and the data is ready to use immediately.
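For example:

```json
PUT posts/_settings
{
  "index.default_pipeline": "split_hashtags"
}
```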
Other options to run the pipeline include an _update_by_query request or a reindex operation specifying the pipeline to use.
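For instance, to apply the pipeline to documents already in the index (assuming the split_hashtags pipeline above):

```json
POST posts/_update_by_query?pipeline=split_hashtags
```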
Precompute fields
When analyzing a dataset to calculate metrics, some common operations will appear repeatedly. For example, to calculate the total engagements, we have to add likes + comments + shares. We can do this every time we need to query this information (every time we search) or only once when indexing a document (pre-computing). The latter approach significantly improves performance by reducing the query time.
To do this, we can define a script processor that executes the operation and sets the total value as a part of the ingestion process. That way, we end up with a document that looks like this:
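Something like this (250 + 35 + 12 = 297):

```json
{
  "likes": 250,
  "comments": 35,
  "shares": 12,
  "total_engagements": 297
}
```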
We can again define an ingest pipeline like this:
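A sketch (the pipeline name and the total_engagements field name are our choices):

```json
PUT _ingest/pipeline/compute_total_engagements
{
  "processors": [
    {
      "script": {
        "if": "ctx.likes != null && ctx.comments != null && ctx.shares != null",
        "source": "ctx.total_engagements = ctx.likes + ctx.comments + ctx.shares"
      }
    }
  ]
}
```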
We also add an if parameter to make sure all relevant fields exist.
Precompute ranges as single fields
In order to accelerate search, we can compute category fields based on our current data. In our example, we will create a follower_tier field to classify the post’s creator based on their number of followers.
Before computing our new category field, let’s take a look at the query. We want to obtain posts from medium-sized creators, which we will define as having between 10,001 and 100,000 followers.
For this, we can use a range query like the one below. But we have to keep this definition in mind every time we apply the filter, and a range query is slower than a term filter.
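A sketch of that query:

```json
GET posts/_search
{
  "query": {
    "range": {
      "followers": {
        "gte": 10001,
        "lte": 100000
      }
    }
  }
}
```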
Now, let’s define 3 follower tiers in a script processor:
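A sketch; the tier names (small, medium, large) and the pipeline name are our illustrative choices:

```json
PUT _ingest/pipeline/compute_follower_tier
{
  "processors": [
    {
      "script": {
        "if": "ctx.followers != null",
        "source": """
          if (ctx.followers <= 10000) {
            ctx.follower_tier = "small";
          } else if (ctx.followers <= 100000) {
            ctx.follower_tier = "medium";
          } else {
            ctx.follower_tier = "large";
          }
        """
      }
    }
  ]
}
```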
Now our documents have a new keyword field called follower_tier with the precomputed category:
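For example:

```json
{
  "followers": 45000,
  "follower_tier": "medium"
}
```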
And we can have a much faster and easier-to-use term query to filter these creators:
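For example:

```json
GET posts/_search
{
  "query": {
    "term": {
      "follower_tier": "medium"
    }
  }
}
```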
Term queries are generally faster than range queries because they involve direct lookups. By converting ranges into single fields, we can leverage this speed advantage.
Enrich data
Data enrichment means using external sources to expand the data in an index, adding context and depth to the indexed documents.
Enrich pipeline
An enrich pipeline enhances documents in one index by using data from another index during ingestion. This simplifies data management by centralizing additional information, such as enriching various indices from a dedicated source, and allows for consistent queries on the enriched data.
For our example, we will enrich a post with demographic data of the creator from a different index.
1. Create a user_demographics index with data like:
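For example (the demographic fields shown here are illustrative):

```json
PUT user_demographics/_doc/1
{
  "user_id": "u123",
  "age_range": "25-34",
  "country": "US",
  "gender": "female"
}
```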
2. Create and execute an enrich policy: The enrich policy defines how the data relates to the documents we want to enrich.
For this example, we are matching the user_id of the post index with the demographics index to enrich incoming documents with all the other fields.
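A sketch of such a policy (the policy name user_demographics_policy is our choice):

```json
PUT _enrich/policy/user_demographics_policy
{
  "match": {
    "indices": "user_demographics",
    "match_field": "user_id",
    "enrich_fields": ["age_range", "country", "gender"]
  }
}
```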
To execute the policy, we run:
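The _execute endpoint builds the internal enrich index that the processor reads from:

```json
POST _enrich/policy/user_demographics_policy/_execute
```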
Now, we’ll create an ingest pipeline that uses this policy:
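A sketch (the pipeline name and target_field are our choices):

```json
PUT _ingest/pipeline/enrich_demographics
{
  "processors": [
    {
      "enrich": {
        "policy_name": "user_demographics_policy",
        "field": "user_id",
        "target_field": "demographics"
      }
    }
  ]
}
```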
3. Test the new policy. We can see that any incoming post document with a matching user_id will be enriched with demographic data.
The original document:
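For example:

```json
{
  "user_id": "u123",
  "content": "Sharing some ingest tips for Elasticsearch! #elastic #ingest_tips"
}
```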
The enriched result:
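Something like this; the enrich processor copies the matched enrich fields (including the match field) under target_field:

```json
{
  "user_id": "u123",
  "content": "Sharing some ingest tips for Elasticsearch! #elastic #ingest_tips",
  "demographics": {
    "user_id": "u123",
    "age_range": "25-34",
    "country": "US",
    "gender": "female"
  }
}
```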
Inference pipeline
An inference pipeline allows us to use Machine Learning (ML) models deployed on ML nodes inside an Elasticsearch cluster to generate inferred data from our documents. For our example, we will use the lang_ident_model_1 model, which is included out-of-the-box in Elasticsearch and is used to identify the language of a text.
1. Create an ingest pipeline with an inference processor to use the model:
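A sketch; detect_language is our pipeline name, post_language is where we choose to store the result, and field_map maps our content field to the text field the model expects:

```json
PUT _ingest/pipeline/detect_language
{
  "processors": [
    {
      "inference": {
        "model_id": "lang_ident_model_1",
        "target_field": "post_language",
        "field_map": {
          "content": "text"
        }
      }
    }
  ]
}
```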
Notice how we define the target field (where the result will be stored) and the source field (content) in the field map.
2. Apply the pipeline to our data.
Original document:
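For example:

```json
{
  "content": "Elasticsearch ingest pipelines make pre-processing data easy!"
}
```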
See the result with the inferred field:
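Trimmed to the relevant field, the result looks something like:

```json
{
  "content": "Elasticsearch ingest pipelines make pre-processing data easy!",
  "post_language": {
    "predicted_value": "en"
  }
}
```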
Our result (en) is stored in post_language.predicted_value.
Let’s try with another example!
The original document:
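For example (illustrative post):

```json
{
  "content": "¡Los pipelines de ingesta de Elasticsearch facilitan el preprocesamiento de datos!"
}
```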
The result with the inferred field:
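Again trimmed to the relevant field:

```json
{
  "content": "¡Los pipelines de ingesta de Elasticsearch facilitan el preprocesamiento de datos!",
  "post_language": {
    "predicted_value": "es"
  }
}
```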
Here, the model correctly classifies the text as Spanish!
In general, this feature is worth exploring because it enables very interesting use cases such as:
- LLM document summaries using the completion task type
- Sparse vector queries (with Elastic’s ELSER inference endpoint)
- Named Entity Recognition (NER)
- Text classification (used for sentiment analysis, for example)
- Embeddings generation (used for semantic search)
If you need to use a different model, check out Elastic’s Eland; it plays very nicely with HuggingFace!
Bonus tip: Remember, you can define multiple processors in the same ingest pipeline, and they do not necessarily have to be related. So all the examples we’ve created here can be combined into a single pipeline!
Pick the right field type
Choosing the right field type is often overlooked but can have a significant impact on performance and functionality. Elasticsearch offers more than 40 field data types, each with its own strengths. Many of them are purely performance-oriented, like picking a number type based on the range of the values, while others provide additional encapsulated features. Did you know that:
- You can search by IP ranges and use masks with the ip field type?
- search_as_you_type can help you implement type ahead with no effort?
- percolator helps you build alert systems?
- semantic_text gives you out-of-the-box semantic search?
- rank_features can store numbers and use them as relevance indicators?
And more! You can take a look at the docs.
Rank_feature example
For this article, let’s use the likes of a post to boost it in the search results. To achieve this, we can define a rank_feature type field and run a rank_feature query. This will add a very useful feature to our queries with a much smaller impact on performance than a function score query.
1. Define the correct mapping: Since we potentially want to use the likes count in other queries, we define the rank_feature as a multifield called “ranking.”
Only the relevant mappings are shown here.
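A sketch of the likes field with its ranking multifield:

```json
PUT posts
{
  "mappings": {
    "properties": {
      "likes": {
        "type": "integer",
        "fields": {
          "ranking": { "type": "rank_feature" }
        }
      }
    }
  }
}
```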
2. Index documents with the “likes” field: Remember, our mapping definition already takes care of populating all multifields, so we only have to define the main field: likes.
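For example:

```json
POST posts/_doc/1
{
  "content": "Sharing some ingest tips for Elasticsearch!",
  "likes": 250
}
```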
3. Run a rank_feature query: This usually goes inside a should clause so it affects only the scoring, not the set of matching documents.
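A sketch of such a query (the match clause is illustrative):

```json
GET posts/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "content": "ingest tips" } }
      ],
      "should": [
        { "rank_feature": { "field": "likes.ranking" } }
      ]
    }
  }
}
```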
Here the expectation is that posts with more likes get an extra boost, so we can show more popular posts first, even if the query is a better match for other posts. Note that we can control how much the scoring is affected by the number of likes; we can even affect it negatively if we define boost to be under 1.0:
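For example, a boost below 1.0 reduces how much the likes signal contributes (clause sketch):

```json
{
  "rank_feature": {
    "field": "likes.ranking",
    "boost": 0.5
  }
}
```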
Relevance is subjective! It depends on who your users are, what they are searching for, how they search, what your data is, your business needs, etc. Learn more about relevance.
The idea here is to have more complex queries that also contribute to the overall ranking, and the rank_feature query works over that ranking to yield better results.
Conclusion
In this article, we explored 3 key tips for optimizing Elasticsearch: pre-processing data (massaging fields), enriching data with external sources, and selecting appropriate field types.
These tips apply to different use cases and leave the door open for you to develop your own ingest pipelines, apply modern NLP techniques to create new features, and follow best practices while working with Elasticsearch.
By applying these tips, you can build a more robust and efficient search system, making querying and indexing easier and more maintainable. Try implementing these tips in your own Elasticsearch deployments and see the difference.