This is a cache of https://www.elastic.co/search-labs/blog/efficient-bitwise-matching-in-elasticsearch. It is a snapshot of the page at 2024-10-24T00:28:07.987+0000.
Efficient bitwise matching in Elasticsearch - Search Labs

Efficient bitwise matching in Elasticsearch

Explore six approaches to encoding and matching binary data in Elasticsearch, including terms encoding (my preferred approach), boolean encoding, sparse bit position encoding, integer encoding with exact matching, integer encoding with scripted bitwise matching, and integer encoding with ESQL for bitwise matching, with practical examples and use cases.

Introduction

Binary encoding is an important technique in modern applications, especially in areas like IoT device monitoring, where vast amounts of binary sensor data or operational flags are continuously processed. Efficiently managing and searching through such data is essential for real-time analytics and decision-making. To achieve this, bitwise matching is a powerful tool for filtering based on binary values, allowing for precise data extraction. With the right data modeling, Elasticsearch not only supports bitwise matching but can do so with high performance.

At the time of writing this article, Elasticsearch doesn’t have a native operator for bitwise matching, and lucene doesn’t directly support bitwise matching. To overcome this limitation, this article presents six approaches for binary encoding and bitwise matching in Elasticsearch: terms encoding (my preferred approach), boolean encoding, sparse bit position encoding, integer encoding with exact matching, integer encoding with scripted bitwise matching, and integer encoding with ESQL for bitwise matching.

Terms encoding

Using terms for binary representation leverages Elasticsearch’s optimized term-based queries. This method involves representing each bit as a term which is stored in a keyword field.

Advantages of Terms Encoding

The terms-based approach allows for Elasticsearch to make use of optimized data structures, resulting in efficient querying even for large datasets. Additionally, this method scales well in situations where different combinations of bits need to be queried frequently, as each bit is treated as an independent entity that can be indexed and searched.

Disadvantages of Terms Encoding

This method requires that data is pre-processed to get it into the terms encoding format before storing it in Elasticsearch. Additionally bitwise queries would have to be constructed as a series of terms matches as demonstrated below. Furthermore, because each binary sequence is represented by multiple terms, this will likely occupy more storage than an integer representation would require.

Setting up the environment for terms encoding

Define the mapping for keyword representation:

PUT test_terms_encoding
{
 "mappings": {
   "properties": {
     "terms_encoded_bits": {
       "type": "keyword"
     }
   }
 }
}

Indexing documents with terms encoding

Index documents with their binary bit representations:

POST test_terms_encoding/_doc/1
{
 "terms_encoded_bits": ["b3=0", "b2=1", "b1=1", "b0=0"] // binary 0110
}
POST test_terms_encoding/_doc/2
{
 "terms_encoded_bits": ["b3=1", "b2=0", "b1=1", "b0=0"] // binary 1010
}

Querying using terms encoding

To query documents where b3 is true, and b0 is false (i.e the doc where _id=2 above)

GET test_terms_encoding/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "terms_encoded_bits": "b3=1"
          }
        },
        {
          "term": {
            "terms_encoded_bits": "b0=0"
          }
        }
      ]
    }
  }
}

Boolean Encoding

This method uses separate boolean fields for each bit, providing clear and direct access to specific bits.

Advantages of Boolean Encoding

The Boolean Encoding method has all of the advantages of the “Terms Encoding” method and some may find this method to be more intuitive. It may also require slightly less storage for some data sets, as just a single boolean value is stored for each field rather than a string.

Disadvantages of Boolean Encoding

This has all of the disadvantages listed for the “Terms Encoding” approach. An additional disadvantage of this approach is that because we create a new field for each bit, this will contribute to mapping explosion.

Setting up the environment for boolean encoding

Define the mapping with boolean fields:

PUT test_boolean_encoding
{
  "mappings": {
    "properties": {
      "b3": { "type": "boolean" },
      "b2": { "type": "boolean" },
      "b1": { "type": "boolean" },
      "b0": { "type": "boolean" }
    }
  }
}

Indexing documents with boolean encoding

Index documents with boolean values for each bit:

POST test_boolean_encoding/_doc/1
{
 "b3": false, 
 "b2": true,
 "b1": true,
 "b0": false
} // binary 0110 – integer 6
POST test_boolean_encoding/_doc/2
{
 "b3": true,
 "b2": false,
 "b1": true,
 "b0": false
} // binary 1010 – integer 10

Querying using boolean encoding

To query documents where b3 is true, and b0 is false (i.e the doc where _id=2 above):

GET test_boolean_encoding/_search
{
 "query": {
   "bool": {
     "filter": [
       {
         "term": {
           "b3": true
         }
       },
       {
         "term": {
           "b0": false
         }
       }
     ]
   }
 }
}

Sparse Bit Position Encoding

This approach encodes only the positions of bits that are set to true in an array.

Advantages of Sparse Bit Position Encoding

If the vast majority of documents generally do not have any bits set to true, this could be more efficient in terms of storage versus the previous approaches. For example, you could imagine data sets where the bits represent warning flags should rarely be true, which would result in the bit position encoding arrays being empty (and therefore space-efficient) for a large percentage of documents.

Disadvantages of Sparse Bit Position Encoding

One drawback of this approach is that querying for bits that are not true would require using the must_not operator. However, using must_not for queries can lead to performance overhead since Elasticsearch needs to scan documents to verify the absence of certain values, which is less efficient than directly querying for the presence of specific terms. In large datasets, this could slow down processing. Another drawback is that if the data consistently has many bits set to true, then this would require a long series of integers for representation, which would increase the storage requirements. And finally, like the other approaches, this approach requires pre-processing your data to convert it to the sparse bit position encoding, prior storing it in Elasticsearch.

Setting up the environment for sparse bit position encoding

Define the mapping for arrays of integers:

PUT test_sparse_encoding
{
  "mappings": {
    "properties": {
      "sparse_bit_positions": {
        "type": "integer"
      }
    }
  }
}

Indexing documents with sparse bit position encoding

Index documents with the positions of the bits encoded as integers:

POST test_sparse_encoding/_doc/1
{
  "sparse_bit_positions": [2, 1] // binary 0110
}
POST test_sparse_encoding/_doc/2
{
  "sparse_bit_positions": [3, 1] // binary 1010
}

Querying using sparse bit position encoding

To query documents where b3 is true, and b0 is false (i.e the doc where _id=2 above):

GET test_sparse_encoding/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "sparse_bit_positions": 3
          }
        }
      ],
      "must_not": [
        {
          "term": {
            "sparse_bit_positions": 0
          }
        }
      ]
    }
  }
}

Integer Encoding with Exact Matching

In this approach, binary values are encoded into integers. This is an intuitive approach, especially when you need to store and query complete binary sequences (i.e. integers) efficiently.

Advantages of Integer Encoding with Exact Matching

Of the approaches discussed so far, this is the approach that is most likely to directly map to the way that data is stored in the source systems, which often represent binary sequences as integers. Therefore, storing documents with this approach may require less preprocessing than the other approaches.

Disadvantages of Integer Encoding with Exact Matching

This approach only discusses exact matching of integer values which represent binary sequences. It does not address bitwise matching within the integers. It also requires that you convert your binary values to integers before storing them in Elasticsearch.

Setting up the environment for integer encoding

Define the mapping for integer encoding:

PUT test_integer_encoding
{
 "mappings": {
   "properties": {
     "integer_representation": {
       "type": "integer"
     }
   }
 }
}

Indexing documents with integer encoding

Index documents by converting binary sequences into integers:

POST test_integer_encoding/_doc/1
{
 "integer_representation": 6  // binary 0110
}
POST test_integer_encoding/_doc/2
{
 "integer_representation": 10 // binary 1010
}

Querying using integer encoding

The following query retrieves documents where the binary representation equals a specific value – in this example, the doc which contains an integer representation of 0110 with _id=1:

GET test_integer_encoding/_search
{
 "query": {
   "term": {
     "integer_representation": 6 // binary 0110
   }
 }
}

Integer Encoding with Scripted Bitwise Matching

In this approach, we extend the concept of encoding binary values into integers and make use of scripted query functionality to query individual bits that are set within an integer value. This approach is discussed in this article for completeness, but keep in mind that extensive use of scripted queries will generate additional workload on the cluster, and may be slower and less efficient than alternative approaches.

Advantages of Integer Encoding with Scripted Bitwise Matching

This approach has the advantages of the “Integer Encoding with Exact Matching” approach. An additional advantage is that individual bits can be matched.

Disadvantages of Integer Encoding with Scripted Bitwise Matching

This approach to bitwise matching does not leverage the data structures that Elasticsearch builds to ensure fast and efficient queries. Therefore this approach is likely to result in slower queries that require more resources than the previously mentioned approaches. For this reason, I would generally recommend the previously discussed approaches.

Setting up and Indexing Documents

In this section we will use the same index that was populated in the second called “Integer Encoding with Exact Matching”.

Querying

To query documents where b3 is true, and b0 is false (i.e the doc where _id=2 above) we can use a scripted query. The following query satisfies these requirements, and is commented to explain the logic:

GET test_integer_encoding/_search
{
  "query": {
    "bool": {
      "filter": {
        "script": {
          "script": {
            "source": """
            // b3 is true 
            // i.e. "integer_representation" AND 1000 == 1000
            // Keep in mind that 1000 binary == 8 in integer
            ((doc['integer_representation'].value & 8) == 8) && 
            
            // b0 false 
            // i.e. "integer_representation" AND 0001 == 0
            ((doc['integer_representation'].value & 1) == 0)"""   
          }
        }
      }
    }
  }
}

Integer Encoding with ES|QL for Bitwise Matching

Like the “Integer Encoding with Scripted Bitwise Matching” approach, this approach can also match individual bits, but it leverages ES|QL rather than scripted queries. This approach is discussed in this article for completeness, but extensive use of this approach may generate additional workload on the cluster, and may be slower and less efficient than alternative approaches.

Advantages of Integer Encoding with ES|QL for Bitwise Matching

This approach has the same advantages as “Integer encoding with scripted bitwise matching”. An additional advantage it that it leverages ES|QL which is designed with performance in mind.

Disadvantages of Integer Encoding with ES|QL for Bitwise Matching

Even though this approach leverages ES|QL, it is unable to directly use pre-built data structures for bitwise matching. Therefore this approach is likely to result in slower queries that require more resources than many of the other approaches.

Setting up and Indexing Documents

In this section we will use the same index that was populated in the second called “Integer Encoding with Exact Matching”.

Querying

To query documents where b3 is true, and b0 is false (i.e the doc where _id=2 above) we can use ES|QL. The following query satisfies these requirements, and is commented to explain the logic:

POST /_query?format=txt
{
  "query": """
  FROM test_integer_encoding METADATA _index, _id
  // The following uses division to shift the bit we are interested 
  // in, to the right. ie. dividing by 8 (or b1000 - corresponding to b3) 
  // shifts b3 to the least significant bit position, or b0. 
  // Additionally, the rightmost bits are dropped.
  // Then the modulus by 2 checks if the new (post shift) 
  // value of b0 (formerly b3)is odd or even (1 or 0)
  | WHERE (integer_representation / 8 % 2 == 1)  // b3 is true
  | WHERE (integer_representation / 1 % 2 == 0)  // b0 is false
  | KEEP _id, integer_representation
  """
}

Conclusion

In this article, we explored six approaches for bitwise matching — terms encoding (my preferred approach), boolean encoding, sparse bit position encoding, integer encoding with exact matching, integer encoding with scripted bitwise matching, and integer encoding with ES|QL for bitwise matching. This demonstrated how different methods can be applied to efficiently handle bitwise matching in Elasticsearch. Each method has its advantages and trade-offs depending on the specific requirements of your application.

For scenarios requiring individual bit matching, the term-based and boolean field approaches work well and should be efficient. Representing true bit positions in arrays of integers provides a compact and flexible solution for sparse bit sequences. Encoding binary sequences as integers might be suitable for whole sequence operations but at the cost of losing the ability to efficiently query individual bits. We can also query individual bits within integers with scripted queries or with ES|QL, but these approaches are likely to be less efficient than the other approaches.

Acknowledgements

Thanks to Honza Kral for the idea of using terms and integers to encode individual bits, to Alexis Charveriat for suggesting the bitwise matching approach using ES|QL, and to Carly Richmod for making valuable suggestions during the technical review process.

Ready to try this out on your own? Start a free trial.

Want to get Elastic certified? Find out when the next Elasticsearch Engineer training is running!

Ready to build state of the art search experiences?

Sufficiently advanced search isn’t achieved with the efforts of one. Elasticsearch is powered by data scientists, ML ops, engineers, and many more who are just as passionate about search as your are. Let’s connect and work together to build the magical search experience that will get you the results you want.

Try it yourself