Peter Titov

The DNA of DATA: Increasing Efficiency with the Elastic Common Schema

The Elastic Common Schema (ECS) helps unify the semantics of log fields. Learn how to quantify the benefits of normalized data, not just for infrastructure efficiency but also for data fidelity.


The Elastic Common Schema (ECS) is a fantastic way to simplify and unify a search experience. By aligning disparate data sources into a common language, users face a lower bar when interpreting events of interest, resolving incidents, or hunting for unknown threats. However, there are also underlying infrastructure reasons to justify adopting the Elastic Common Schema.

In this blog you will learn about the quantifiable operational benefits of ECS, how to leverage ECS with any data ingest tool, and the pitfalls to avoid. The data source used in this blog is a 3.3GB Nginx log file obtained from Kaggle. The dataset is represented in three categories: raw, self, and ECS. Raw has zero normalization, self demonstrates commonly implemented mistakes I have observed over 5+ years of working with various users, and ECS takes the optimal approach to data hygiene.

This hygiene is achieved through the parsing, enrichment, and mapping of ingested data, akin to sequencing DNA in order to express genetic traits. By understanding the data's structure and assigning the correct mappings, a more thorough expression of the data may be represented, stored, and searched upon.

If you would like to learn more about ECS, the dataset used in this blog, or available Elastic integrations, please be sure to check out these related links:

Dataset Validation

Before we begin, let us review how many documents exist and what we're required to ingest. Our Nginx log file contains 10,365,152 documents/events, so we expect 10,365,152 documents in our targeted end-state.
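Once ingestion is complete, the indexed document counts can be cross-checked against the source file (for example, with wc -l on access.log). A minimal sketch from Kibana Dev Tools using the count API, assuming the three index names used throughout this post:

    # Each request should report "count": 10365152 once ingestion has finished
    GET nginx-raw/_count
    GET nginx-self/_count
    GET nginx-ecs/_count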

Dataset Ingestion: Raw & Self

To achieve the raw and self ingestion techniques, this example leverages Logstash for simplicity. For the raw data ingest, a simple file input is used with no additional modifications and no index template.


    input {
      file {
        id => "NGINX_FILE_INPUT"
        path => "/etc/logstash/raw/access.log"
        ecs_compatibility => disabled
        start_position => "beginning"
        mode => read
      }
    }
    filter {
    }
    output {
      elasticsearch {
        hosts => ["https://mycluster.es.us-east4.gcp.elastic-cloud.com:9243"]
        index => "nginx-raw"
        ilm_enabled => true
        manage_template => false
        user => "username"
        password => "password"
        ssl_verification_mode => none
        ecs_compatibility => disabled
        id => "NGINX-FILE_ES_Output"
      }
    }

For the self ingest, a custom Logstash pipeline with a simple Grok filter was created with no index template applied:

    input {
      file {
        id => "NGINX_FILE_INPUT"
        path => "/etc/logstash/self/access.log"
        ecs_compatibility => disabled
        start_position => "beginning"
        mode => read
      }
    }
    filter {
      grok {
        match => { "message" => "%{IP:clientip} - (?:%{NOTSPACE:requestClient}|-) \[%{HTTPDATE:timestamp}\] \"(?:%{WORD:requestMethod} %{NOTSPACE:request}(?: HTTP/%{NUMBER:httpversion})?|%{DATA:rawrequest})\" (?:-|%{NUMBER:response}) (?:-|%{NUMBER:bytes_in}) (-|%{QS:bytes_out}) %{QS:user_agent}" }
      }
    }
    output {
      elasticsearch {
        hosts => ["https://myscluster.es.us-east4.gcp.elastic-cloud.com:9243"]
        index => "nginx-self"
        ilm_enabled => true
        manage_template => false
        user => "username"
        password => "password"
        ssl_verification_mode => none
        ecs_compatibility => disabled
        id => "NGINX-FILE_ES_Output"
      }
    }

Dataset Ingestion: ECS

Elastic includes many available integrations which contain everything you need to ensure that your data is ingested as efficiently as possible.

For our use case of Nginx, we'll be using the associated integration's assets only.

The assets that are installed are more than just dashboards: there are ingest pipelines which not only normalize but also enrich the data, and component templates which map the fields to their correct types. All we have to do is make sure that, as the data comes in, it traverses the ingest pipeline and uses these supplied mappings.
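To see that normalization and enrichment in action before indexing anything, the pipeline can be exercised with the simulate API. This is a sketch only: the pipeline ID shown is a placeholder, as the actual name installed by the Nginx integration depends on the integration version and can be found under Stack Management → Ingest Pipelines.

    # Simulate the integration's ingest pipeline against a single raw log line.
    # "logs-nginx.access-1.0.0" is a placeholder pipeline ID - substitute the
    # one installed in your cluster.
    POST _ingest/pipeline/logs-nginx.access-1.0.0/_simulate
    {
      "docs": [
        {
          "_source": {
            "message": "192.168.1.1 - - [12/Dec/2023:10:15:32 +0000] \"GET /index.html HTTP/1.1\" 200 512 \"-\" \"Mozilla/5.0\""
          }
        }
      ]
    }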

Create your index template, and select the supplied component templates from your integration.

Think of the component templates as building blocks of an index template. They allow core settings to be reused, ensuring standardization is adopted across your data.
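If you prefer the API over the Kibana UI, an equivalent index template can be created as sketched below. The component template names here are illustrative; the exact names installed by the Nginx integration vary by version (check Stack Management → Index Management → Component Templates). The integration-supplied component template also typically sets index.default_pipeline, which is what routes incoming documents through the ingest pipeline.

    # Index template for our custom "nginx-ecs" index, composed of the
    # integration-supplied component templates (names are placeholders).
    PUT _index_template/nginx-ecs
    {
      "index_patterns": ["nginx-ecs*"],
      "composed_of": [
        "logs-nginx.access@package",
        "logs-nginx.access@custom"
      ]
    }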

For our ingestion method, we merely point to the index name that we specified during the index template creation, in this case nginx-ecs, and Elastic will handle all the rest!

    input {
      file {
        id => "NGINX_FILE_INPUT"
        path => "/etc/logstash/ecs/access.log"
        #ecs_compatibility => disabled
        start_position => "beginning"
        mode => read
      }
    }
    filter {
    }
    output {
      elasticsearch {
        hosts => ["https://mycluster.es.us-east4.gcp.elastic-cloud.com:9243"]
        index => "nginx-ecs"
        ilm_enabled => true
        manage_template => false
        user => "username"
        password => "password"
        ssl_verification_mode => none
        ecs_compatibility => disabled
        id => "NGINX-FILE_ES_Output"
      }
    }
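Once documents start arriving, a quick sanity check confirms that they actually traversed the ingest pipeline. Here is a sketch that looks for one of the enriched fields, assuming the pipeline's geoip processor populates source.geo.* fields, as the Nginx integration typically does:

    # Return a single enriched document; zero hits would suggest the
    # documents bypassed the ingest pipeline.
    GET nginx-ecs/_search
    {
      "size": 1,
      "query": {
        "exists": { "field": "source.geo.country_iso_code" }
      }
    }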

Data Fidelity Comparison

Let's compare how many fields are available to search across the three indices, as well as the quality of the data. Our raw index has but 15 fields to search upon, most of which are multi-field duplicates created for aggregation purposes.

However, from a Discover perspective, we are limited to just 6 fields!
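To reproduce these field counts outside of Kibana, the field capabilities API reports every searchable field per index. A minimal sketch against the three indices used in this post:

    # Lists every searchable and aggregatable field in each index
    GET nginx-raw/_field_caps?fields=*
    GET nginx-self/_field_caps?fields=*
    GET nginx-ecs/_field_caps?fields=*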

Our self-parsed index has 37 available fields; however, these too are duplicated and not ideal for efficient searching.

From a Discover perspective, here we have almost 3x as many fields to choose from, yet without the correct mappings the ease with which this data may be searched is less than ideal. A great example of this is attempting to calculate the average of bytes_in when it is mapped as a text field.
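As a sketch of the problem (assuming bytes_in was dynamically mapped as text in the self-parsed index), the following aggregation is rejected because averages require a numeric mapping:

    # Fails on the self-parsed index: avg cannot run against a text field
    GET nginx-self/_search
    {
      "size": 0,
      "aggs": {
        "avg_bytes_in": {
          "avg": { "field": "bytes_in" }
        }
      }
    }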

Finally, with our ECS index we have 71 fields available to us! Notice that, courtesy of the ingest pipeline, we have enriched geographic fields as well as event categorization fields.

Now what about Discover? There are 51 fields directly available to us for searching purposes.

Using Discover as our basis, our self-parsed index offers roughly 283% of the raw index's searchable fields, whereas our ECS index offers 850% (51 fields versus 6)!

Storage Utilization Comparison

Surely, with all these extra fields, our ECS index must be dramatically larger than the self-normalized index, let alone the raw index? The results may surprise you.

Accounting for one replica of our 3.3GB dataset, we can see that normalized and correctly mapped data has a significant impact on the amount of storage required.
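The storage footprint of the three indices can be compared directly with the cat indices API, for example:

    # store.size includes replicas; pri.store.size is the primary copy only
    GET _cat/indices/nginx-*?v&h=index,docs.count,pri.store.size,store.size&s=index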

Conclusion

While any enriched dataset requires some additional storage, Elastic provides easy solutions to maximize the fidelity of the data to be searched while simultaneously ensuring operational storage efficiency; that is the power of the Elastic Common Schema.

Let's review how we were able to maximize search while minimizing storage:

  • Installing the integration assets for the dataset we are going to ingest.
  • Customizing the index template to leverage the included component templates, ensuring mapping and parsing are aligned to the Elastic Common Schema.

Ready to get started? Sign up for Elastic Cloud and try out the features and capabilities I've outlined above to get the most value and visibility out of your data.
