Ken Exner

Live logs and prosper: fixing a fundamental flaw in observability

Stop chasing symptoms. Learn how Streams in Elastic Observability fixes the fundamental flaw in observability, using AI to proactively find the 'why' in your logs for faster resolution.


SREs are often overwhelmed by dashboards and alerts that show what and where things are broken, but fail to reveal why. This industry-wide focus on visualizing symptoms forces engineers to manually hunt for answers. The crucial "why" is buried in information-rich logs, but their massive volume and unstructured nature have led the industry to throw them aside or treat them like second-class citizens. As a result, SREs are forced to turn every investigation into a high-stress, time-consuming hunt for clues. We can solve this problem with logs, but unlocking their potential requires us to reimagine how we work with them and improve the overall investigation journey.

Observability, the broken promise

To see why the current model fails, let’s look at the all-too-familiar challenge every SRE dreads: knowing a problem exists but needing to spend valuable time just trying to find where to even start the investigation.

Imagine you get a Slack message from the support team: "a few high-value customers are reporting their payments are failing." You have no shortage of alerts, but most are just flagging symptoms. You don’t know where to start. You decide to check the logs to see if there is anything obvious, starting with the systems that have the high CPU alert.

You spend a few minutes searching and grep-ing through terabytes of logs for affected customer IDs, trying to piece together the problem. Nothing. You worry that you aren’t getting all the logs to reveal the problem, so you turn on more logging in the application. Now you’re knee-deep in data, desperately trying to find patterns, errors, or other "hints" that will give you a clue as to the why.
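
To make the pain concrete, here is a minimal sketch of the kind of throwaway script an SRE might hack together during this hunt; the log paths, customer IDs, and error keywords are all hypothetical.

import glob
import re

# Hypothetical customer IDs reported by the support team
affected_customers = {"cust-4821", "cust-9377"}

# Very rough filter: any line that mentions an affected customer and looks
# error-ish. Scanning terabytes of files this way is exactly the slow part.
error_hint = re.compile(r"ERROR|FAIL|EXCEPTION|TIMEOUT", re.IGNORECASE)

for path in glob.glob("/var/log/payments/**/*.log", recursive=True):
    with open(path, errors="ignore") as f:
        for line in f:
            if error_hint.search(line) and any(c in line for c in affected_customers):
                print(f"{path}: {line.rstrip()}")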

Finally, one of the broader log queries hits on an error code associated with an impacted customer ID. This is the first real clue. You pivot your search to this new error code and, after an hour of digging, you uncover the error message. You've finally found the why, but it was a stressful, manual hunt that took far too much time and impacted dozens more customers.

This incident perfectly illustrates the broken promise of modern observability: the complete failure of the investigation process. Investigations are a manual, reactive process that SREs are forced into every day. At Elastic, we believe metrics, traces, and logs are all essential, but their roles, and the workflow between them, must be fundamentally reimagined for effective investigations.

Observability is about having the clearest understanding possible of the what, where, and why. Metrics are essential for understanding the what. They are the heartbeat of your system, powering the dashboards and alerts that tell you when a threshold has been breached, like high CPU utilization or error rates. But they are aggregates; they show the symptom, rarely the root cause. Traces are good at identifying the where. They map the journey of a request through a distributed system, pinpointing the specific microservice or function where latency spikes or an error originates. Yet, their effectiveness hinges on complete and consistent code instrumentation, a constant dependency on development teams that can leave you with critical visibility gaps. Logs tell you the why. They contain all the rich, contextual, and unfiltered truth of an event. If we can more proactively and efficiently extract information from logs, we can greatly improve our overall understanding of our environments.

Challenges of logs in modern environments

While logs are in the standard toolbox, they have been neglected. SREs using today’s solutions deal with several major problems:

  • First, due to their unstructured nature, it’s very difficult to parse and manage logs so that they’re useful. As a result, many SRE teams spend a lot of time building and maintaining complex pipelines to help manage this process (a sketch of the kind of brittle parsing involved follows this list). 
  • Second, logs can get expensive at high volume, which leads teams to drop them on the floor to control costs, throwing away valuable information in the process. Consequently, when an incident occurs, you waste precious time hunting for the right logs, and manually correlating across services.
  • Finally, nobody has built a log solution that proactively works to find the important signals in logs and to surface those critical whys to you when you need them. As a result, log-based investigations are too painful and slow.
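
To make that first point concrete, here is a minimal sketch of the hand-written parsing logic teams end up maintaining: one regular expression per assumed log layout, where any upstream format change silently breaks extraction. The layout, field names, and example line are all hypothetical.

import re

# One hand-rolled pattern for one assumed log layout:
# "2025-10-27 14:03:11 ERROR payment-svc cust-4821 card declined"
LINE = re.compile(
    r"(?P<ts>\S+ \S+) (?P<level>\w+) (?P<service>\S+) (?P<customer>\S+) (?P<message>.*)"
)

def parse(line: str) -> dict | None:
    # Returns structured fields, or None when the format drifts
    # (a reordered column, an extra field, a new timestamp format...).
    m = LINE.match(line)
    return m.groupdict() if m else None

print(parse("2025-10-27 14:03:11 ERROR payment-svc cust-4821 card declined"))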

Why are we here? As applications became more complex, log volume became unmanageable. Instead of solving this with automation, the industry took a shortcut: it gave up on getting the most out of logs and prioritized more manageable but less informative signals.

This decision is the origin of the broken, reactive model. It forced observability into a manual loop of 'observing' alerts, rather than building automation that could help us truly understand our systems to improve how we root cause and resolve issues. This has transformed SREs from investigators into full-time data wranglers, wrestling with grok patterns and fragile ETL scripts instead of solving outages. 

Introducing Streams to rethink how you use logs for investigations

Streams is an agentic AI solution that simplifies working with logs to help SRE teams rapidly understand the why behind an issue for faster resolution. The combination of Elasticsearch and AI is turning manual management of noisy logs into automated workflows that identify patterns, context, and meaning, marking a fundamental shift in observability.

Log everything in any format

By applying the Elasticsearch platform for context engineering, bringing together retrieval with AI-driven parsing that keeps up with schema changes, we are reimagining the entire log pipeline.

Streams ingests raw logs from all your sources to a single destination. It then uses AI to partition incoming logs into their logical components and parses them to extract relevant fields for an SRE to validate, approve, or modify. Imagine a world where you simply point your logs to a single endpoint, and everything just works: less wrestling with grok patterns, configuring processors, and hunting for the right plugin, all of which significantly reduces complexity. Streams is a big step towards realizing that vision.
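
As a simplified illustration, and not the actual Streams output, the idea is that a raw line like the one below lands at that single endpoint, gets routed to a logical partition, and comes back as structured fields the SRE only has to review; the partition name and field names here are assumptions made for the sake of the example.

raw = '10.4.2.17 - - [27/Oct/2025:14:03:11 +0000] "POST /v1/charge HTTP/1.1" 500 213'

# The kind of structured document an AI-assisted parser might propose for an
# SRE to validate, approve, or modify (field names are illustrative only):
proposed = {
    "stream": "logs.payments.nginx",      # logical partition the line was routed to
    "client.ip": "10.4.2.17",
    "@timestamp": "2025-10-27T14:03:11Z",
    "http.request.method": "POST",
    "url.path": "/v1/charge",
    "http.response.status_code": 500,
    "http.response.body.bytes": 213,
}
print(proposed)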

As a result, SREs are freed from managing complex ingestion pipelines, allowing them to spend less time on data wrangling and more time preventing service disruptions.

Solve incidents faster with Significant Events 

Significant Events, a capability within Streams, uses AI to automatically surface major errors and anomalies, enabling you to be proactive in your investigations. So, instead of just combing through endless noise, you can focus on the events that truly matter, such as startup and shutdown messages, out-of-memory errors, internal server failures, and other significant signals of change. These events act as actionable markers, giving SREs early warning and clear focus to begin an investigation before service impact.
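
The detection logic itself is Streams' own; as a rough mental model only, you can think of it as scoring recurring log patterns and promoting the rare, high-impact ones, something like this deliberately simplified sketch (the patterns, counts, and thresholds are made up).

from collections import Counter

# Hypothetical counts of normalized log patterns seen in the last hour
patterns = Counter({
    "GET /healthz 200": 120_000,
    "payment authorized for customer <ID>": 18_400,
    "internal server error while calling card-gateway": 214,
    "java.lang.OutOfMemoryError: Java heap space in checkout-worker": 37,
})

HIGH_IMPACT = ("outofmemoryerror", "internal server error", "shutting down", "starting up")

def is_significant(pattern: str, count: int, total: int) -> bool:
    # Toy heuristic only: rare patterns that match known high-impact phrases.
    rare = count / total < 0.01
    return rare and any(s in pattern.lower() for s in HIGH_IMPACT)

total = sum(patterns.values())
for pattern, count in patterns.items():
    if is_significant(pattern, count, total):
        print("significant event:", pattern)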

With this new foundation, logs will become your primary signal for investigation. The panicked, manual search for a needle in a digital haystack is about to be over. Significant Events acts like a smart metal detector that sifts through the chaos and only beeps when it finds issues, helping you to easily ignore all that hay and find the "needle" faster. 

Now imagine the same scenario we started with. Instead of starting a frantic, time-consuming grep through terabytes of logs, Streams has already done the heavy lifting. Its AI-driven analysis has detected a new, anomalous pattern that began before your support team even knew about it and automatically surfaced it as a significant event. Rather than you hunting for a clue, the clue finds you.

With a single click, you have the why: a Java out-of-memory error in a specific service component. This is your starting point. You find the root cause in under two minutes and begin remediation. The customer impact is stopped, the dev team gets the specific error, and the problem is contained before it can escalate. In this case, metrics and traces were unhelpful in finding the why. The answer was waiting in the logs all along.

This ideal outcome is possible because you can both afford to keep every log and instantly find the signal within them. Elastic's cost-efficient architecture with powerful compression, searchable snapshots, and data tiering makes full retention a reality. From there, Streams automatically surfaces the significant event, ensuring that the answer is never lost in the noise.
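
As one concrete example of what affordable full retention can look like on the Elastic side, here is a sketch of an index lifecycle policy that rolls log indices from the hot tier to a searchable snapshot in the cold tier; the policy name, snapshot repository, and thresholds are assumptions, not recommendations, and it presumes a running cluster with a snapshot repository already configured.

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster, illustration only

# Hypothetical tiering policy: keep recent log indices hot, move older ones to a
# searchable snapshot in the cold tier, and delete them after a year.
es.ilm.put_lifecycle(
    name="logs-retention",  # assumed policy name
    policy={
        "phases": {
            "hot": {"actions": {"rollover": {"max_primary_shard_size": "50gb", "max_age": "7d"}}},
            "cold": {
                "min_age": "30d",
                "actions": {"searchable_snapshot": {"snapshot_repository": "logs-snapshots"}},
            },
            "delete": {"min_age": "365d", "actions": {"delete": {}}},
        }
    },
)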

Elastic is the only company that provides an AI-driven, log-first approach to elevate your observability signals and make it dramatically faster and easier to get to the why. This is built on our decades of leadership in search, relevance, and powerful analytics, which provide the foundation for understanding logs at a deep, semantic level.

The vision for Streams 

The partitioning, parsing, and Significant Events you see today are just the starting point. The next step in our vision is to use Significant Events to automatically generate critical SRE artifacts. Imagine Streams creating intelligent alerts, on-the-fly investigation dashboards, and even data-driven SLOs based only on the events that actually impact service health. From there, the goal is to use AI to drive automated Root Cause Analysis (RCA) directly from log patterns and generate remediation runbooks, turning a multi-hour hunt into an instant resolution recommendation.

Once this AI-driven log foundation is in place, our vision for Streams expands to become a unified intelligence layer that operates across all your telemetry data. It’s not just about making each signal better in isolation, but about understanding the context and relationships between them to solve complex problems.

For metrics, Streams won’t just alert you to a single metric spike; it will detect a correlated anomaly across multiple, seemingly unrelated metrics (e.g., p99 latency for a specific service, garbage collection time, and transaction success rate).

Similarly, for traces, it identifies when a new, unexpected service call (e.g., a new database or an external API) appears in a critical transaction path after a deployment, or when a specific span is suddenly responsible for a majority of errors across all traces, even if the overall error rate hasn't breached a threshold.

The goal is not to have separate streams for logs, metrics, and traces, but to weave them into a single narrative that automatically correlates all three signals. Ultimately, Streams is about fundamentally shifting investigations from a human-led data-gathering exercise to proactive, AI-driven resolution.

For more on Streams:

Read the Streams launch blog

Look at the Streams website
