David Hope

The next evolution of observability: unifying data with OpenTelemetry and generative AI

Generative AI and machine learning are revolutionizing observability, but siloed data hinders their true potential. This article explores how to break down data silos by unifying logs, metrics, and traces with OpenTelemetry, unlocking the full power of GenAI for natural language investigations, automated root cause analysis, and proactive issue resolution.


The observability industry today stands at a critical juncture. While our applications generate more telemetry data than ever before, this wealth of information typically exists in siloed tools: separate systems for logs, metrics, and traces. Meanwhile, Generative AI is hurtling toward us like an asteroid about to make a tremendous impact on our industry.

As SREs, we've grown accustomed to jumping between dashboards, log aggregators, and trace visualizers when troubleshooting issues. But what if there was a better way? What if AI could analyze all your observability data holistically, answering complex questions in natural language, and identifying root causes automatically?

This is the next evolution of observability. But to harness this power, we need to rethink how we collect, store, and analyze our telemetry data.

The problem: siloed data limits AI effectiveness

Traditional observability setups separate data into distinct types:

  • Metrics: Numeric measurements over time (CPU, memory, request rates)
  • Logs: Detailed event records with timestamps and context
  • Traces: Request journeys through distributed systems
  • Profiles: Code-level execution patterns showing resource consumption and performance bottlenecks at the function/line level

This separation made sense historically due to the way the industry evolved. Different data types have traditionally had different cardinality, structure, access patterns and volume characteristics. However, this approach creates significant challenges for AI-powered analysis:

Metrics (Prometheus) → "CPU spiked at 09:17:00"
Logs (ELK) → "Exception in checkout service at 09:17:32" 
Traces (Jaeger) → "Slow DB queries in order-service at 09:17:28"
Profiles (Pyroscope) → "calculate_discount() is taking 75% of CPU time"

When these data sources live in separate systems, AI tools must either:

  1. Work with an incomplete picture (seeing only metrics but not the related logs)
  2. Rely on complex, brittle integrations that often introduce timing skew
  3. Force developers to manually correlate information across tools

Imagine asking an AI, "Why did checkout latency spike at 09:17?" To answer comprehensively, it needs access to logs (to see the stack trace), traces (to understand the service path), and metrics (to identify resource strain). With siloed tools, the AI either sees only fragments of the story or requires complex ETL jobs that are slower than the incident itself.

Why traditional machine learning (ML) falls short

Traditional machine learning for observability typically focuses on anomaly detection within a single data dimension. It can tell you when metrics deviate from normal patterns, but struggles to provide context or root cause.

ML models trained on metrics alone might flag a latency spike, but can't connect it to a recent deployment (found in logs) or identify that it only affects requests to a specific database endpoint (found in traces). They behave like humans with extreme tunnel vision, seeing only a fraction of the relevant information, and only through whatever opinionated view a specific vendor has chosen to expose.

This limitation becomes particularly problematic in modern microservice architectures where problems frequently cascade across services. Without a unified view, traditional ML can detect symptoms but struggles to identify the underlying cause.

The solution: unified data with enriched logs

The solution is conceptually simple but transformative: unify metrics, logs, and traces into a single data store, ideally with enriched logs that contain all signals about a request in a single JSON document. We're about to see a merging of signals.

Think of traditional logs as simple text lines:

[2025-05-19 09:17:32] ERROR OrderService - Failed to process checkout for user 12345

Now imagine an enriched log that contains not just the error message, but also:

  • The complete distributed trace context
  • Related metrics at that moment
  • System environment details
  • Business context (user ID, cart value, etc.)

This approach creates a holistic view where every signal about the same event sits side-by-side, perfect for AI analysis.
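
For illustration, a single enriched log document in this style might look like the following. The field names are assumptions chosen for this example rather than a prescribed schema:

{
  "@timestamp": "2025-05-19T09:17:32.481Z",
  "log.level": "ERROR",
  "message": "Failed to process checkout for user 12345",
  "service.name": "order-service",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7",
  "user.id": "12345",
  "order.id": "ord-98421",
  "order.amount": 149.99,
  "cpu_load": 7.82,
  "deployment.version": "v2.4.1",
  "host.name": "checkout-node-03"
}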

How generative AI changes things

Generative AI differs fundamentally from traditional ML in its ability to:

  1. Process unstructured data: Understanding free-form log messages and error text
  2. Maintain context: Connecting related events across time and services
  3. Answer natural language queries: Translating human questions into complex data analysis
  4. Generate explanations: Providing reasoning alongside conclusions
  5. Surface hidden patterns: Discovering correlations and anomalies in log data that would be impractical to find through manual analysis or traditional querying

With access to unified observability data, GenAI can analyze complete system behavior patterns and correlate across previously disconnected signals.

For example, when asked "Why is our checkout service slow?" a GenAI model with access to unified data can:

  • Analyze unified enriched logs to identify which specific operations are slow and to find errors or warnings in those components
  • Check attached metrics to understand resource utilization
  • Correlate all these signals with deployment events or configuration changes
  • Present a coherent explanation in natural language with supporting graphs and visualizations

Implementing unified observability with OpenTelemetry

OpenTelemetry provides the perfect foundation for unified observability with its consistent schema across metrics, logs, and traces. Here's how to implement enriched logs in a Java application:

import io.opentelemetry.api.OpenTelemetry;
import io.opentelemetry.api.metrics.Meter;
import io.opentelemetry.api.metrics.DoubleHistogram;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

public class OrderProcessor {
    private static final Logger logger = LoggerFactory.getLogger(OrderProcessor.class);
    private final Tracer tracer;
    private final DoubleHistogram cpuUsageHistogram;
    private final OperatingSystemMXBean osBean;

    public OrderProcessor(OpenTelemetry openTelemetry) {
        this.tracer = openTelemetry.getTracer("order-processor");
        Meter meter = openTelemetry.getMeter("order-processor");
        this.cpuUsageHistogram = meter.histogramBuilder("system.cpu.load")
                                      .setDescription("System CPU load")
                                      .setUnit("1")
                                      .build();
        this.osBean = ManagementFactory.getOperatingSystemMXBean();
    }

    public void processOrder(String orderId, double amount, String userId) {
        Span span = tracer.spanBuilder("processOrder").startSpan();
        try (Scope scope = span.makeCurrent()) {
            // Add attributes to the span
            span.setAttribute("order.id", orderId);
            span.setAttribute("order.amount", amount);
            span.setAttribute("user.id", userId);
            // Populate MDC for structured logging
            MDC.put("trace_id", span.getSpanContext().getTraceId());
            MDC.put("span_id", span.getSpanContext().getSpanId());
            MDC.put("order_id", orderId);
            MDC.put("order_amount", String.valueOf(amount));
            MDC.put("user_id", userId);
            // Record the system CPU load (load average) within the current trace context
            double cpuLoad = osBean.getSystemLoadAverage();
            if (cpuLoad >= 0) {
                cpuUsageHistogram.record(cpuLoad);
                MDC.put("cpu_load", String.valueOf(cpuLoad));
            }
            // Log a structured message
            logger.info("Processing order");
            // Simulate business logic
            // ...
            span.setAttribute("order.status", "completed");
            logger.info("Order processed successfully");
        } catch (Exception e) {
            span.recordException(e);
            span.setAttribute("order.status", "failed");
            logger.error("Order processing failed", e);
        } finally {
            MDC.clear();
            span.end();
        }
    }
}

This code demonstrates how to:

  1. Create a span for the operation
  2. Add business attributes
  3. Add current CPU usage
  4. Link everything with consistent IDs
  5. Record exceptions and outcomes on the span and in the logs

When configured with an appropriate exporter, this creates enriched logs that contain both application events and their complete context.
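
How those records leave the process depends on your logging pipeline. As a minimal sketch, assuming the OpenTelemetry Logback appender bridge (declared in logback.xml with MDC capture enabled) and an OTLP-capable collector at localhost:4317, the wiring could look roughly like this:

import io.opentelemetry.exporter.otlp.logs.OtlpGrpcLogRecordExporter;
import io.opentelemetry.instrumentation.logback.appender.v1_0.OpenTelemetryAppender;
import io.opentelemetry.sdk.OpenTelemetrySdk;
import io.opentelemetry.sdk.logs.SdkLoggerProvider;
import io.opentelemetry.sdk.logs.export.BatchLogRecordProcessor;

public class TelemetryBootstrap {
    public static OpenTelemetrySdk initOpenTelemetry() {
        // Ship log records over OTLP to the same collector that receives traces and metrics
        SdkLoggerProvider loggerProvider = SdkLoggerProvider.builder()
                .addLogRecordProcessor(BatchLogRecordProcessor.builder(
                        OtlpGrpcLogRecordExporter.builder()
                                .setEndpoint("http://localhost:4317") // assumed collector endpoint
                                .build())
                        .build())
                .build();

        OpenTelemetrySdk sdk = OpenTelemetrySdk.builder()
                .setLoggerProvider(loggerProvider)
                .build();

        // Bridge Logback/SLF4J log events (including the MDC fields set above)
        // into OpenTelemetry log records
        OpenTelemetryAppender.install(sdk);
        return sdk;
    }
}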

Powerful queries across previously separate data

With data that has not yet been enriched, there is still hope. First, with GenAI-powered ingestion it is possible to extract key fields, such as session IDs, that help correlate data. This enriches your logs with the structure they need to behave like other signals. Below we can see Elastic's Auto Import mechanism, which automatically generates ingest pipelines and pulls unstructured information from logs into a structured format that is perfect for analytics.
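
As a rough sketch of the kind of pipeline such a mechanism could generate (the pipeline name, grok pattern, and field names here are invented for illustration), a single grok processor can already lift a timestamp, log level, and session ID out of an unstructured line:

PUT _ingest/pipeline/checkout-logs
{
  "description": "Extract structured fields from unstructured checkout logs",
  "processors": [
    {
      "grok": {
        "field": "message",
        "patterns": [
          "\\[%{TIMESTAMP_ISO8601:timestamp}\\] %{LOGLEVEL:log.level} %{WORD:service.name} - %{GREEDYDATA:event.reason} session=%{NOTSPACE:session.id}"
        ]
      }
    }
  ]
}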

Once you have this data in the same data store, you can perform powerful join queries that were previously impossible. For example, finding slow database queries that affected specific API endpoints:

FROM logs-nginx.access-default 
| LOOKUP JOIN .ds-logs-mysql.slowlog-default-2025.05.01-000002 ON request_id 
| KEEP request_id, mysql.slowlog.query, url.query 
| WHERE mysql.slowlog.query IS NOT NULL

This query joins web server logs with database slow query logs, allowing you to directly correlate user-facing performance with database operations.

For GenAI interfaces, these complex queries can be generated automatically from natural language questions:

"Show me all checkout failures that coincided with slow database queries"

The AI translates this into appropriate queries across your unified data store, correlating application errors with database performance.
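
For example, the question above might be translated into something along these lines, where the index and field names are illustrative assumptions about how your data is laid out:

FROM logs-checkout.service-default
| WHERE log.level == "ERROR" AND message LIKE "*checkout*"
| LOOKUP JOIN .ds-logs-mysql.slowlog-default-2025.05.01-000002 ON request_id
| WHERE mysql.slowlog.query IS NOT NULL
| KEEP @timestamp, request_id, message, mysql.slowlog.query
| SORT @timestamp DESC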

Real-world applications and use cases

Natural language investigation

Imagine asking your observability system:

"Why did checkout latency spike at 09:17 yesterday?"

A GenAI-powered system with unified data could respond:

"Checkout latency increased by 230% at 09:17:32 following deployment v2.4.1 at 09:15. The root cause appears to be increased MySQL query times in the inventory-service. Specifically, queries to the 'product_availability' table are taking an average of 2300ms compared to the normal 95ms. This coincides with a CPU spike on database host db-03 and 24 'Lock wait timeout' errors in the inventory service logs."

Here's an example of Claude Desktop connected to Elastic's MCP (Model Context Protocol) Server, which demonstrates how powerful natural language investigations can be. We ask Claude to "analyze my web traffic patterns," and, as you can see, it correctly identifies that it is looking at our demo environment.

Unknown problem detection

GenAI can identify subtle patterns by correlating signals that would be missed in siloed systems. For example, it might notice that a specific customer ID appears in error logs only when a particular network path is taken through your microservices—indicating a data corruption issue affecting only certain user flows.

Predictive maintenance

By analyzing the unified historical patterns leading up to previous incidents, GenAI can identify emerging problems before they cause outages:

"Warning: Current load pattern on authentication-service combined with increasing error rates in user-profile-service matches 87% of the signature that preceded the April 3rd outage. Recommend scaling user-profile-service pods immediately."

The future: agentic AI for observability

The next frontier is agentic AI, systems that not only analyze but take action automatically.

These AI agents could:

  1. Continuously monitor all observability signals
  2. Autonomously investigate anomalies
  3. Implement fixes for known patterns
  4. Learn from the effectiveness of previous interventions

For example, an observability agent might:

  • Detect increased error rates in a service
  • Analyze logs and traces to identify a memory leak
  • Correlate with recent code changes
  • Increase the memory limit temporarily
  • Create a detailed ticket with the root cause analysis
  • Monitor the fix effectiveness

This is about creating systems that understand your application's behavior patterns deeply enough to maintain them proactively. You can see how this works in Elastic Observability in the screenshot: at the end of the RCA we send an email summary, but this step could trigger any action.

Business outcomes

Unifying observability data for GenAI analysis delivers concrete benefits:

  • Faster resolution times: Problems that previously required hours of manual correlation can be diagnosed in seconds
  • Fewer escalations: Junior engineers can leverage AI to investigate complex issues before involving specialists
  • Improved system reliability: Earlier detection and resolution of emerging issues
  • Better developer experience: Less time spent context-switching between tools
  • Enhanced capacity planning: More accurate prediction of resource needs

Implementation steps

Ready to start your observability transformation? Here's a practical roadmap:

  1. Adopt OpenTelemetry: Standardize on OpenTelemetry for all telemetry data collection and use it to generate enriched logs.
  2. Choose a unified storage solution: Select a platform that can efficiently store and query metrics, logs, traces and enriched logs together
  3. Enrich your telemetry: Update application instrumentation to include relevant context
  4. Create correlation IDs: Ensure every request carries consistent identifiers (trace IDs, session IDs) across services (see the baggage sketch after this list)
  5. Implement semantic conventions: Follow consistent naming patterns across your telemetry data
  6. Start with focused use cases: Begin with high-value scenarios like checkout flows or critical APIs
  7. Leverage GenAI tools: Integrate tools that can analyze your unified data and respond to natural language queries
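
For step 4, one lightweight option is to carry correlation identifiers as OpenTelemetry baggage so they propagate across service boundaries along with the trace context. Here is a minimal sketch, where the session.id key and handleRequest callback are hypothetical:

import io.opentelemetry.api.baggage.Baggage;
import io.opentelemetry.context.Scope;

public class SessionCorrelation {
    public void withSession(String sessionId, Runnable handleRequest) {
        // Attach the session identifier to the current context as baggage so
        // downstream spans, logs, and metrics can pick it up via propagation
        Baggage baggage = Baggage.current().toBuilder()
                .put("session.id", sessionId) // hypothetical correlation key
                .build();
        try (Scope ignored = baggage.makeCurrent()) {
            handleRequest.run();
        }
    }
}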

Remember, AI can only be as smart as the data you feed it. The quality and completeness of your telemetry data will determine the effectiveness of your AI-powered observability.

Generative AI: an evolutionary catalyst for observability

The unification of observability data for GenAI analysis represents an evolutionary leap forward comparable to the transition from Internet 1.0 to 2.0. Early adopters will gain a significant competitive advantage through faster problem resolution, improved system reliability, and more efficient operations. GenAI is a huge step toward increasing observability maturity and moving your team to a more proactive stance.

Think of traditional observability as a doctor trying to diagnose a patient while only able to see their heart rate. Unified observability with GenAI is like giving that doctor a complete health picture: vital signs, lab results, medical history, and genetic data, all accessible through natural conversation.

As SREs, we stand at the threshold of a new era in system observability. The asteroid of GenAI isn't a threat to be feared; it's an opportunity to evolve our practices and tools to build more reliable, understandable systems. The question isn't whether this transformation will happen, but who will lead it.

Will you?
