LLM Observability with Elastic, OpenLIT and OpenTelemetry

The realm of technology is evolving rapidly, and Large Language Models (LLMs) are at the forefront of this transformation. From chatbots to intelligent application copilots, LLM applications are becoming increasingly sophisticated. As these applications grow more complex, ensuring their reliability and performance is paramount. This is where observability steps in, aided by OpenTelemetry and Elastic through the OpenLIT instrumentation library.

OpenLIT is an open-source observability and evaluation tool that helps take your LLM apps from playground to debugging to production. With OpenLIT you can choose from a range of integrations (across LLMs, VectorDBs, frameworks, and GPUs) to start tracking LLM performance, usage, and costs without hassle. In this blog we will look at tracking OpenAI and LangChain and sending the telemetry to an OpenTelemetry-compatible endpoint like Elastic.

Elastic supports OpenTelemetry natively: it can take telemetry directly from the application (via the OpenTelemetry SDKs) or through a native OTel Collector. No special agents are needed. Additionally, Elastic's EDOT provides a supported set of OTel SDKs and an OTel Collector. In this blog we will connect our application directly to Elastic without a collector, for simplicity.

Why Observability Matters for LLM Applications

Monitoring LLM applications is crucial for several reasons.

It’s vital to track how often LLMs are called, both to understand usage and to control cost.
Latency is important to track, since the response time from the model can vary based on the inputs passed to the LLM.
Rate limiting is a common challenge, particularly for external LLMs, as applications depend more and more on these external API calls. Hitting a rate limit can prevent an application from performing the essential functions that rely on the LLM.

By keeping a close eye on these aspects, you can not only save costs but also avoid hitting request limits, ensuring your LLM applications perform optimally.

What are the signals that you should be looking at?

Using Large Language Models (LLMs) in applications differs from using traditional machine learning (ML) models. Primarily, LLMs are often accessed through external API calls instead of being run locally or in-house. It is crucial to capture the sequence of events (using traces), especially in a RAG-based application where there can be events before and after LLM usage. Analyzing aggregated data (through metrics) such as requests, tokens, and cost gives a quick overview that is important for optimizing performance and managing costs. Here are the key signals to monitor:

Traces

Request Metadata: This is important in the context of LLMs, given the variety of parameters (like temperature and top_p) that can drastically affect both the response quality and the cost. Specific aspects to monitor are:

Temperature: Indicates the level of creativity or randomness desired from the model’s outputs. Varying this parameter can significantly impact the nature of the generated content.
top_p: Controls how selective the model is by restricting sampling to the most likely tokens whose cumulative probability reaches the top_p threshold. A high “top_p” value means the model considers a wider range of tokens, making the text more varied.
Model Name or Version: Essential for tracking over time, as updates to the LLM might affect performance or response characteristics.
Prompt Details: The exact inputs sent to the LLM, which, unlike in-house ML models where inputs might be more controlled and homogeneous, can vary wildly and affect output complexity and cost implications.
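
To make the effect of temperature and top_p more concrete, here is a minimal, simplified sketch of how these two parameters are commonly understood to shape token selection. It is an illustration only, not the implementation of any particular LLM provider, and the function names and example logits are made up for this post.

```python
import math


def apply_temperature(logits, temperature):
    """Turn raw scores into probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nucleus(probs, top_p):
    """Return indices of the smallest set of top tokens whose cumulative probability reaches top_p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return kept


# Hypothetical scores for three candidate tokens
logits = [2.0, 1.0, 0.0]
probs = apply_temperature(logits, temperature=1.0)
print(nucleus(probs, top_p=0.5))   # a low top_p keeps only the most likely token(s)
print(nucleus(probs, top_p=0.95))  # a high top_p keeps a wider set
```

A lower temperature concentrates probability on the top token, while a higher top_p widens the candidate set, which is why both values are worth recording on every request.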

Response Metadata: Given the API-based interaction with LLMs, tracking the specifics of the response is key for cost management and quality assessment:

Tokens: Directly impact cost and are a measure of response length and complexity.
Cost: Critical for budgeting, as API-based costs can scale with the number of requests and the complexity of each request.
Completion Details: Similar to the prompt details but from the response perspective, providing insights into the model’s output characteristics and potential areas of inefficiency or unexpected cost.
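
As a rough illustration of how token counts translate into cost, the sketch below estimates the cost of a single request from its prompt and completion tokens. The per-1,000-token prices are hypothetical placeholders, not real provider pricing, and the helper names are invented for this example.

```python
from dataclasses import dataclass


@dataclass
class Pricing:
    """Hypothetical price per 1,000 tokens, in USD."""
    prompt_per_1k: float
    completion_per_1k: float


def estimate_cost(prompt_tokens, completion_tokens, pricing):
    """Estimate the cost of one request from its token counts."""
    prompt_cost = prompt_tokens / 1000 * pricing.prompt_per_1k
    completion_cost = completion_tokens / 1000 * pricing.completion_per_1k
    return prompt_cost + completion_cost


# Placeholder prices; substitute your provider's actual rates
example_pricing = Pricing(prompt_per_1k=0.03, completion_per_1k=0.06)
cost = estimate_cost(prompt_tokens=1200, completion_tokens=300, pricing=example_pricing)
print(f"Estimated cost: ${cost:.4f}")
```

Because prompt and completion tokens are often priced differently, it helps to capture both counts separately rather than only a total.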

Metrics

Request Volume: The total number of requests made to the LLM service. This helps in understanding the demand patterns and identifying any anomaly in usage, such as sudden spikes or drops.

Request Duration: The time it takes for a request to be processed and a response to be received from the LLM. This includes network latency and the time the LLM takes to generate a response, providing insights into the performance and reliability of the LLM service.

Costs and Tokens Counters: Keeping track of the total cost accrued and tokens consumed over time is essential for budgeting and cost optimization strategies. Monitoring these metrics can alert you to unexpected increases that may indicate inefficient use of the LLM or the need for optimization.
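
The three metrics above can be pictured with a small in-memory aggregator. This is only a sketch of what gets aggregated; with OpenLIT these values are emitted as OpenTelemetry metrics for you, and the class and field names here are hypothetical.

```python
from collections import defaultdict


class LLMMetrics:
    """Toy aggregator for request volume, duration, tokens, and cost per model."""

    def __init__(self):
        self.durations = defaultdict(list)     # request durations in seconds
        self.total_tokens = defaultdict(int)   # running token counter
        self.total_cost = defaultdict(float)   # running cost counter (USD)

    def record(self, model, duration_s, tokens, cost):
        """Record one completed LLM request."""
        self.durations[model].append(duration_s)
        self.total_tokens[model] += tokens
        self.total_cost[model] += cost

    def summary(self, model):
        """Summarize everything recorded so far for one model."""
        durations = self.durations[model]
        count = len(durations)
        return {
            "requests": count,
            "avg_duration_s": sum(durations) / count if count else 0.0,
            "max_duration_s": max(durations) if count else 0.0,
            "tokens": self.total_tokens[model],
            "cost": self.total_cost[model],
        }


metrics = LLMMetrics()
metrics.record("gpt-4", duration_s=1.0, tokens=100, cost=0.01)
metrics.record("gpt-4", duration_s=3.0, tokens=300, cost=0.03)
print(metrics.summary("gpt-4"))
```

A sudden jump in the request count or in tokens per request in a summary like this is the kind of anomaly worth alerting on.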

Implementing Automatic Instrumentation with OpenLIT

OpenLIT automates telemetry data capture, simplifying the process for developers. Here’s a step-by-step guide to setting it up:

1. Install the OpenLIT SDK:

First, you must install the following package:

pip install openlit

Note: OpenLIT currently supports Python, a popular language for Generative AI. The team is also working on expanding support to JavaScript soon.

2. Get your Elastic APM Credentials

Sign in to your Elastic cloud account.
Open the side navigation and click on APM under Observability.
Make sure the APM Server is running.
In the APM Agents section, select OpenTelemetry and jump directly to Step 5 (Configure OpenTelemetry in your application).
Copy and save the configuration values for OTEL_EXPORTER_OTLP_ENDPOINT and OTEL_EXPORTER_OTLP_HEADERS.

3. Set Environment Variables:

The OpenTelemetry environment variables for Elastic can be set as follows on Linux (or in code). See the Elastic OTel documentation for details.

export OTEL_EXPORTER_OTLP_ENDPOINT="YOUR_ELASTIC_APM_OTLP_url"
export OTEL_EXPORTER_OTLP_HEADERS="YOUR_ELASTIC_APM_AUTH"

Note: Make sure to replace the space after Bearer with %20:

OTEL_EXPORTER_OTLP_HEADERS="Authorization=Bearer%20[APIKEY]"
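
If you would rather set these variables in code, a minimal sketch could look like the following. The helper names, the example endpoint, and the example credential are hypothetical; the only details taken from the steps above are the two variable names and the %20 encoding of the space after Bearer. The variables should be set before the OpenTelemetry SDK is initialized so that the exporter can pick them up.

```python
import os


def otlp_auth_header(credential, scheme="Bearer"):
    """Build the header value with the space after the scheme encoded as %20."""
    return f"Authorization={scheme}%20{credential}"


def configure_otlp(endpoint, credential):
    """Set the two OTLP exporter environment variables for this process."""
    os.environ["OTEL_EXPORTER_OTLP_ENDPOINT"] = endpoint
    os.environ["OTEL_EXPORTER_OTLP_HEADERS"] = otlp_auth_header(credential)


# Placeholder values; use the endpoint and credential copied from Elastic
configure_otlp("https://example.invalid:443", "abc123")
print(os.environ["OTEL_EXPORTER_OTLP_HEADERS"])
```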

4. Initialize the SDK:

Add the following to your LLM application code:

import openlit
openlit.init()

Optionally, you can customize the application name and environment by setting the application_name and environment parameters when initializing OpenLIT in your application. These parameters configure the OTel attributes service.name and deployment.environment, respectively. For more details on other configuration settings, check out the OpenLIT GitHub Repository.

openlit.init(application_name="YourAppName", environment="Production")
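
If you prefer to drive these two settings from configuration instead of hard-coding them, one option is a small helper like the sketch below. The variable names APP_NAME and APP_ENV and the fallback values are hypothetical choices for this example; only the application_name and environment parameters come from OpenLIT.

```python
import os


def openlit_init_kwargs(env=os.environ):
    """Collect the keyword arguments for openlit.init() from a mapping of settings."""
    return {
        # APP_NAME and APP_ENV are made-up variable names for this sketch
        "application_name": env.get("APP_NAME", "my-llm-app"),
        "environment": env.get("APP_ENV", "development"),
    }


def init_observability(env=os.environ):
    """Initialize OpenLIT with the collected settings."""
    import openlit  # imported lazily so the helper above works without the package

    openlit.init(**openlit_init_kwargs(env))


print(openlit_init_kwargs({"APP_NAME": "translator", "APP_ENV": "staging"}))
```

Calling init_observability() once at application startup then tags all telemetry with the resulting service.name and deployment.environment.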

The most popular libraries in GenAI are OpenAI (for accessing LLMs) and LangChain (for orchestrating steps). An example instrumentation of a LangChain and OpenAI based LLM application looks like this:

import getpass
import os
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage, SystemMessage
import openlit

# Auto-instruments LLM and VectorDB calls, sending OTel traces and metrics to the configured endpoint
openlit.init()

os.environ["OPENAI_API_KEY"] = getpass.getpass()
model = ChatOpenAI(model="gpt-4")
messages = [
    SystemMessage(content="Translate the following from English into Italian"),
    HumanMessage(content="hi!"),
]
model.invoke(messages)

Visualizing Data with Kibana

Once your LLM application is instrumented, visualizing the collected data is the next step. Follow the steps below to import a pre-built Kibana dashboard and get started:

Copy the dashboard NDJSON provided here and save it in a file with the extension .ndjson.
Log in to your Elastic instance.
Go to Stack Management > Saved Objects.
Click Import and upload your file containing the dashboard NDJSON.
Click Import and you should have the dashboard available.
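
Before uploading, it can be handy to check that the file you saved really is valid NDJSON, since a stray line break introduced while copying may cause the import to fail. The sketch below is one way to do that; it assumes each line is a JSON object and that saved objects carry a "type" field, and the sample content is made up rather than taken from the actual dashboard.

```python
import json
from collections import Counter


def summarize_ndjson(text):
    """Parse NDJSON text and count the objects on each line by their "type" field."""
    counts = Counter()
    for line_number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            obj = json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {line_number} is not valid JSON: {exc}") from exc
        if not isinstance(obj, dict):
            raise ValueError(f"line {line_number} is not a JSON object")
        counts[obj.get("type", "unknown")] += 1
    return counts


# Made-up sample; in practice, read the text from your saved .ndjson file
sample = '{"type": "dashboard", "id": "1"}\n{"type": "lens", "id": "2"}\n{"type": "lens", "id": "3"}\n'
print(summarize_ndjson(sample))
```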

The dashboard provides an in-depth overview of system metrics through the following key areas: Total Successful Requests, Request Duration Distribution, Request Rates, Usage Cost and Tokens, Top GenAI Models, GenAI Requests by Platform and Environment, and Token Consumption vs. Cost. Together, these metrics help identify peak usage times, latency issues, and rate limits, and they inform resource allocation, performance tuning, and cost management. This breakdown helps you understand LLM performance, keep operation consistent across environments, plan budgets, and troubleshoot issues.

You can also see the OpenTelemetry traces from OpenLIT in Elastic APM, which lets you inspect each LLM request in detail.

Conclusion

Observability is crucial for the efficient operation of LLM applications. OpenTelemetry's open standards and extensive support, combined with Elastic's APM, AIOps, and analytics, and with OpenLIT's powerful and easy auto-instrumentation for 20+ GenAI tools from LLMs to VectorDBs, enable complete visibility into LLM performance.

Hopefully, this provides an easy-to-understand walk-through of instrumenting LangChain with OpenTelemetry and OpenLIT, and shows how easy it is to send traces into Elastic.

Additional resources for OpenTelemetry with Elastic: