October 14, 2024

OpenTelemetry (OTel) for LLM Observability

Explore the challenges of LLM observability and the current state of using OpenTelemetry (OTel) for standardized instrumentation.

Introduction to OpenTelemetry

OpenTelemetry is an open-source observability framework designed to handle the instrumentation of applications for collecting traces, metrics, and logs. It helps developers monitor and troubleshoot complex systems by providing standardized tools and practices for data collection and analysis.

OpenTelemetry supports various exporters and backends, making it flexible and adaptable to different environments. By using OpenTelemetry, applications can achieve better visibility into their operations, aiding in root cause analysis and performance optimization.
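
To make this concrete, here is a minimal sketch of emitting a span with the OpenTelemetry Python SDK. It is not LLM-specific; the span name and attributes are arbitrary examples, and the console exporter stands in for whatever backend you actually use.

```python
# Minimal OpenTelemetry tracing setup with the official Python SDK
# (pip install opentelemetry-sdk). Span name and attributes are
# arbitrary examples, not a prescribed schema.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Register a tracer provider that prints finished spans to stdout;
# in production this would typically be an OTLP exporter pointing at a backend.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("demo-app")

with tracer.start_as_current_span("handle-request") as span:
    span.set_attribute("http.route", "/chat")
    # ... application logic runs inside the span ...
```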

Goal of this post

This post is a high-level overview of the challenges of LLM observability and the current state of using OpenTelemetry (OTel) for LLMOps.

OTel is geared towards general observability, and traces are a great, standardized way to capture LLM application data (we have recorded a webinar on this). While we are excited about OTel and the roadmap towards it across LLMOps tools, many teams currently prefer non-OTel LLMOps tools. This post explores why that is the case and how OTel can address these challenges in the future.

Example trace of our public demo

Outline

  1. Overview of LLM Application Observability
    • Unique Challenges
    • Comparison with Traditional Observability
    • Experimentation vs. Production Monitoring
  2. OpenTelemetry (OTel) for LLM Observability
    • Current State
    • My Personal View

1. Overview of LLM Application Observability

LLM Application Observability refers to the ability to monitor and understand how Large Language Model applications function, especially focusing on aspects like performance, reliability, and user interactions. This involves collecting and analyzing data such as traces, metrics, and logs to troubleshoot issues and optimize the application.

Unique Challenges

LLM applications present distinct challenges compared to traditional software systems. Evaluating the quality of LLM outputs is inherently complex due to their non-deterministic nature. Metrics like cost, latency, and quality must be balanced and cannot be purely derived from traces as they are in traditional applications.

Additionally, the interactive and context-sensitive nature of LLM tasks often requires real-time monitoring and rapid adaptation. Addressing these challenges demands robust tools and frameworks that can handle the dynamic and evolving nature of LLM applications.
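
To make the distinction concrete: latency and cost can usually be computed from data already captured on a trace (timestamps, token counts), while quality cannot. The sketch below uses a hypothetical per-token price table and a made-up span record purely for illustration.

```python
# Illustration only: latency and cost are derivable from trace data,
# quality is not. The prices and the record layout are hypothetical.
PRICE_PER_1K_TOKENS = {"input": 0.005, "output": 0.015}  # hypothetical prices

llm_span = {
    "start_ms": 1_728_900_000_000,
    "end_ms": 1_728_900_001_850,
    "input_tokens": 1_200,
    "output_tokens": 300,
}

latency_ms = llm_span["end_ms"] - llm_span["start_ms"]
cost_usd = (
    llm_span["input_tokens"] / 1000 * PRICE_PER_1K_TOKENS["input"]
    + llm_span["output_tokens"] / 1000 * PRICE_PER_1K_TOKENS["output"]
)
# Quality has no equivalent formula: it requires ex-post evaluations,
# annotations, or user feedback attached to the trace later.
```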

Comparison with Traditional Observability

Traditional observability focuses on identifying exceptions and compliance with expected behaviors. LLM observability, however, requires monitoring dynamic and stochastic outputs, making it harder to standardize and interpret.

| | Observability | LLM Observability |
|---|---|---|
| Async instrumentation (not in critical path) | ✓ | ✓ |
| Spans / traces (as core abstractions) | ✓ | ✓ |
| Metrics | ✓ | ✓ |
| Exceptions | At runtime | Ex-post (evaluations, annotations, user feedback, …) |
| Main use cases | Alerts, metrics, aggregated performance breakdowns | Debug single traces, build datasets for application benchmarking/testing, monitor hallucinations/evals |
| Users | Ops | MLE, SWE, data scientists, non-technical |
| Focus | Holistic system | Focus on what's critical for the LLM application |
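
One practical consequence of the "ex-post" row: quality scores from evaluations, annotations, or user feedback are attached to a trace after it has been recorded, not raised at runtime like an exception. A hedged sketch using the Langfuse Python SDK (the trace ID, score name, and values are placeholders; the exact method signature may differ between SDK versions):

```python
# Attach an ex-post quality score to a previously recorded trace.
# pip install langfuse; trace_id and score values below are placeholders.
from langfuse import Langfuse

langfuse = Langfuse()  # reads LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY from env

langfuse.score(
    trace_id="abc-123",     # placeholder: ID of an existing trace
    name="hallucination",   # name of the evaluation
    value=0.0,              # e.g. 0 = no hallucination detected
    comment="LLM-as-a-judge evaluation, run nightly",
)
```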

Experimentation vs. Production Monitoring

In development, experimentation with various models and configurations is crucial. Developers iterate on different approaches to fine-tune model behavior, optimize performance metrics, and explore new functionalities.

Production monitoring, however, shifts the focus to real-time performance tracking. It involves constant vigilance to ensure the application runs smoothly: identifying latency issues, tracking costs, and incorporating user interactions and feedback to continuously improve the application. Both phases are essential, but they have distinct objectives: development pushes the boundaries of what the LLM application can achieve, while production monitoring ensures it operates reliably in real-world scenarios.

| Development | Production |
|---|---|
| Debug step-by-step, especially when using frameworks | Monitor: cost / latency / quality |
| Run experiments on datasets and evaluations | Debug issues identified in prod based on user feedback, evaluations, and manual annotations |
| Document and share experiments | Cluster user intents |
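
To illustrate the development-side workflow from the table above, here is a generic sketch of an experiment loop over a small dataset with an evaluator. The dataset items, the application call, and the evaluation function are all placeholders.

```python
# Generic development-time experiment: run the application over a dataset
# and score each output. All data and functions here are placeholders.
dataset = [
    {"input": "What is OpenTelemetry?", "expected": "observability framework"},
    {"input": "What is a span?", "expected": "unit of work in a trace"},
]

def run_app(question: str) -> str:
    # placeholder: call the LLM application under test
    return "OpenTelemetry is an open-source observability framework"

def evaluate(output: str, expected: str) -> float:
    # placeholder: could be string matching, an LLM-as-a-judge, etc.
    return 1.0 if expected.lower() in output.lower() else 0.0

scores = [evaluate(run_app(item["input"]), item["expected"]) for item in dataset]
print(f"avg score: {sum(scores) / len(scores):.2f}")
```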

2. OpenTelemetry (OTel) for LLM Observability

Current State

The OpenTelemetry Special Interest Group (SIG) focused on “Generative AI Observability” pushes for standardized semantic conventions for LLM/GenAI Applications and instrumentation libraries for the most popular model vendors and frameworks. Learn more about the SIG in its project doc and meeting notes.

Deliverables of the working group (as of Oct 14, 2024) include:

Immediate term:

  • Ship OTel instrumentation libraries for OpenAI (or any other GenAI client) in Python and JS following existing conventions

Middle term:

  • Ship OpenTelemetry (or native) instrumentations for popular GenAI client libraries in Python and JS covering chat calls
  • Evolve GenAI semantic conventions to cover other popular GenAI operations such as embeddings, image or audio generation

As a result, we should have feature parity with the instrumentations of existing GenAI Observability vendors for a set of client instrumentation libraries that all vendors can depend upon.

Long term:

  • Implement instrumentations for GenAI orchestrators and GenAI frameworks for popular libraries in different languages
  • Evolve GenAI and other relevant conventions (DB) to cover complex multi-step scenarios such as RAG
  • Propose mature instrumentations to upstream libraries/frameworks

Currently, there’s a mix of progress and ongoing challenges. Significant issues include dealing with large traces, diverse LLM schema implementations (often biased towards OpenAI), and capturing evaluations and annotations. Many OTel-based LLM instrumentation libraries don’t strictly adhere to evolving conventions, resulting in vendor-specific solutions.
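
For illustration, here is what a hand-instrumented LLM call can look like with attributes in the spirit of the GenAI semantic conventions. The `gen_ai.*` attribute names below follow the draft conventions as they stood in late 2024 and may change as the SIG iterates; the model name and token counts are example values.

```python
# Manually recorded LLM span with gen_ai.* attributes in the spirit of the
# (still evolving) GenAI semantic conventions; names may change over time.
from opentelemetry import trace

tracer = trace.get_tracer("llm-app")

with tracer.start_as_current_span("chat gpt-4o-mini") as span:
    span.set_attribute("gen_ai.system", "openai")
    span.set_attribute("gen_ai.request.model", "gpt-4o-mini")
    span.set_attribute("gen_ai.request.temperature", 0.2)
    # ... perform the actual chat completion call here ...
    span.set_attribute("gen_ai.response.model", "gpt-4o-mini")
    span.set_attribute("gen_ai.usage.input_tokens", 1200)   # example values
    span.set_attribute("gen_ai.usage.output_tokens", 300)
```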

My Personal View

Despite the challenges, I’m excited about OTel instrumentation in the mid-term. The real value lies in its standardized data model, enabling seamless workflow integration across various frameworks and platforms. This standardization leads to increased interoperability across vendors, which is the main reason why OTel is interesting. Currently, we maintain countless integrations with popular models/frameworks/languages but can’t support the long tail due to capacity constraints. Standardizing on OTel will allow the ecosystem to crowdsource instrumentation efforts, benefiting everyone and enabling LLMOps vendors to focus more on core features rather than maintaining numerous integrations. These developments are essential for achieving consistent and reliable observability across diverse LLM frameworks and platforms.

We are committed to OTel and are happy to contribute to the SIG. We will continue to maintain our integrations and SDKs and are currently exploring adding an OTel collector to allow for integrations with OTel-based instrumentation libraries.
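
If such a collector endpoint becomes available, applications that are already instrumented with OTel would only need to point an OTLP exporter at it. The endpoint URL and authorization header in the sketch below are purely hypothetical placeholders, not a real or announced API.

```python
# Point an OTLP/HTTP span exporter at a collector endpoint.
# pip install opentelemetry-exporter-otlp-proto-http
# The endpoint and auth header are hypothetical placeholders.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

exporter = OTLPSpanExporter(
    endpoint="https://collector.example.com/v1/traces",  # placeholder
    headers={"Authorization": "Basic <credentials>"},    # placeholder
)

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)
```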

💡 If you are interested in contributing to our OTel efforts, join the GitHub Discussion thread.

Get Started

If you want to get started with tracing your AI applications with Langfuse today, check out our quickstart guide on how to use Langfuse with multiple LLM building frameworks like Langchain or LlamaIndex.
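
As one hedged example of such an integration, tracing a Langchain call with the Langfuse callback handler looks roughly like this (import paths and usage follow the Python SDK at the time of writing; check the quickstart guide for current details):

```python
# Trace a Langchain call with the Langfuse callback handler.
# pip install langfuse langchain-openai
# Import path and usage reflect the SDK as of late 2024 and may change.
from langfuse.callback import CallbackHandler
from langchain_openai import ChatOpenAI

handler = CallbackHandler()  # reads Langfuse keys from environment variables
llm = ChatOpenAI(model="gpt-4o-mini")

response = llm.invoke(
    "Summarize what OpenTelemetry is in one sentence.",
    config={"callbacks": [handler]},  # spans are sent to Langfuse asynchronously
)
print(response.content)
```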

If you are curious about why Traces are a good fit for LLM observability, check out our webinar on the topic.
