AI Architect Academy

LLM observability: tracing, monitoring, and debugging agents in production

Short answer

LLM observability is runtime visibility into an LLM or agent system: the traces, metrics, and logs that let you see what a model and its agent loop actually did on a given request, so failures are diagnosable in production rather than mysterious. A trace records each step — every model call, tool call, and retrieval — with its inputs, outputs, tokens, latency, and cost. It answers what happened on this run.

This is not the same as evaluation, which measures whether outputs are correct across a dataset. Observability tells you what the system did; evaluation tells you whether it was good. You need both, and this page covers the first.

What LLM observability means

Observability is a property borrowed from systems engineering: a system is observable if you can infer its internal state from its outputs. For traditional services, those outputs are the classic three signals — traces, metrics, and logs. LLM observability applies the same idea to a model-driven system, where the hardest-to-see internal state is the model's own decisions: which tool it chose, what it sent, what came back, and why the loop kept going.

The reason it is a named discipline at all is non-determinism. A deterministic function returns the same output for the same input, so a stack trace and a local re-run usually pin the bug. An LLM given the same input can return different text, call a different tool, or take a different number of steps. Re-running a failed production request locally may not reproduce the failure, so the reliable way to debug it is to have captured what happened the first time. Observability is that capture. It is one part of the operational plane that wraps every production agent system.

The signals: traces, metrics, logs

LLM observability rests on the same three signals as any observable system, specialised for model and agent behaviour.

| Signal | What it captures | Question it answers |
|---|---|---|
| Traces | The full execution path of one request as a tree of spans: each model call, tool call, and retrieval, nested by parent, with timing and status. | What did the agent do, step by step, and where did it go wrong? |
| Metrics | Numeric aggregates over many requests: token usage, latency, cost, error rate, throughput, and per-step counts. | How is the system behaving in aggregate, and is it drifting? |
| Logs | The structured payloads: prompts, completions, tool arguments and results, and events like a retry or a budget cutoff. | What were the exact inputs and outputs at this point? |

Traces are the signal that makes LLM observability distinct. In a single completion there is little to trace; in an agent that loops, the trace is the whole story: a parent span for the run, child spans for each model turn, and grandchild spans for each tool call inside that turn. Token counts, latency, and cost are attached to spans as attributes, which is also where cost attribution comes from: you can see exactly which step in which run spent the budget.
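The span tree and the cost attribution it enables can be sketched in a few lines of plain Python. This is an illustration only, not any tool's data model: the `Span` class, its field names, the `kind` labels, and the sample numbers are all assumptions made up for the example.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Span:
    """One step in a run: the run itself, a model turn, or a tool call."""
    name: str
    kind: str                      # "run", "model", or "tool" (illustrative labels)
    latency_ms: float = 0.0
    input_tokens: int = 0
    output_tokens: int = 0
    cost_usd: float = 0.0          # cost of this step only, not its children
    children: List["Span"] = field(default_factory=list)

    def add(self, child: "Span") -> "Span":
        """Attach a child span and return it so calls can be chained."""
        self.children.append(child)
        return child

    def walk(self, depth: int = 0) -> Iterator[Tuple[int, "Span"]]:
        """Yield (depth, span) pairs in parent-before-child order."""
        yield depth, self
        for child in self.children:
            yield from child.walk(depth + 1)

    def total_cost(self) -> float:
        """Own cost plus the cost of everything nested underneath."""
        return self.cost_usd + sum(c.total_cost() for c in self.children)


def most_expensive_step(root: Span) -> Span:
    """Cost attribution: the single span with the highest own cost."""
    return max((s for _, s in root.walk()), key=lambda s: s.cost_usd)


# Hypothetical run: two model turns, each with one tool call.
run = Span("agent_run", "run")
turn1 = run.add(Span("turn_1", "model", latency_ms=820,
                     input_tokens=1200, output_tokens=90, cost_usd=0.004))
turn1.add(Span("search_docs", "tool", latency_ms=310))
turn2 = run.add(Span("turn_2", "model", latency_ms=1900,
                     input_tokens=9400, output_tokens=350, cost_usd=0.031))
turn2.add(Span("fetch_page", "tool", latency_ms=650))

for depth, span in run.walk():
    print("  " * depth + f"{span.name} [{span.kind}] "
          f"{span.latency_ms:.0f} ms ${span.cost_usd:.3f}")
print("most expensive step:", most_expensive_step(run).name)
```

In this made-up run the second model turn dominates the cost because its input context grew, which is the kind of per-step finding that only becomes visible once cost is recorded on each span rather than on the request as a whole.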

Observability vs evaluation vs monitoring

These three are routinely conflated, and the distinction is worth holding precisely because the tooling overlaps.

| Concept | Question | Operates on |
|---|---|---|
| Observability | What did the system do on this run? | Live production traffic, one request at a time |
| Evaluation | Are the outputs correct or good? | A dataset of inputs with expected behaviour, run offline or online |
| Monitoring | Is anything wrong right now? | Metrics and thresholds, with alerting on top |

Monitoring is best read as a use of observability data, not a separate thing: you monitor by putting alerts on the metrics that observability emits — a latency spike, an error-rate climb, a cost-per-request jump. Evaluation is the genuinely separate discipline. It needs a notion of ground truth or a judge, and it answers a question observability never can: was the answer right? For how to build that correctness harness — datasets, LLM-as-judge, regression gates — see the LLM evaluation guide. The practical link between them is that traces captured by observability become the raw material for evaluation datasets.
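Monitoring as "alerts on the metrics that observability emits" can be sketched as a threshold check over a window of request records. Everything named here is an assumption for illustration: the `RequestRecord` fields, the threshold values, and the nearest-rank percentile are placeholders, and a real deployment would normally define these rules in its metrics backend rather than in application code.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RequestRecord:
    """Per-request values, as they might be derived from trace data."""
    latency_ms: float
    cost_usd: float
    error: bool


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile; assumes a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(records: List[RequestRecord]) -> Dict[str, float]:
    """Aggregate one window of requests into the metrics we alert on."""
    n = len(records)
    return {
        "p95_latency_ms": percentile([r.latency_ms for r in records], 95),
        "error_rate": sum(r.error for r in records) / n,
        "avg_cost_usd": sum(r.cost_usd for r in records) / n,
    }


def breached(metrics: Dict[str, float], thresholds: Dict[str, float]) -> List[str]:
    """Names of the metrics that exceed their threshold."""
    return [name for name, limit in thresholds.items() if metrics[name] > limit]


# Illustrative thresholds only; tune these against your own baseline.
THRESHOLDS = {"p95_latency_ms": 3000.0, "error_rate": 0.05, "avg_cost_usd": 0.02}

window = [RequestRecord(900, 0.01, False)] * 8 + [
    RequestRecord(4200, 0.06, False),   # one slow, expensive request
    RequestRecord(1100, 0.01, True),    # one failed request
]

metrics = summarize(window)
print(metrics)
print("alerts:", breached(metrics, THRESHOLDS))
```

Note what the check cannot say: it flags the latency and error-rate breaches in this window, but nothing in it can tell whether any answer was correct. That question still belongs to evaluation.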

Observability for agents: per-step tracing

Single-call observability is nearly trivial: log the prompt, the completion, and the token count. Agents are where it earns its keep, for two reasons.

  • Multi-step. An agent run is a loop of model turns and tool calls, sometimes dozens deep. A failure three tool calls in is invisible unless every step is traced with its own span. Agent tracing is exactly this — capturing the nested span tree of a loop so you can open the failing run and walk it.
  • Non-deterministic control flow. The agent decides its own path, so two runs of the same task can diverge. Without a trace you cannot tell whether a bad result came from a wrong tool choice, a bad tool result, a context that overflowed, or the loop hitting its budget. The trace makes the branch point visible.

This is why a bounded agent loop and observability are designed together: the same loop that enforces a turn and tool-call budget is the natural place to emit a span per iteration. See agentic AI architecture for where this sits in the system, and note that a gateway in front of the model — such as an AI gateway — gives you a second, infrastructure-level vantage point: request logs, token metrics, and caching visibility without touching application code.
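A minimal sketch of that pairing: a loop with a turn budget that records one span per iteration. The `decide` callable stands in for a model call and `tools` for real tool implementations; both, along with the span field names and the `scripted_decide` stub, are hypothetical and exist only so the example runs on its own.

```python
import time
from typing import Callable, Dict, List


def run_agent(
    task: str,
    decide: Callable[[str, int], Dict],        # stand-in for a model call
    tools: Dict[str, Callable[[str], str]],    # stand-in tool registry
    max_turns: int = 5,
) -> Dict:
    """Run a bounded loop and record one span per iteration."""
    spans: List[Dict] = []
    context = task
    status = "budget_exhausted"                # overwritten if the agent finishes

    for turn in range(1, max_turns + 1):
        started = time.perf_counter()
        action = decide(context, turn)
        span = {"turn": turn, "action": action["type"],
                "tool": action.get("tool"), "status": "ok"}

        if action["type"] == "tool":
            try:
                context += "\n" + tools[action["tool"]](action["input"])
            except Exception as exc:           # record the failure on the span
                span["status"] = "error"
                span["error"] = repr(exc)

        span["latency_ms"] = (time.perf_counter() - started) * 1000
        spans.append(span)

        if action["type"] == "final":
            status = "completed"
            break

    return {"status": status, "turns": len(spans), "spans": spans}


def scripted_decide(context: str, turn: int) -> Dict:
    """Fake policy: call a tool twice, then finish on the third turn."""
    if turn < 3:
        return {"type": "tool", "tool": "lookup", "input": f"query {turn}"}
    return {"type": "final", "output": "done"}


tools = {"lookup": lambda q: f"result for {q}"}

ok = run_agent("find the answer", scripted_decide, tools, max_turns=5)
cut = run_agent("find the answer", scripted_decide, tools, max_turns=2)
print(ok["status"], ok["turns"])
print(cut["status"], cut["turns"])
```

Because the budget check and the span emission live in the same loop, a run that stops on its budget leaves a trace whose final status says so, rather than an unexplained truncated answer.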

OpenTelemetry and the GenAI semantic conventions

The portability question for LLM observability is the same one that vendor-neutral tracing solved years ago: if every tool defines its own span format, you are locked in. OpenTelemetry — the open standard for traces, metrics, and logs — addresses this for GenAI through its GenAI semantic conventions, a common vocabulary of gen_ai.* attributes for model calls, token-usage metrics, and tool and agent spans, so an instrumented app can emit to any compatible backend.

The honest caveat: as of mid-2026 these conventions are still experimental (OpenTelemetry marks them in Development status), so attribute names can change and you should pin a version. They are nonetheless already widely adopted, and several tools below speak them natively. Treat OpenTelemetry as the wire format you instrument against, and a tool as the backend that stores and visualises what it carries.
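To make the vocabulary concrete, the sketch below builds the attributes for one model-call span as a plain dictionary. The attribute keys and the span-naming pattern reflect the GenAI conventions as commonly documented at the time of writing, but because the conventions are experimental they should be checked against whichever version you pin. The helper names and the model name are hypothetical.

```python
from typing import Dict, Union

AttributeValue = Union[str, int]


def genai_span_attributes(
    operation: str,
    model: str,
    input_tokens: int,
    output_tokens: int,
) -> Dict[str, AttributeValue]:
    """Attributes for one model-call span, keyed by gen_ai.* names.

    Assumption: these keys match the convention version you pin; the
    conventions are experimental, so names may differ across versions.
    """
    return {
        "gen_ai.operation.name": operation,
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
    }


def span_name(attrs: Dict[str, AttributeValue]) -> str:
    """Span name in the '{operation} {model}' form (assumed convention)."""
    return f"{attrs['gen_ai.operation.name']} {attrs['gen_ai.request.model']}"


# "example-model" is a placeholder, not a real model identifier.
attrs = genai_span_attributes("chat", "example-model",
                              input_tokens=1200, output_tokens=90)
print(span_name(attrs))
for key, value in attrs.items():
    print(f"  {key} = {value}")
```

With the OpenTelemetry Python API installed, a dictionary like this could be attached to an active span through the span's `set_attributes` method. Keeping the key names in one helper also gives you a single place to update if a later convention version renames them.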

The main tools

The space spans open-source projects, dedicated commercial platforms, and LLM modules inside established APM suites. The list below is representative, not a ranking — the right choice depends on whether you want to self-host, how much you value an open standard, and whether LLM observability needs to live next to your existing infrastructure monitoring. Verify licensing and features against each project's own docs before committing.

| Tool | Type | What it is |
|---|---|---|
| Langfuse | Open source (MIT), with managed cloud | Tracing, evals, prompt management, and datasets for LLM and agent apps; OpenTelemetry-compatible and self-hostable. |
| LangSmith | Commercial (proprietary) | Framework-agnostic trace capture plus offline and online evals; managed cloud with self-hosted enterprise options. |
| Arize Phoenix | Source-available (Elastic License 2.0) | Tracing and evals built on OpenTelemetry and the Apache-2.0 OpenInference conventions; self-hostable. |
| Datadog LLM Observability | Commercial (part of a wider APM platform) | LLM and agent traces, operational metrics, and quality and safety evals alongside existing infrastructure monitoring. |
| OpenTelemetry GenAI | Open standard (experimental) | Vendor-neutral gen_ai.* conventions for spans and metrics: the instrumentation layer the tools above can ingest, not a backend itself. |

A reasonable default heuristic: if you want an open standard and self-hosting, start from an OpenTelemetry-native open-source tool; if LLM observability must sit next to existing infrastructure monitoring, an APM vendor's module reduces moving parts; if you want the deepest LLM-specific tracing and eval workflow out of the box, a dedicated platform is usually furthest ahead. None of these removes the need for a separate evaluation harness.

Frequently asked questions

What is LLM observability?

LLM observability is runtime visibility into an LLM or agent system through three signals — traces, metrics, and logs. A trace records the full execution path of one request: every model call, tool call, and retrieval, with inputs, outputs, tokens, latency, and cost. It exists because LLMs are non-deterministic, so you cannot reproduce a production failure by re-running it; the only way to debug is to have captured what happened the first time.

How is observability different from evaluation?

Observability tells you what the system did on a given run; evaluation tells you whether the output was correct or good. Observability operates on live traffic one request at a time and needs no ground truth. Evaluation operates on a dataset with expected behaviour or a judge, and produces a quality score. They are complementary: traces captured by observability often become the dataset that evaluation runs against. See the evaluation guide for the correctness side.

What is agent tracing?

Agent tracing is capturing the full nested span tree of an agent run — a parent span for the run, child spans for each model turn, and grandchild spans for each tool call inside that turn. Because an agent loops and decides its own path, a failure several steps deep is invisible without per-step tracing. The trace lets you open a failing run and walk it to the exact step that went wrong.

What should you monitor in an LLM app?

The core metrics are token usage, latency, cost per request, error rate, and throughput, plus per-step counts for agents such as tool calls per run and how often the loop hits its budget. You monitor by putting alerts on these — a latency spike, an error-rate climb, or a cost-per-request jump. Quality signals such as refusal rate or failed-tool-call rate are useful too, but correctness itself belongs to evaluation, not monitoring.

What are the best LLM observability tools?

There is no single best; the choice depends on whether you want to self-host, how much you value an open standard, and whether LLM observability needs to sit next to existing infrastructure monitoring. Representative options include the open-source Langfuse, the commercial LangSmith, the source-available Arize Phoenix, and Datadog LLM Observability inside a wider APM platform. OpenTelemetry's GenAI conventions are the instrumentation standard these can ingest. Verify licensing and features against each project's own docs.

Does OpenTelemetry support LLMs?

Yes. OpenTelemetry has GenAI semantic conventions — a common vocabulary of gen_ai.* attributes for model calls, token-usage metrics, and tool and agent spans — so an instrumented app can emit to any compatible backend. As of mid-2026 these conventions are still experimental, marked Development status, so attribute names can change and you should pin a version. They are nonetheless already widely adopted, and several observability tools speak them natively.

Sources & provenance

Licensing and feature sets change; treat the tool table as a starting map, not a guarantee, and confirm against each vendor's live docs before building. Corrections: hello@aiarch.dev.
