For DevOps, SRE, and cloud engineers

Your ops instincts are the moat: running agents in production

By Wibo · Amsterdam · Published 13 Jun 2026 · Last updated 26 Jul 2026 · ~7 min read

Short answer

The hard part of agent systems isn't getting a demo to work — it's running one reliably, affordably, and safely in production. That is exactly the gap prompt-first newcomers can't fill and exactly what you already do: observability, cost control, reliability, security, deployment.

Moving into AI as a DevOps/SRE/cloud engineer isn't starting over. It's pointing instincts you already have at a new kind of workload — one that's probabilistic, calls tools, and can take real actions.

The tracing instincts this page maps onto agent systems are layer 8 of the AI quality stack: observability — the layer the other seven can't substitute for. An eval suite tells you a system should work; only a trace tells you why a specific production run didn't, which is exactly the debugging reflex that transfers straight from an ops background.

Part of the AI quality stack — the layered gate chain for knowing your LLM is delivering quality.

The map: what you do already → what it becomes

Your current discipline	Its agent-system equivalent
Monitoring & tracing	Agent tracing: every step, tool call, token count, and latency in a run; spotting where a loop went wrong
Capacity & cost management	Token cost-modeling, model routing (Opus → Sonnet → Haiku), prompt caching, per-route cost in an AI Gateway
Reliability / SLOs / error budgets	Evals as the test suite for non-deterministic systems: does it reach a correct outcome across runs?
Rate limiting & circuit breakers	Turn/tool-call budgets, done-conditions, and kill switches that stop a runaway agent
IAM & least privilege	Least-privilege tools, scoped credentials, human-in-the-loop on irreversible actions
Incident response	Detecting and containing excessive agency, prompt injection, and tool misuse

The four things that actually matter in production

1. Evals — the test suite for probabilistic systems

You can't unit-test an agent to correctness; the same input can take a different path. Evals replace that: a scored, repeatable suite that asks "did it reach an acceptable outcome, across runs?" Designing and running evals is the single biggest "this person has actually shipped with LLMs" signal in hiring. It's also just SLOs and test coverage, reframed for non-determinism.
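A minimal sketch of that idea, assuming you have already recorded the outputs of several runs of the same case. The types and names here (EvalCase, scoreCase, suitePasses) are hypothetical and not taken from any particular eval framework; the point is only that the gate is a pass rate across runs rather than a single green check.

```typescript
// Hypothetical types for illustration; not from a specific eval framework.
interface EvalCase {
  id: string;
  // Judge one recorded output; true means "acceptable outcome".
  accept: (output: string) => boolean;
}

interface CaseScore {
  id: string;
  runs: number;
  passes: number;
  passRate: number;
}

// Score the recorded outputs of repeated runs of one case.
function scoreCase(evalCase: EvalCase, outputs: string[]): CaseScore {
  const passes = outputs.filter((o) => evalCase.accept(o)).length;
  const runs = outputs.length;
  return { id: evalCase.id, runs, passes, passRate: runs === 0 ? 0 : passes / runs };
}

// Gate the suite like an SLO: every case must meet the minimum pass rate.
function suitePasses(scores: CaseScore[], minPassRate: number): boolean {
  return scores.length > 0 && scores.every((s) => s.passRate >= minPassRate);
}

// Usage: four recorded runs of the same prompt, one of which went off course.
const refundCase: EvalCase = {
  id: "refund-lookup",
  accept: (output) => /refund issued/.test(output),
};
const refundScore = scoreCase(refundCase, [
  "refund issued for order 1",
  "refund issued for order 1",
  "I could not find that order",
  "refund issued for order 1",
]);
// Three of four runs pass, so this case clears a 0.7 gate and fails a 0.9 gate.
```

Choosing the threshold is the same conversation as choosing an SLO target: it encodes how much non-determinism the product can tolerate for that route.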

2. Cost — the bill is a control-plane problem

Multi-turn agents re-send a growing context on every iteration, so input tokens dominate the bill. The levers are the ones you'd reach for instinctively: cache the static prefix (prompt caching bills cache-hit input tokens at a steep discount), route cheap work to a smaller model, cap turns, and instrument per-route cost so you can see the hotspot. A single inference call is usually cheap; uncontrolled loops are what bite.
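To make the arithmetic concrete, here is a rough cost model for the input side of one agent run. Everything in it is an assumption for illustration: the type names, the shape of the run, and especially the prices and cache multipliers, which are placeholders rather than any provider's current rates. It also assumes the cached prefix stays warm for the whole run.

```typescript
interface RunShape {
  turns: number;              // model calls in one agent run
  prefixTokens: number;       // static system prompt + tool definitions
  tokensAddedPerTurn: number; // history growth per turn (tool results, replies)
}

interface Pricing {
  inputPerMTok: number;         // price per million uncached input tokens
  cacheReadMultiplier: number;  // fraction of the input price for a cache hit
  cacheWriteMultiplier: number; // multiple of the input price to write the cache
}

// Input cost of one run: the prefix is re-sent every turn, the history grows.
function inputCost(run: RunShape, price: Pricing, cachePrefix: boolean): number {
  let tokenUnits = 0; // input tokens weighted by their billing multiplier
  for (let turn = 0; turn < run.turns; turn++) {
    const history = turn * run.tokensAddedPerTurn;
    let prefixWeight = 1;
    if (cachePrefix) {
      // First turn writes the cache; later turns read it.
      prefixWeight = turn === 0 ? price.cacheWriteMultiplier : price.cacheReadMultiplier;
    }
    tokenUnits += run.prefixTokens * prefixWeight + history;
  }
  return (tokenUnits / 1e6) * price.inputPerMTok;
}

// Illustrative numbers only: 10 turns, 8k-token prefix, 1k new tokens per turn.
const exampleRun: RunShape = { turns: 10, prefixTokens: 8000, tokensAddedPerTurn: 1000 };
const examplePricing: Pricing = {
  inputPerMTok: 3,
  cacheReadMultiplier: 0.1,
  cacheWriteMultiplier: 1.25,
};
const uncachedCost = inputCost(exampleRun, examplePricing, false); // about 0.375
const cachedCost = inputCost(exampleRun, examplePricing, true);    // about 0.187
```

With these placeholder numbers, caching the prefix roughly halves the input cost of the run. Doubling the turn count more than doubles it, because the history term grows quadratically, which is why a turn cap is a cost control as much as a safety one.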

3. Safety — bound the agency

The headline risk is OWASP LLM06, Excessive Agency: an agent with too much power and too little supervision. The mitigations are least privilege, human-in-the-loop on high-risk or irreversible actions, explicit budgets and done-conditions, and a kill switch. If you've ever scoped an IAM role or put a circuit breaker on a dependency, you already think this way.
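Those four mitigations fit in a small harness around the loop. The sketch below is a simplified, synchronous illustration under stated assumptions: runBounded, ToolPolicy, and the stop-reason names are hypothetical, the model is stood in for by a nextAction callback, and tool execution is omitted. Note that the harness, not the model, decides which tools exist and which need approval.

```typescript
type Action = { kind: "tool"; tool: string } | { kind: "done" };
type ToolPolicy = "auto" | "needs_approval";
type StopReason = "done" | "turn_budget" | "tool_not_allowed" | "approval_denied" | "killed";

interface Guardrails {
  maxTurns: number;                       // explicit budget
  toolPolicy: Record<string, ToolPolicy>; // least privilege: unlisted tools are denied
  approve: (tool: string) => boolean;     // human-in-the-loop hook
  killed: () => boolean;                  // kill switch, e.g. backed by a feature flag
}

function runBounded(
  nextAction: (turn: number) => Action,
  g: Guardrails,
): { stop: StopReason; turns: number } {
  for (let turn = 0; turn < g.maxTurns; turn++) {
    if (g.killed()) return { stop: "killed", turns: turn };
    const action = nextAction(turn);
    if (action.kind === "done") return { stop: "done", turns: turn }; // done-condition
    if (!Object.prototype.hasOwnProperty.call(g.toolPolicy, action.tool)) {
      return { stop: "tool_not_allowed", turns: turn };
    }
    if (g.toolPolicy[action.tool] === "needs_approval" && !g.approve(action.tool)) {
      return { stop: "approval_denied", turns: turn };
    }
    // A real harness would execute the tool here and feed the result back.
  }
  return { stop: "turn_budget", turns: g.maxTurns };
}

// Usage: an agent that never declares itself done is stopped by the budget.
const demoGuardrails: Guardrails = {
  maxTurns: 5,
  toolPolicy: { search: "auto", send_refund: "needs_approval" },
  approve: () => false, // nobody approved anything in this demo
  killed: () => false,
};
const runawayResult = runBounded(() => ({ kind: "tool", tool: "search" }), demoGuardrails);
// runawayResult.stop is "turn_budget" after 5 turns.
```

Returning a stop reason from every exit path is deliberate: it is what lets a trace distinguish an agent that finished from one that was cut off.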

We shipped a guardrail misconfiguration on this exact stack, and it is worth naming because it is the kind of thing a demo never surfaces. This platform's own coach routes through an authenticated Cloudflare AI Gateway in front of OpenRouter, with the Gateway's content guardrails turned on. We first set them to block mode, and the block rule fired on the site's own prompt-injection lesson: a legitimate security question that happened to contain the kind of language guardrails are built to catch. Verified live: the same question passes cleanly under flag mode and is blocked (HTTP 424) under block mode. That one hazard category now stays in flag mode; the other thirteen still block, on both the prompt and the response side. That is a logged decision, not an oversight. A security curriculum has to be able to talk about the attacks it teaches, and the narrowest change that allows it is one category, not the whole gateway.

A second, unrelated production bug bit the same code path. Amazon Bedrock serving Claude Sonnet 5 enables extended thinking by default, and those reasoning tokens count against max_tokens, so a hard-thinking turn could exhaust its budget and return no visible reply at all. The fix is one line in buildChatBody in src/lib/llm.ts: reasoning: { effort: 'none' } on every Anthropic-slug request. Both bugs were invisible until we ran real traffic through the gateway; see the defense-in-depth pattern for how the budget/guardrail/fallback layers stack.
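For readers who want to picture the second fix, here is a hedged reconstruction. Only the function name buildChatBody, the file path, and the reasoning: { effort: 'none' } field come from the text above; the signature, the ChatMessage and ChatBody types, the example slugs, and the assumption that Anthropic models are recognised by an "anthropic/" slug prefix are illustrative guesses, not the site's actual code.

```typescript
// Hypothetical reconstruction of the shape of the fix, not the real src/lib/llm.ts.
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface ChatBody {
  model: string;
  messages: ChatMessage[];
  max_tokens: number;
  reasoning?: { effort: "none" };
}

function buildChatBody(model: string, messages: ChatMessage[], maxTokens: number): ChatBody {
  const body: ChatBody = { model, messages, max_tokens: maxTokens };
  // Assumption: Anthropic models are identified by an "anthropic/" slug prefix.
  if (model.split("/")[0] === "anthropic") {
    // Turn extended thinking off so reasoning tokens cannot consume the
    // whole max_tokens budget and leave no visible reply.
    body.reasoning = { effort: "none" };
  }
  return body;
}

// Usage with placeholder slugs (not real model names).
const anthropicBody = buildChatBody(
  "anthropic/example-model",
  [{ role: "user", content: "Explain prompt injection." }],
  1024,
);
const otherBody = buildChatBody("openai/example-model", [], 256);
// anthropicBody carries the reasoning field; otherBody does not.
```

Putting the rule in the one function that builds every request body means a new call site cannot forget it, which is the same reasoning as enforcing a policy at the gateway rather than in each client.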

4. Observability — you can't operate what you can't see

Treat an agent run like a distributed trace: capture each step, tool input/output, token usage, latency, and the stop reason. When something goes wrong — wrong tool, runaway loop, silent failure — the trace is how you find it. Same discipline, new span type.
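As a sketch of what that span type might look like, and of one query you would run against it, consider the following. The field names and the findLoopStart heuristic are assumptions for illustration, not a tracing standard: it flags a run where the same tool is called with the same input several times in a row, which is one common signature of a runaway loop.

```typescript
// Hypothetical span and trace shapes for one agent run.
interface AgentSpan {
  step: number;
  tool: string | null; // null for a plain model turn
  toolInput: string;   // serialized tool arguments
  inputTokens: number;
  outputTokens: number;
  latencyMs: number;
}

interface AgentTrace {
  runId: string;
  spans: AgentSpan[];
  stopReason: string;
}

function totalTokens(trace: AgentTrace): number {
  return trace.spans.reduce((sum, s) => sum + s.inputTokens + s.outputTokens, 0);
}

// Return the step where a streak of identical consecutive tool calls began,
// or null if no streak reaches minRepeats.
function findLoopStart(trace: AgentTrace, minRepeats: number): number | null {
  let prev: AgentSpan | undefined;
  let streak = 0;
  let streakStartStep = 0;
  for (const span of trace.spans) {
    const repeats =
      prev !== undefined &&
      span.tool !== null &&
      span.tool === prev.tool &&
      span.toolInput === prev.toolInput;
    if (repeats) {
      streak++;
    } else {
      streak = 1;
      streakStartStep = span.step;
    }
    if (span.tool !== null && streak >= minRepeats) return streakStartStep;
    prev = span;
  }
  return null;
}

// Usage: a run that kept issuing the same search until the budget stopped it.
const loopingTrace: AgentTrace = {
  runId: "run-1",
  stopReason: "turn_budget",
  spans: [
    { step: 0, tool: null, toolInput: "", inputTokens: 1000, outputTokens: 50, latencyMs: 800 },
    { step: 1, tool: "search", toolInput: "q=a", inputTokens: 1100, outputTokens: 40, latencyMs: 900 },
    { step: 2, tool: "search", toolInput: "q=a", inputTokens: 1200, outputTokens: 40, latencyMs: 950 },
    { step: 3, tool: "search", toolInput: "q=a", inputTokens: 1300, outputTokens: 40, latencyMs: 990 },
  ],
};
const loopStep = findLoopStart(loopingTrace, 3); // the streak starts at step 1
const loopTokens = totalTokens(loopingTrace);    // 4770 tokens across the run
```

The stop reason and the loop start together answer the two questions an on-call engineer asks first: why did the run end, and where did it start going wrong.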

The mindset shift to make

The one habit to unlearn: the instinct to make everything deterministic and control every code path. In an agent, you hand the per-step decision to the model and your code becomes the harness around it. You stop chasing a deterministic path and start verifying outcomes with evals and bounding behavior with guardrails. Everything else you already know still applies.

AgentOps vs LLMOps: what's the difference?

LLMOps is the broader practice of operating LLM applications — prompts, model versions, evaluation, monitoring, and cost. AgentOps is the slice that deals with agents specifically: systems that call tools and take actions in a loop. The extra concerns are agency-shaped — bounding tool use, tracing multi-step runs, containing excessive agency, and stopping runaway loops. If LLMOps is "operate the model," AgentOps is "operate the thing the model is allowed to do." For a senior ops engineer, both map cleanly onto skills you already have; agents just raise the stakes on least-privilege and observability.

Why this is the defensible landing zone

Plenty of newcomers can write a clever prompt. Very few can take an agent that works in a notebook and make it observable, affordable, reliable, and safe enough to put in front of customers. That's the scarce skill, and it's where production systems break. It's also the part of the job you're already most of the way toward doing.

Sources & provenance

OWASP — Top 10 for LLM Applications (2025): LLM06 Excessive Agency and LLM01 Prompt Injection, for the safety framing.
Anthropic — Claude Agent SDK and agent-building guidance: tool use, the agentic loop, and prompt caching.
Course material: aiArch Track B (Agentic Systems) — agentic loop, evals, cost, safety.
Cloudflare AI Gateway docs: guardrails, spend limits, and dynamic routing with fallbacks.

Conceptual overview; specific API shapes and pricing change — verify against current provider docs before implementing. Sources checked 27 Jun 2026. Corrections: hello@aiarch.dev.
