AI Architect Academy

For DevOps, SRE, and cloud engineers

Your ops instincts are the moat: running agents in production

Short answer

The hard part of agent systems isn't getting a demo to work — it's running one reliably, affordably, and safely in production. That is exactly the gap prompt-first newcomers can't fill and exactly what you already do: observability, cost control, reliability, security, deployment.

Moving into AI as a DevOps/SRE/cloud engineer isn't starting over. It's pointing instincts you already have at a new kind of workload — one that's probabilistic, calls tools, and can take real actions. This is the most defensible landing zone in the whole field.

The map: what you do already → what it becomes

Your current discipline | Its agent-system equivalent
Monitoring & tracing | Agent tracing: every step, tool call, token count, and latency in a run; spotting where a loop went wrong
Capacity & cost management | Token cost-modeling, model routing (Opus → Sonnet → Haiku), prompt caching, per-route cost in an AI Gateway
Reliability / SLOs / error budgets | Evals as the test suite for non-deterministic systems: does it reach a correct outcome across runs?
Rate limiting & circuit breakers | Turn/tool-call budgets, done-conditions, and kill switches that stop a runaway agent
IAM & least privilege | Least-privilege tools, scoped credentials, human-in-the-loop on irreversible actions
Incident response | Detecting and containing excessive agency, prompt injection, and tool misuse

The four things that actually matter in production

1. Evals — the test suite for probabilistic systems

You can't unit-test an agent to correctness; the same input can take a different path. Evals replace that: a scored, repeatable suite that asks "did it reach an acceptable outcome, across runs?" Designing and running evals is the single biggest "this person has actually shipped with LLMs" signal in hiring. It's also just SLOs and test coverage, reframed for non-determinism.
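
As a rough illustration, an eval suite can be sketched as a set of outcome checks that are each run several times and scored by pass rate. Everything named here (`EvalCase`, `pass_rate`, `run_suite`, the `fake_agent` stand-in, the 0.9 threshold) is a hypothetical name chosen for this sketch, not part of any real eval framework.

```python
import random
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    name: str
    prompt: str
    # Checks the outcome, not the path the agent took to get there.
    is_acceptable: Callable[[str], bool]


def pass_rate(agent: Callable[[str], str], case: EvalCase, runs: int = 10) -> float:
    """Run one case several times and return the fraction of acceptable outcomes."""
    passed = sum(case.is_acceptable(agent(case.prompt)) for _ in range(runs))
    return passed / runs


def run_suite(agent, cases, runs: int = 10, threshold: float = 0.9) -> dict:
    """Return {case name: (pass rate, met threshold?)} for every case."""
    results = {}
    for case in cases:
        rate = pass_rate(agent, case, runs)
        results[case.name] = (rate, rate >= threshold)
    return results


# Stand-in for a real agent: the wording varies between runs, the outcome does not.
def fake_agent(prompt: str) -> str:
    return random.choice(["Refund issued: order 42", "order 42 refunded"])


cases = [
    EvalCase(
        name="refund",
        prompt="Refund order 42",
        is_acceptable=lambda out: "42" in out and "refund" in out.lower(),
    )
]
report = run_suite(fake_agent, cases, runs=20)
```

The threshold plays the role of an SLO target: a case that passes 17 runs out of 20 is a regression to investigate, not a flaky test to retry.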

2. Cost — the bill is a control-plane problem

Multi-turn agents re-send a growing context every iteration, so input tokens dominate and total input grows faster than linearly with the number of turns. The levers are the ones you'd reach for instinctively: cache the static prefix (with prompt caching, cached input is billed at a fraction of the normal input rate), route cheap work to a smaller model, cap turns, and instrument per-route cost so you can see the hotspot. A single inference call is usually cheap; uncontrolled loops are what bite.
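
The arithmetic behind that can be sketched in a few lines. This is a simplified model under stated assumptions: a fixed static prefix, a fixed number of tokens added per turn, and a single input price. The price and the cache multiplier below are placeholders for illustration, not real provider rates, and cache-write surcharges are ignored.

```python
def total_input_tokens(prefix: int, per_turn: int, turns: int) -> int:
    """Each turn re-sends the static prefix plus everything added in earlier turns."""
    return sum(prefix + i * per_turn for i in range(turns))


def input_cost(prefix, per_turn, turns, price_per_mtok, cached_read_multiplier=None):
    """Input cost for one run, in the same currency as price_per_mtok.

    If cached_read_multiplier is given, the static prefix is billed at that
    fraction of the normal rate on every turn after the first.
    """
    history_tokens = sum(i * per_turn for i in range(turns))
    if cached_read_multiplier is None:
        prefix_tokens = prefix * turns
    else:
        prefix_tokens = prefix + prefix * (turns - 1) * cached_read_multiplier
    return (prefix_tokens + history_tokens) * price_per_mtok / 1_000_000


PRICE = 3.0  # placeholder price per million input tokens, not a real rate

ten_turns = total_input_tokens(prefix=4000, per_turn=500, turns=10)     # 62500
twenty_turns = total_input_tokens(prefix=4000, per_turn=500, turns=20)  # 175000

uncached = input_cost(4000, 500, 10, PRICE)
cached = input_cost(4000, 500, 10, PRICE, cached_read_multiplier=0.1)
```

In this model, doubling the turn count from 10 to 20 raises input tokens by 2.8x, which is why a turn cap is a cost control as much as a safety control.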

3. Safety — bound the agency

The headline risk is OWASP LLM06, Excessive Agency: an agent with too much power and too little supervision. The mitigations are least privilege, human-in-the-loop on high-risk or irreversible actions, explicit budgets and done-conditions, and a kill switch. If you've ever scoped an IAM role or put a circuit breaker on a dependency, you already think this way.
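
One way to picture those mitigations is a small guard that every tool call must pass through before it executes. This is a minimal sketch, not a reference implementation: the `Guard` class, its exception types, and the tool names are all hypothetical, and a real system would also enforce the same limits at the credential level.

```python
from dataclasses import dataclass


class AgentHalted(Exception):
    """Raised when the kill switch is on or the call budget is spent."""


class ToolNotAllowed(Exception):
    """Raised when the agent asks for a tool outside its allow-list."""


class ApprovalRequired(Exception):
    """Raised when an irreversible tool is called without human sign-off."""


@dataclass
class Guard:
    allowed_tools: frozenset
    irreversible_tools: frozenset
    max_tool_calls: int = 20
    calls: int = 0
    killed: bool = False

    def authorize(self, tool: str, approved_by_human: bool = False) -> None:
        """Raise if the call must not proceed; otherwise count it against the budget."""
        if self.killed:
            raise AgentHalted("kill switch engaged")
        if tool not in self.allowed_tools:
            raise ToolNotAllowed(tool)
        if self.calls >= self.max_tool_calls:
            raise AgentHalted(f"tool-call budget of {self.max_tool_calls} spent")
        if tool in self.irreversible_tools and not approved_by_human:
            raise ApprovalRequired(tool)
        self.calls += 1


guard = Guard(
    allowed_tools=frozenset({"read_logs", "restart_service", "delete_volume"}),
    irreversible_tools=frozenset({"delete_volume"}),
    max_tool_calls=2,
)
guard.authorize("read_logs")  # allowed, and counted against the budget
```

The checks fail closed: anything not explicitly allowed, approved, and within budget raises before the tool runs.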

4. Observability — you can't operate what you can't see

Treat an agent run like a distributed trace: capture each step, tool input/output, token usage, latency, and the stop reason. When something goes wrong — wrong tool, runaway loop, silent failure — the trace is how you find it. Same discipline, new span type.
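
A minimal sketch of that trace shape, using only the Python standard library. The `Span` and `Trace` classes, their field names, and the `"done"` stop-reason label are assumptions made for this example; in practice you would more likely emit these as spans through whatever tracing stack you already run.

```python
import json
import time
from contextlib import contextmanager
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class Span:
    step: int
    tool: str
    tool_input: dict
    tool_output: str = ""
    input_tokens: int = 0
    output_tokens: int = 0
    latency_ms: float = 0.0
    error: Optional[str] = None


@dataclass
class Trace:
    run_id: str
    spans: list = field(default_factory=list)
    stop_reason: str = ""

    @contextmanager
    def span(self, tool: str, tool_input: dict):
        """Record one step; latency and any error are captured even on failure."""
        s = Span(step=len(self.spans) + 1, tool=tool, tool_input=tool_input)
        start = time.perf_counter()
        try:
            yield s
        except Exception as exc:
            s.error = repr(exc)
            raise
        finally:
            s.latency_ms = (time.perf_counter() - start) * 1000
            self.spans.append(s)

    def total_tokens(self) -> int:
        return sum(s.input_tokens + s.output_tokens for s in self.spans)


trace = Trace(run_id="demo-1")
with trace.span("search_logs", {"query": "5xx"}) as s:
    s.tool_output = "3 matches"
    s.input_tokens, s.output_tokens = 1200, 80
trace.stop_reason = "done"  # illustrative label for why the run ended

print(json.dumps(asdict(trace), indent=2))
```

Recording the stop reason alongside the spans is what lets you tell a run that finished from one that hit a budget or raised partway through.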

The mindset shift to make

The one habit to unlearn: the instinct to make everything deterministic and control every code path. In an agent, you hand the per-step decision to the model and your code becomes the harness around it. You stop chasing a deterministic path and start verifying outcomes with evals and bounding behavior with guardrails. Everything else you already know still applies.
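
In code, that split looks roughly like the loop below: the model chooses each step, and the harness only executes the choice, records the result, and enforces a stopping rule. This is a sketch under assumptions. `run_harness`, the `decide` callable, and the `(action, argument)` tuple format are hypothetical, and `scripted_policy` stands in for a real model call.

```python
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, str]


def run_harness(
    decide: Callable[[List[Step]], Step],
    tools: Dict[str, Callable[[str], str]],
    goal: str,
    max_turns: int = 8,
) -> dict:
    """Let `decide` pick each step; the harness executes it and bounds the loop."""
    history: List[Step] = [("goal", goal)]
    for turn in range(1, max_turns + 1):
        action, arg = decide(history)  # the per-step decision belongs to the model
        if action == "finish":
            return {"outcome": arg, "turns": turn, "stop_reason": "done"}
        result = tools[action](arg)  # the harness runs the chosen tool
        history.append((action, result))
    return {"outcome": None, "turns": max_turns, "stop_reason": "max_turns"}


# Stand-in for a model: look something up once, then report it.
def scripted_policy(history: List[Step]) -> Step:
    if len(history) == 1:
        return ("lookup", "disk usage")
    return ("finish", f"Report: {history[-1][1]}")


tools = {"lookup": lambda query: f"{query} is 91%"}
result = run_harness(scripted_policy, tools, goal="check disk")
```

Nothing in the harness knows which path the run will take; it only knows how a run is allowed to end.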

Why this is the defensible landing zone

Plenty of newcomers can write a clever prompt. Very few can take an agent that works in a notebook and make it observable, affordable, reliable, and safe enough to put in front of customers. That's the scarce skill, it's where production systems break, and it's the part of the job you're already most of the way to. Lead with it.

Sources & provenance
  • OWASP — Top 10 for LLM Applications (LLM06: Excessive Agency) for the safety framing.
  • Anthropic — guidance on building agents, tool use, and prompt caching (platform docs).
  • Course material: AI Architect Academy Track B (Agentic Systems) — agentic loop, evals, cost, safety.

This is a conceptual overview; specific API shapes and pricing change — verify against current provider docs before implementing. Corrections: hello@aiarch.dev.
