Interview prep, decoded

AI engineer interview questions (with senior-level answers)

Q: What is asked in the system design round?

You are given an open prompt such as design a support agent over our docs that can take limited actions. A strong answer clarifies scope and constraints first, sketches a bounded agentic loop with least-privilege tools, states a per-step model and cost strategy, makes evaluation a component, draws the trust boundary for untrusted input, and closes on observability and fallbacks. The grading is on scoping and justified tradeoffs, not a single right design.

By Wibo · Amsterdam Published 26 Jun 2026 Last updated 26 Jun 2026 ~9 min read

Short answer

AI engineer interviews test six things: LLM and agent fundamentals, RAG and retrieval, evaluation, cost and model selection, agentic system design, and production safety — plus a behavioural round on judgement and shipping. They are mostly not machine-learning interviews. You are being checked on whether you can build, evaluate, and operate systems on top of existing models — reason about a non-deterministic, metered, tool-calling component the way a senior engineer reasons about any other dependency.

Below are the questions that actually come up, grouped by round, each with a concise model answer and a note on what the interviewer is really probing for. The strongest signal in every answer is the same: you have shipped something real and you can defend the tradeoffs.

What AI engineer interviews test

Most loops are four to six rounds. The titles vary, but the underlying categories are stable — and knowing which category a question belongs to tells you what a strong answer looks like. Map your prep to these:

Category	What it probes	A strong answer shows
Fundamentals	Mental model of LLMs, tokens, context, the agentic loop.	You reason about non-determinism and cost, not buzzwords.
Agents, RAG & tools	Retrieval design, tool use, MCP, when an agent is warranted.	You know when not to use RAG or an agent.
Evals	Measuring correctness when outputs vary.	You have run an eval suite, not just eyeballed outputs.
Cost & model selection	Routing, caching, blended cost at production volume.	You can do the per-token arithmetic out loud.
System design	Designing an agentic feature end to end under constraints.	You scope, name failure modes, and justify the pattern.
Production & safety	Trust boundary, prompt injection, observability, governance.	You treat the model as untrusted input, not magic.
Behavioural	Judgement, shipping, working with ambiguity.	Specific stories with a decision and an outcome.

For where this role sits and what it pays, see the AI roles, decoded; to build the prerequisites first, how to become an AI engineer.

Fundamentals questions

The opening round checks your mental model. Shallow answers recite definitions; strong ones reason about consequences.

Q: Why is a temperature-zero LLM call still not deterministic in practice? Even at temperature 0 you are taking the argmax, but floating-point non-associativity across GPU batches, kernel and hardware differences, and model or routing updates on the provider side mean identical inputs can yield different tokens. The senior point: design as if every call can vary, because you cannot pin the provider's stack. That is why evals and tolerant assertions exist.

Probing for: whether you treat the model as a controllable function or as a dependency you must defend against.

Q: What is a token, and why should you care about it? A token is a sub-word unit; roughly four characters or 0.75 words in English. You care because it is the unit of three things at once: the context-window limit, latency (output tokens are generated serially), and cost (priced per million input and output tokens, usually with output more expensive). Almost every optimisation — caching, truncation, model choice — is a token-budget decision.

Probing for: that you connect the abstraction to cost and latency, not just the definition.

Q: What is the difference between the context window and memory? The context window is the fixed token budget the model sees on a single call — it has no persistence. Memory is something you build around the model: summarising prior turns, storing state in a database, or retrieving relevant history and re-injecting it. The model never remembers; your system does. Conflating the two is a classic junior mistake.

Probing for: that you understand the model is stateless and the engineering is in the surrounding system.

Agents, RAG & tools questions

This is the largest band of questions and the one where building experience shows fastest.

Q: When would you use RAG, and when would you not? Use retrieval when answers depend on knowledge that is large, changing, private, or must be cited — you fetch the relevant chunks and ground the generation in them. Do not reach for RAG when the knowledge fits in the prompt, when the task is reasoning rather than recall, or when fine-tuning or a tool call is a better fit. The failure mode to name: retrieval that returns plausible-but-wrong context still produces confident wrong answers, so retrieval quality, not the model, is usually the bottleneck.

Probing for: judgement about when retrieval is the wrong tool — not just that you can describe the pipeline.

Q: Walk me through a RAG pipeline and where it breaks. Chunk and embed the corpus into a vector store; at query time embed the query, retrieve the top-k nearest chunks (often with a re-ranking pass), assemble them into the prompt with the question, and generate a grounded, cited answer. It breaks at retrieval far more than generation: bad chunking, poor embeddings, k too small or too large, no re-ranker, and stale indexes. Measure retrieval separately (recall, precision) from answer quality, or you will tune the wrong stage.

Probing for: that you can isolate and measure each stage rather than treating it as one black box.

Q: When is an agent warranted instead of a single prompt or a fixed chain? Use an agent — a model in a loop that decides which tool to call next — when the steps are not knowable in advance and the path depends on intermediate results. If the workflow is fixed, a deterministic chain is cheaper, faster, and easier to test. Agents buy flexibility at the cost of latency, token spend, and unpredictability, so you bound the loop: a step cap, a budget, and a stop condition. The senior instinct is to reach for the simplest thing that works and only add agency when the task genuinely needs it.

Probing for: restraint — that you do not over-engineer an agent where a chain suffices. See our guide to evaluating an LLM agent for how to keep one honest.

Q: What is MCP and why does it matter for tool use? The Model Context Protocol is an open standard for exposing tools, data, and prompts to a model through a uniform interface, so an integration written once works across clients instead of being re-glued per app. It matters because it turns tool wiring into a reusable boundary and makes least-privilege tool design explicit — which tools a given agent may call becomes a configuration decision, not buried code.

Probing for: awareness of how tool ecosystems are standardising, and that you think about tool permissions as a boundary.

Evals, cost & model selection questions

These two rounds separate people who have shipped from people who have demoed.

Q: How do you evaluate a system whose output is non-deterministic? Build an eval set of representative inputs with known-good criteria, then score programmatically: exact or fuzzy match for closed answers, assertion checks for structured output, and an LLM-as-judge with a rubric for open-ended quality — validated against human labels so you trust the judge. Track the suite over time, gate releases on it, and treat regressions like failing tests. The headline: you replace vibes with a measurable, repeatable signal.

Probing for: that you have actually run evals and understand their limits (judge bias, set drift), not that you can name the technique.

Q: A feature costs too much per request. How do you bring it down? Work the token budget. Route cheap steps (classification, routing) to a small model and reserve the strong model for hard reasoning; use prompt caching for the stable system-prompt prefix; trim retrieved context and history; shorten outputs and request structured formats; and batch where latency allows. Then measure blended cost per request at real volume, not list price for one call. The discipline is to quantify before optimising.

Probing for: a structured cost model and the routing instinct. See evals to confirm cheaper choices do not drop quality.

Q: How do you choose between a large and a small model for a step? Start from the task's difficulty and the cost of being wrong. Prototype the hard path on the strongest model to establish the quality ceiling, then try to move each step down to the cheapest model that still passes your evals. Cheap models for routing, extraction, and formatting; strong models for multi-step reasoning and ambiguous judgement. The answer is per-step, driven by evals and cost, never one model for everything.

Probing for: that model selection is an evidence-based, per-step decision rather than a brand preference.

System design questions

The AI system design round is the closest analogue to a classic design interview, adapted for a probabilistic, metered core. A common prompt: design a customer-support agent that can answer from our docs and take limited actions. A strong walkthrough does this:

Clarify scope and constraints first. Volume, latency budget, cost ceiling, which actions are allowed, and the tolerance for a wrong answer. Never start drawing boxes before you know the constraints.
Sketch the loop. Retrieval over the docs for grounding, a bounded agent for multi-step requests, an explicit set of tools with least privilege, and a stop condition. Justify each — especially why a bounded agent over a fixed chain.
Name the model strategy. Which model per step and the blended cost at the stated volume, with prompt caching on the system prefix.
Make evaluation a component, not an afterthought. An eval set, a judge with a rubric, and a release gate.
Draw the trust boundary. Retrieved content and user input are untrusted; tools that take actions need authorisation and confirmation; log every tool call for audit.
Close on operations. Observability, cost alerting, a fallback when the model or a tool fails, and a human-in-the-loop path for low-confidence cases.

The grading is not whether your design is "right" — it is whether you scope before designing, justify the pattern, and surface failure modes unprompted. The reusable patterns behind this round are in the curriculum's design track.

Production & safety questions

The round that catches people who have only built prototypes. It centres on one idea: the model and everything it ingests are untrusted.

Q: How do you defend against prompt injection? Treat all model-visible content — user input, retrieved documents, tool outputs, web pages — as untrusted instructions that may try to hijack the agent. There is no single fix; you layer: least-privilege tools so a hijack cannot do much, human confirmation on consequential actions, separating trusted instructions from untrusted data in the prompt, output validation and allow-lists, and monitoring for anomalous tool calls. The honest framing is risk reduction, not a guarantee — and that honesty is itself the signal.

Probing for: that you think in trust boundaries and defence-in-depth, and that you do not claim a silver bullet.

Q: What do you log and monitor for an LLM feature in production? Inputs and outputs (with PII handling), token usage and cost per request, latency, tool calls and their results, eval scores on sampled traffic, and refusal or error rates. You want to catch quality drift, cost spikes, and abuse early, and you need full traces to debug a non-deterministic system after the fact. Observability is not optional here — without traces a bad answer is unreproducible.

Probing for: operational maturity — that you have run one of these in production, not just shipped it once.

Q: How do you handle a model that hallucinates in a user-facing flow? Reduce it and contain it. Ground answers in retrieval, ask for citations and verify them, constrain output to validated structures, and lower the cost of error with confidence thresholds that route uncertain cases to a human or a safe fallback. Then measure the residual rate with evals and set an acceptable bar for the use case. You never promise zero; you engineer the consequence of being wrong down to acceptable.

Probing for: that you manage hallucination as a measurable risk with containment, not as a bug to be fully eliminated.

How to prepare

The fastest preparation is not flashcards — it is having built something you can defend. A focused plan:

Ship one real agent. A tool-calling system end to end, with retrieval, an eval suite, and a cost model. Every round above gets easier when you can answer from a thing you actually built. See how to become an AI engineer for the build path.
Run a real eval suite. Even a small one. Being able to say "I gated releases on a 50-case eval set with an LLM judge validated against human labels" outscores any definition.
Do the cost arithmetic out loud. Practise estimating blended cost per request and where you would route to save money. Interviewers love watching you reason about tokens live.
Rehearse one system design end to end. Scope, loop, models, evals, trust boundary, operations — in that order, out loud, timed.
Prepare three behavioural stories with a real decision and outcome: a tradeoff you made, a thing you shipped, and a failure you learned from.
Tighten the artefacts. A resume that leads with shipped systems and a repo an interviewer can read beat a list of courses.

This is the same loop AI Architect Academy is built around: build a production system, evaluate it, cost it, and write the rationale — which is exactly what the interview is testing.

Frequently asked questions

What questions are asked in an AI engineer interview?

They cluster into six technical areas plus a behavioural round: LLM and agent fundamentals (tokens, context, non-determinism), RAG and retrieval, evaluation, cost and model selection, agentic system design, and production safety such as prompt injection and observability. Most are applied — "design this", "how would you measure that", "bring this cost down" — rather than trivia.

How do I prepare for an AI engineer interview?

Build and ship one real tool-calling agent with retrieval, an eval suite, and a cost model, because most answers get easier when you can speak from something you built. Then practise the cost arithmetic out loud, rehearse one system design end to end, and prepare three specific behavioural stories with a decision and an outcome.

What is asked in the system design round?

You are given an open prompt such as "design a support agent over our docs that can take limited actions." A strong answer clarifies scope and constraints first, sketches a bounded agentic loop with least-privilege tools, states a per-step model and cost strategy, makes evaluation a component, draws the trust boundary for untrusted input, and closes on observability and fallbacks. The grading is on scoping and justified tradeoffs, not a single right design.

Do AI engineer interviews require machine learning?

Mostly no. AI engineering is about building systems on top of existing models, so the loops centre on agents, retrieval, evals, cost, and production rather than training models or deriving gradients. Some teams add light ML-concept questions, but deep ML or model-training depth is rarely the bar for an applied AI engineer role.

What should I build before interviewing?

A real, end-to-end agent: a tool-calling system grounded in retrieval, with a small eval suite gating quality and a cost model showing per-request economics and where you route to a cheaper model. A notebook demo is not enough — interviewers want evidence you handled evals, cost, and the trust boundary, which is where production reality lives.

How is an AI engineer interview different from a software engineer one?

It keeps the software-engineering core — coding, system design, behavioural — but adds a probabilistic, metered, tool-calling component. You are tested on reasoning about non-determinism, measuring correctness with evals instead of unit tests alone, budgeting tokens and cost, and defending a trust boundary against prompt injection. The systems-design instincts transfer; the AI-native layer is what is new.

Sources & provenance

Question categories and model answers synthesized from 2026 AI-engineering interview loops and AI Architect Academy's backward-designed curriculum (docs/CURRICULUM.md, docs/DESIGN.md).
Technical claims (tokens, RAG stages, agentic loop bounds, prompt-injection defence-in-depth, LLM-as-judge) reflect current public Anthropic, AWS, and Cloudflare engineering guidance.
Interview structure is directional — loop length, round names, and weighting vary by employer and seniority.

Interview formats are not standardized and vary by company — treat this as a map of what is commonly tested, not a guarantee of any specific loop. Corrections: hello@aiarch.dev.

Walk into the interview with a shipped system, not a reading list.

AI Architect Academy teaches agentic design, evals, cost-modelling, safety, and deployment as first-class skills — and makes you build and defend a real production AI system, which is exactly what these interviews test. The build is the curriculum.

Try a sample lesson free → Browse the curriculum

Free sample — no signup · every claim cited · cancel anytime