Agents that use the web
Browser agents: how AI agents drive the browser
A browser agent is an AI agent that perceives and acts on web pages the way a person does — it navigates, clicks, types, and extracts data to complete a goal in a real browser. It runs the same agentic loop any agent does, but its tools are browser actions, and it perceives the page two ways: by reading the DOM or accessibility tree (DOM control) or by looking at screenshots with a vision model (computer use).
This page covers what a browser agent is, the two ways they perceive and act, where they earn their keep, the real limits, and how to build and evaluate one. For the broader idea of goal-pursuing agents, see what is agentic AI; for the system around the model, see agentic AI architecture.
What is a browser agent
A browser agent is an AI agent whose action space is a web browser. Give it a goal — "find the cheapest flight on these dates", "fill in this onboarding form", "pull every invoice from this portal" — and it works the page in a loop: read the current state, decide the next action, execute it, observe the result, repeat until the goal is met or a budget runs out. The model is the reasoning core; the browser is the tool layer.
The term overlaps with web agents and AI browser automation, and it sits next to the broader idea of computer use — an agent driving a whole desktop, of which the browser is one (very common) target. What makes a browser agent distinct from classic automation is that it reasons about the page at run time instead of replaying a fixed script: when a layout changes or an unexpected modal appears, it adapts rather than breaks.
How browser agents work: computer use vs DOM control
Every browser agent has to answer two questions on each turn — what is on the page? (perception) and what should I do? (action). There are two dominant approaches, and the difference is mostly in perception.
| Approach | How it perceives | How it acts | Tradeoff |
|---|---|---|---|
| Computer use (vision) | Screenshots of the rendered page, read by a vision model. | Pixel-level mouse and keyboard — move, click at coordinates, type, scroll. | Works on any UI, including canvas and images; slower, costlier, and can misread coordinates. |
| DOM control (structured) | The DOM or accessibility tree — element roles, names, and references as text. | Targets elements directly (click this ref, fill this field) via a driver like Playwright. | Deterministic, token-efficient, fast; blind to anything not exposed in the DOM. |
Computer use is the approach Anthropic ships as a Claude tool: the model receives screenshots and emits mouse and keyboard actions inside a sandbox you control, with the tool schema built into the model rather than supplied by you (Anthropic, Computer use tool). OpenAI took the same screenshot-and-act route with its Computer-Using Agent, first shipped as the Operator research preview and since folded into ChatGPT Agent and exposed through the Agents SDK (OpenAI, Computer-Using Agent; OpenAI, Introducing Operator).
DOM control drives the page through a structured representation instead of pixels. Playwright's MCP server, for example, hands the agent the accessibility tree — a semantic, text-based snapshot of roles, labels, and states — so the agent can act without a vision model and at a fraction of the tokens (Playwright; microsoft/playwright). Open-source Browser Use blends both: it lets an LLM drive a real Chromium browser from the page's structure rather than brittle selectors, reaches an 89.1% success rate on the WebVoyager benchmark, and works with whatever model provider you point it at. In practice many production agents combine the two — DOM control as the fast default, vision as the fallback when the structure isn't enough.
What browser agents are good for
Browser agents shine where the work is genuinely web-shaped, repetitive, and the target has no clean API. A few categories carry most of the value:
| Use case | What the agent does | Why an agent over a script |
|---|---|---|
| QA & end-to-end testing | Drives user journeys, asserts outcomes, reports failures. | Adapts to UI changes that would break a hardcoded selector. |
| Research & data gathering | Navigates sites, reads results, extracts and structures data. | Handles sites it has never seen without bespoke scrapers. |
| Web RPA / back-office | Logs into portals, fills forms, moves records between systems. | Automates legacy apps and portals that expose no API. |
| Task completion | Books, orders, or files on a user's behalf, end to end. | Composes multi-step flows from a single natural-language goal. |
The common thread: the page is the interface, the steps vary run to run, and there is no integration you could call instead. Where a stable API exists, call the API — it is faster, cheaper, and more reliable than driving its front end.
Limits and risks: reliability and security
Browser agents are the part of the agent world where the demo-to-production gap is widest. Four limits decide whether one survives contact with real users:
- Reliability. Long web flows still fail often. Standalone Operator was retired in part because real-world tasks like completing a checkout — with complex JavaScript, CAPTCHAs, and session quirks — landed below a 50% success rate (OpenAI, Introducing Operator). Benchmarks have climbed since, but a single missed step can void a whole run, so per-step accuracy compounds badly over long tasks.
- Latency. Every turn is perceive → reason → act → re-render. Screenshot-based perception adds image tokens and round-trips, so a vision agent can take seconds per step and minutes per task — fine for back-office batches, painful for anything interactive.
- Cost. A loop that sends a fresh screenshot and calls a strong model every turn is expensive by default. DOM control is markedly cheaper per step, which is one reason it is the common default. See how to evaluate an LLM agent for pinning quality against that cost.
- Security. This is the sharp edge. A browser agent reads untrusted web content and can take real actions, which makes prompt injection a first-class threat: a page can carry instructions that hijack the agent. Anthropic runs classifiers that flag likely injections in screenshots and steer the model to ask for confirmation before acting (Anthropic, Computer use tool). Treat it as a design constraint: least-privilege scopes, human-in-the-loop on irreversible actions, and a sandboxed browser — not a feature to bolt on later.
Building and evaluating a browser agent
Building one is the standard agentic loop with browser tools, plus the guardrails the threat surface demands:
- Pick perception first. Default to DOM control for speed, cost, and determinism; reach for computer-use vision only where the DOM doesn't carry the information (canvas, images, deeply custom widgets). A blend is common.
- Bound the loop. Cap turns and actions with a budget and an escalation path, so a confused agent stops instead of spinning. This is the same discipline as any agent — see the bounded agentic loop.
- Sandbox and scope. Run in an isolated browser with least-privilege credentials, and gate irreversible actions (purchases, deletes, sends) behind human confirmation.
- Defend against injection. Assume page content is hostile; separate instructions from observed content, and prefer agents that flag and confirm on suspicious input.
Evaluation is where browser agents are won or lost, because the same task can succeed one run and fail the next. You pin quality the way the field does — task-completion benchmarks (WebVoyager, WebArena) plus your own end-to-end suite of representative flows, scored on success rate, steps to completion, cost per task, and safety violations. Spot-checking a happy path is not evaluation. The full method is in how to evaluate an LLM agent.
Frequently asked questions
What is a browser agent?
A browser agent is an AI agent whose tools are browser actions. It pursues a goal in a real browser by navigating, clicking, typing, and extracting data in a loop — reading the page state, deciding the next action, executing it, and observing the result until the goal is met. Unlike a fixed automation script, it reasons about the page at run time, so it adapts when layouts change.
How do browser agents work?
They run an agentic loop over two questions per turn: what is on the page, and what to do next. Perception comes either from screenshots read by a vision model (computer use) or from the DOM and accessibility tree as text (DOM control). Actions are mouse and keyboard events or direct element targeting via a driver like Playwright. The model plans; the browser executes; the result feeds back in.
What is computer use?
Computer use is the approach where a model perceives a screen through screenshots and acts with pixel-level mouse and keyboard control, like a person. Anthropic ships it as a Claude tool with the schema built into the model; OpenAI's Computer-Using Agent took the same route. A browser agent is the common case of computer use aimed at a browser, though computer-use agents can also drive desktop apps and the command line.
What are browser agents used for?
The strongest cases are web-shaped, repetitive work with no clean API: end-to-end QA and testing, research and data gathering across sites, web RPA and back-office tasks on legacy portals, and end-to-end task completion like booking or ordering. Where a stable API exists, call it instead — driving its front end is slower, costlier, and less reliable.
Are browser agents reliable and safe?
Reliability on long, real-world flows is still limited — per-step errors compound, and complex checkouts, CAPTCHAs, and session quirks remain hard. Safety is the bigger concern: because the agent reads untrusted pages and can act, prompt injection is a first-class threat. Mitigate with least-privilege scopes, a sandboxed browser, human-in-the-loop on irreversible actions, and injection classifiers that confirm before acting.
How do you build a browser agent?
Run the standard agentic loop with browser tools. Choose perception first — DOM control by default for speed and cost, vision where the DOM falls short. Bound the loop with turn and action budgets, sandbox the browser, scope credentials to least privilege, gate irreversible actions behind confirmation, and defend against prompt injection. Then evaluate with task-completion benchmarks and your own end-to-end suite, not happy-path spot checks.
- Computer-use tool behaviour (screenshots, mouse/keyboard, built-in schema, prompt-injection classifiers): Anthropic, Computer use tool (platform docs).
- Computer-Using Agent and the Operator-to-ChatGPT-Agent path, including the sub-50% real-world success figure: OpenAI, Computer-Using Agent; OpenAI, Introducing Operator.
- DOM / accessibility-tree control and Playwright MCP: Playwright; microsoft/playwright.
- Open-source Chromium-driven agent and the 89.1% WebVoyager success rate: browser-use/browser-use.
Browser-agent capabilities and benchmark numbers move fast; verify current figures and API shapes against each vendor's live docs before building. Corrections: hello@aiarch.dev.
Learn to build agents that act, not just answer.
AI Architect Academy teaches the agentic loop, tool design, guardrails, and evaluation as first-class skills — on a platform that is itself a production agentic system built across Anthropic, AWS, and Cloudflare. The build is the curriculum.
Free sample — no signup · every claim cited · cancel anytime
Or get notified when new tracks ship.