AI Architect Academy

The model controls the machine

Computer use: how AI models control a computer

Short answer

Computer use is a model capability: the AI perceives a computer screen through screenshots and issues mouse and keyboard actions to operate software the way a person does — across any application, not just a browser. It runs in a loop — look at the screen, decide the next action, execute it, look again — until the goal is met. The model supplies the perception and the decisions; your code supplies the sandbox and runs the actions.

This page covers what computer use is, how Anthropic's computer-use tool works, what it can and can't do today, where it earns its keep, and the real safety risks. For the browser-specific case of this capability, see browser agents; for the system around the model, see agentic AI architecture.

What is computer use

Computer use is the ability of an AI model to operate a computer through its graphical interface: it sees the screen as an image and acts by moving the cursor, clicking, typing, and scrolling. That is the whole idea — instead of calling an API or driving a script, the model uses the same surface a human does, so it can in principle operate any application that renders to a screen: a browser, a desktop app, a spreadsheet, a terminal, a legacy tool with no integration at all.

This is what makes it different from classic robotic process automation. A recorded RPA macro replays fixed coordinates and breaks the moment a layout shifts; a computer-use model reasons about what is on the screen at run time and adapts. It is also broader than a browser agent — a browser agent is computer use aimed at one (very common) target, the web browser, whereas computer use as a capability spans the whole desktop. The browser is the most valuable case, but not the only one.

How it works: the perceive-act loop

Every turn answers two questions: what is on the screen? (perception) and what should I do next? (action). Anthropic ships this as a Claude tool, and the cycle is the standard agentic loop:

  • Screenshot. Your application captures the current screen and sends it to the model as an image.
  • Reason. Claude reads the screenshot, plans, and emits a tool-use request — for example "click at [500, 300]" or "type this text" — returning a stop_reason of tool_use.
  • Act. Your code extracts the action, executes it in a sandboxed environment, and captures a fresh screenshot.
  • Repeat. The new screenshot goes back as a tool_result, and the loop continues until the task is done or a budget runs out.

A key detail for architects: the computer tool's schema is built into the model — you enable it by passing a tool of type computer_20251124 (with the screen dimensions) and the computer-use-2025-11-24 beta header, not by writing the action schema yourself (Anthropic, Computer use tool). The model never touches the machine directly; it only requests actions that your loop runs in an environment you control — Anthropic's reference implementation runs a virtual X11 display inside a Docker container with Firefox and LibreOffice preinstalled. The actions Claude can request are a small, fixed vocabulary:

Action groupExamplesWhat it does
Perceivescreenshot, cursor_position, zoomCapture the screen, read the cursor, inspect a region at full resolution.
Mouseleft_click, right_click, double_click, mouse_move, left_click_drag, scrollClick, move, drag, and scroll at pixel coordinates [x, y].
Keyboardtype, key, hold_keyType text and press keys or shortcuts, including modifier combinations.

OpenAI took the same screenshot-and-act route with its Computer-Using Agent, first shipped as the Operator research preview and since folded into ChatGPT Agent (OpenAI, Computer-Using Agent; OpenAI, Introducing Operator). The pattern is converging: perceive the pixels, emit low-level input events, close the loop.

What it can and can't do today

Computer use is real and improving — on WebArena, a benchmark for autonomous web navigation across real sites, Claude reports state-of-the-art results among single-agent systems (Anthropic, Computer use tool). But it is still a beta capability, and four limits decide whether one survives contact with production:

  • Latency. Every turn is screenshot → reason → act → re-render, and image perception adds tokens and round-trips. Anthropic's own guidance steers it toward tasks where speed isn't critical — background gathering, automated testing — not interactive, human-paced work.
  • Coordinate accuracy. The model can misread or hallucinate the exact pixel to click; clicking the wrong place is a real failure mode, which is why extended thinking and careful screenshot sizing help.
  • Reliability on long tasks. Per-step errors compound — niche apps, multiple apps at once, dropdowns, scrollbars, and complex spreadsheet selection remain hard, and a single missed step can void a whole run.
  • Cost. A loop that sends a fresh screenshot and calls a strong model every turn is expensive by default; budget and model choice matter. See LLM cost optimization.

The honest summary: computer use can complete multi-step tasks end to end in a controlled environment, but it is not yet a drop-in replacement for a reliable API or a human operator on anything irreversible. Treat it as a capable, supervised assistant, not unattended infrastructure.

Use cases

Computer use earns its keep where work is screen-shaped, repetitive, and the target has no clean API to call instead:

Use caseWhat the model doesWhy computer use over a script
Software testing / QADrives an app through real user journeys and checks the result on screen.Adapts to UI changes that break a hardcoded selector or coordinate.
RPA / back-officeLogs into portals, fills forms, moves records between systems.Automates legacy desktop and web apps that expose no API.
Research & data gatheringNavigates unfamiliar interfaces, reads results, extracts and structures data.Handles sources it has never seen without a bespoke scraper per site.
General agentsCompletes multi-step goals across several apps on a user's behalf.Composes a flow from a natural-language goal instead of a fixed runbook.

The common thread is the absence of a better interface. Where a stable API exists, call it — it is faster, cheaper, and more reliable than driving its front end. Computer use is the option of last resort that happens to be remarkably general.

Risks and safe deployment

This is the sharp edge, and it is a design constraint, not a feature to add later. A computer-use model reads untrusted screen content and can take real actions, which raises two first-class risks:

  • Prompt injection through the screen. Instructions hidden in a web page or an image the model screenshots can hijack it — Anthropic notes the model may follow commands found in content even when they conflict with the user's instructions. As a defense, classifiers run on the screenshots to flag likely injections and steer the model to ask for confirmation before acting (Anthropic, Computer use tool).
  • Unintended actions. A confused or hijacked model can click, type, or submit something irreversible — a purchase, a delete, a sent message — because it has the same reach a user does.

The mitigations are the same discipline any agent needs, applied strictly: run in a dedicated virtual machine or container with minimal privileges; keep sensitive data and credentials out of reach; restrict internet access to an allowlist of domains; and require human confirmation for any action with real-world consequences. Bound the loop with turn and action budgets so a spinning agent stops instead of running forever. The threat surface here is the agent-security problem in its most concrete form — designed for, not patched on.

Computer use vs browser agents

The two terms are often used interchangeably, but the distinction is clean and worth holding:

  • Computer use is the capability — perceive a screen, control mouse and keyboard — independent of which application is on that screen. It can drive a desktop app, a terminal, a spreadsheet, or a browser.
  • A browser agent is an application of that capability aimed at the web browser. Many browser agents also use a cheaper, structured alternative — reading the DOM or accessibility tree instead of screenshots — which computer use, by definition, does not.

So every screenshot-driven browser agent is doing computer use, but not all computer use is browser work, and not all browser agents use computer use. If your target is the web specifically, start at browser agents, where DOM control is usually the faster default; if your target is the whole desktop, computer use is the general tool.

Frequently asked questions

What is computer use?

Computer use is a model capability: the AI perceives a computer screen through screenshots and controls the mouse and keyboard to operate software the way a person does. It runs in a loop — look at the screen, decide the next action, execute it, look again — and works across any application that renders to a screen, not just a browser. It differs from scripted automation because the model reasons about the screen at run time and adapts when the interface changes.

What is Claude computer use?

Claude computer use is Anthropic's implementation of the capability, shipped as a beta tool. You enable it by adding a tool of type computer_20251124 with the screen dimensions and the computer-use-2025-11-24 beta header; the action schema is built into the model. Claude then returns mouse and keyboard actions that your code runs in a sandbox, capturing a new screenshot after each step. A reference implementation packages a virtual display, the tool code, and the agent loop in a Docker container.

How does computer use work?

It works as a perceive-act loop. Your application sends a screenshot of the screen; the model reasons and emits a tool-use request such as a click at given coordinates or some text to type; your code executes that action in a controlled environment and captures a fresh screenshot; and the result feeds back so the model can decide the next step. This repeats without further user input until the task is complete or a budget is reached.

What can computer use do?

It can operate graphical software end to end: open and navigate apps, click buttons, fill and submit forms, type into fields, scroll, and move data between tools. That makes it useful for automated software testing, robotic process automation on legacy portals, research and data gathering across unfamiliar interfaces, and general multi-step agents. It is strongest where there is no API to call and the steps vary run to run.

Is computer use safe?

Not by default — it reads untrusted screen content and can take real actions, so prompt injection (instructions hidden in a page or image) and unintended irreversible actions are first-class risks. Deploy it safely by running in a minimal-privilege virtual machine or container, keeping sensitive data and credentials out of reach, allowlisting domains, requiring human confirmation for consequential actions, and bounding the loop. Anthropic also runs classifiers that flag likely injections and prompt for confirmation, but those precautions remain necessary.

Computer use vs browser agents — what's the difference?

Computer use is the capability — perceiving a screen and controlling mouse and keyboard — for any application. A browser agent is that capability (or a cheaper DOM-based alternative) applied specifically to a web browser. Every screenshot-driven browser agent does computer use, but computer use also covers desktop apps and terminals, and many browser agents drive the page through its structure rather than pixels. Pick browser agents for web-only targets; pick computer use when the whole desktop is in scope.

Sources & provenance

Computer use is a beta feature; tool versions, model support, and API shapes change. Verify the current beta header and tool type against the live docs before building. Corrections: hello@aiarch.dev.

Learn to build agents that act, not just answer.

AI Architect Academy teaches the agentic loop, tool design, guardrails, and evaluation as first-class skills — on a platform that is itself a production agentic system built across Anthropic, AWS, and Cloudflare. The build is the curriculum.

Free sample — no signup · every claim cited · cancel anytime

Or get notified when new tracks ship.