Computer Use, plainly · Claw Planet

The thirty-second version#

Computer Use is Claude controlling a computer — taking screenshots, deciding where to click, what to type, when to scroll, until a task is done. You wire Claude up to a virtual machine (or a sandboxed real machine); Claude uses an MCP-style tool named computer (with actions like screenshot, left_click, type, key, mouse_move, scroll), plus optional bash and text_editor tools; your harness executes the actions against the OS.

This is beta. Many Claude 4.x models support Computer Use across two beta header versions — see §API.3 Models for the current model matrix. Capability is real for narrow use cases; failure modes are loud for most things. Don’t put it in front of customers; do experiment with it for personal automation.

The shape#

┌─────────────────────────┐
│  Your harness           │
│  (a Python script)      │
│                         │
│  ┌──────────────────┐   │       ┌──────────────────┐
│  │  Claude API call │ ◄─┼────── │  Claude model    │
│  └────────┬─────────┘   │       │  (Sonnet 4.6)    │
│           │             │       └──────────────────┘
│           │ "click here"
│           ▼
│  ┌──────────────────┐   │
│  │  Action executor │   │
│  │  (pyautogui /    │   │
│  │   xdotool / etc) │   │
│  └────────┬─────────┘   │
│           │ click happens
│           ▼
│  ┌──────────────────┐   │
│  │  The OS          │   │
│  │  (in a VM!)      │   │
│  └──────────────────┘   │
│           │             │
│           ▼ screenshot  │
│  ┌──────────────────┐   │
│  │  Back to Claude  │   │
│  └──────────────────┘   │
└─────────────────────────┘

Claude sees the screenshot, decides the next action, returns a tool_use block. Your harness executes the action (click X,Y · type “hello” · scroll down 300px), takes another screenshot, sends it back. Loop until Claude says “done.”

What it can do (today, narrow)#

Repeat UI tasks — open an app, click a sequence of menus, type into a form
Form filling from structured data — read a CSV, fill out a web form for each row
Accessibility-style flows — operate apps that don’t have APIs but do have UIs
Test web apps end-to-end (alternative to Playwright when the test scenarios are exploratory)
Data scraping from interfaces that block APIs

What it can’t / shouldn’t do#

Anything where mistakes are expensive. It clicks the wrong button regularly. Don’t let it operate your bank.
Anything time-sensitive. Each step costs a Claude call (~5–15 seconds). A 50-click task takes minutes.
Long tasks. Context fills, decisions degrade. 20-30 actions is realistic; 200+ rarely is.
High-precision pixel work. Drawing, design, anything that needs sub-pixel accuracy.
Tasks where the UI changes. Claude memorises positions; a layout shift between runs confuses it.

The safety pattern (mandatory)#

⚠️ Computer Use needs a sandbox. Not optional.

The model can click anywhere on the screen it sees. If you point it at your daily-driver desktop:

It might click on your password manager and read your vault
It might send your real email
It might accept terms of service on your behalf
It might rm -rf something if you have a terminal open

The Anthropic reference implementation runs Claude against a disposable Docker container with a virtual desktop (Xvfb + a window manager). When the task is done, you nuke the container. Anthropic’s quickstart at github.com/anthropics/anthropic-quickstarts ships exactly this.

Variations that are also OK:

A real laptop dedicated to Claude (no real accounts, no sensitive data, scope limited)
A VM (VirtualBox, Parallels, Hyper-V) with a snapshot you can revert
A cloud sandbox (e.g. an Anthropic-provided sandbox once that ships at GA)

Variations that are NOT OK:

Your daily-driver computer with your real accounts logged in
A shared machine
Anything where the agent’s mistake costs you real money or data

What it actually costs#

Computer Use is expensive per task:

Each step = one Claude API call with a screenshot
Screenshots count as image tokens (~ a few thousand tokens each)
A 30-step task could easily burn 100K+ input tokens

This is the trade-off. You’re paying for the model’s vision + reasoning on every step. For a 1-minute manual task, it’s not worth automating. For a tedious 30-minute task you’ll repeat 20 times, the math works out.

A scenario that’s working today#

You want to bulk-update product descriptions in an internal admin tool. The tool has no API. The descriptions live in a spreadsheet you’ve prepared.

The harness:

Reads row 1 from the spreadsheet
Boots a sandboxed container with a Chromium pointed at the admin tool (already logged in via a session cookie)
Asks Claude: “navigate to the product editor for SKU <row.sku>, update the description field to <row.description>, save, then close the editor”
Claude takes screenshots, clicks through, types, saves
The harness moves to row 2, repeat

Reliability is “good enough to save time but you spot-check 10% by hand.” That’s the sweet spot.

A scenario that’s NOT working today#

You want to use Computer Use as your daily assistant — “read my emails and reply to the ones from clients.”

Failure modes:

Misclicks (the wrong email gets opened)
Sends drafts that read off-tone for real client communication
Doesn’t know context that lives in your head (which client is which)
Operates 10× slower than you would

This is what “narrow use case beta” means. Wait for GA or use something purpose-built (a real email assistant with proper hooks into your inbox).

What to do next#

§CU.2 Getting started with Computer Use — clone the demo, run a first task
§API.4 Common patterns — the tool-use loop Computer Use builds on
§CC.8 Pitfalls — many same gotchas apply (rate limits, runaway cost)

`⌘` + `K` · `/`	open search
`j`	next entry (within section)
`k`	previous entry
`g` `h`	go to home
`g` `m`	go to methodology
`?`	show this help
`esc`	close any modal