Claw field notebook
last updated 2026-05-15
§ 3 Connections / § 3.4

Models

The brain. Which model providers OpenClaw supports, how to configure model refs, and how to think about model failover and cost.

What “model” means here

The model is the LLM — the brain that decides what the agent says. OpenClaw is model-agnostic: it talks to whichever provider you’ve configured. You can switch models without changing the agent runtime, the workspace, or the channels.

The model is the most expensive thing in your stack (in API cost) and the most consequential (in quality). Both are worth thinking about up front.

Supported providers

From the docs and README:

| Provider | OAuth | API key | Recommended for |
| --- | --- | --- | --- |
| Anthropic | | | Claude family; solid fit for tool use and multi-step reasoning at last review |
| OpenAI | ✓ (ChatGPT/Codex subscriptions) | | If you already pay for ChatGPT Plus/Pro/Codex |
| Google | | | Gemini family; cheap for chat |
| OpenRouter | | | Multi-model routing; good for trying many models with one key |
| Local (Ollama, etc.) | n/a | n/a | Privacy-sensitive or offline; smaller models only on consumer hardware |
| Custom providers | varies | varies | Self-hosted commercial inference, Azure OpenAI, AWS Bedrock |

The README’s “Sponsors” section lists OpenAI as the OAuth-able subscription. If you’re paying for ChatGPT Plus, you can sign in with that subscription directly, with no separate API key needed.

Model refs (the format)

Model refs in config use provider/model:

{
  "agents": {
    "defaults": {
      "model": "anthropic/claude-sonnet-4-6"
    }
  }
}

For OpenRouter-style nested refs, include the provider prefix:

{ "model": "openrouter/moonshotai/kimi-k2" }
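One way to read the two examples above is that only the first slash separates the provider from the model id, so everything after it (including further slashes) belongs to the model. The sketch below illustrates that reading; `ModelRef` and `parse_model_ref` are hypothetical names, and the first-slash rule is an assumption drawn from the examples, not OpenClaw's actual parser.

```python
from typing import NamedTuple, Optional


class ModelRef(NamedTuple):
    provider: Optional[str]  # None when the ref carries no provider prefix
    model: str


def parse_model_ref(ref: str) -> ModelRef:
    """Split on the first slash only (assumed rule, not OpenClaw's own code)."""
    provider, sep, model = ref.partition("/")
    if not sep:
        # No slash at all: treat the whole string as a bare model id.
        return ModelRef(None, ref)
    return ModelRef(provider, model)


print(parse_model_ref("anthropic/claude-sonnet-4-6"))
print(parse_model_ref("openrouter/moonshotai/kimi-k2"))
```

Under this reading, the OpenRouter ref keeps `moonshotai/kimi-k2` intact as the model id, which is why the `openrouter/` prefix has to be spelled out.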

If you omit the provider, OpenClaw tries:

  1. An alias defined in your config
  2. A unique configured-provider match for that exact model id
  3. The configured default provider as fallback
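The three-step lookup above can be sketched roughly as follows. This is an illustration of the documented order, not OpenClaw's implementation: the function name, the shape of `aliases` and `provider_models`, and the `sonnet` alias are all hypothetical.

```python
from typing import Dict, List


def resolve_bare_model(
    model_id: str,
    aliases: Dict[str, str],
    provider_models: Dict[str, List[str]],
    default_provider: str,
) -> str:
    """Resolve a model id without a provider prefix to a full provider/model ref."""
    # 1. An alias defined in config wins.
    if model_id in aliases:
        return aliases[model_id]
    # 2. Exactly one configured provider exposes this exact model id.
    matches = [p for p, models in provider_models.items() if model_id in models]
    if len(matches) == 1:
        return f"{matches[0]}/{model_id}"
    # 3. Otherwise fall back to the configured default provider.
    return f"{default_provider}/{model_id}"


# Hypothetical configuration used only for this illustration.
aliases = {"sonnet": "anthropic/claude-sonnet-4-6"}
provider_models = {
    "anthropic": ["claude-sonnet-4-6", "claude-haiku-4-5"],
    "openai": ["gpt-5.5"],
}

for name in ("sonnet", "gpt-5.5", "kimi-k2"):
    print(name, "->", resolve_bare_model(name, aliases, provider_models, "anthropic"))
```

Note that step 2 only applies when the match is unique; if two providers expose the same id, this sketch falls through to the default provider.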

If a configured default provider no longer exposes the configured default model, OpenClaw falls back to the first configured provider/model instead of surfacing a stale removed-provider default.

Model selection in 2026

This section is opinion as of 2026-05. The model landscape changes fast — vendors ship new model generations and shift pricing roughly quarterly. Verify against current provider docs, pricing pages, and your own workload tests before locking in.

The boring answer first: start on Sonnet, set an explicit fallback, and only reach for Opus / GPT-5.5 / Gemini Pro when your actual workload proves it needs them. Most of the picks below are corner cases on top of that.

For agentic work (tool use, multi-step reasoning)

  • anthropic/claude-sonnet-4-6: Anthropic’s recommended default for production, balancing speed and intelligence with a 1M-token context. Start here unless cost or latency pushes you elsewhere.
  • anthropic/claude-opus-4-7: for when correctness matters more than cost. Anthropic’s heaviest tier for complex reasoning and agentic coding, with 1M context. Roughly 1.7× the Sonnet price on both input and output.
  • openai/gpt-5.5: one of OpenAI’s premium reasoning tiers (alongside GPT-5.2 and GPT-5.4) per GitHub Copilot’s model list, aimed at multi-step problem solving and architecture-level analysis.
  • google/gemini-3.1-pro-preview: Google’s current Gemini 3 Pro preview. For a stable fallback, use google/gemini-2.5-pro, a previous generation (1M context) that is supported until its shutdown on 16 October 2026.

For cost-sensitive chat (no heavy tool use)

  • anthropic/claude-haiku-4-5 — Anthropic’s fastest and cheapest tier. 200K context (not 1M). Surprisingly capable for light tool use and batch classification.
  • openai/gpt-5-mini — one of three OpenAI models included at zero premium-request cost on paid GitHub Copilot plans (the other two are GPT-4.1 and GPT-4o); pay-per-token via the OpenAI API directly. A reliable cheap default for low-stakes tasks.
  • google/gemini-3.1-flash-lite — Google’s lightest Gemini 3 tier, stable since 7 May 2026. A low-cost option for high-volume chat and simple classification; verify current pricing before relying on it as the cheap default. Stable Gemini 2.5 alternative: google/gemini-2.5-flash (same 16 October 2026 shutdown as 2.5 Pro).

For privacy / offline (local)

  • Llama 3.2 3B (or current Llama 4.x equivalents on Ollama) — runs on a Pi 5 8GB. Quality noticeably below cloud models but acceptable for narrow tasks.
  • Llama 3.1 8B / Qwen 2.5 7B — needs ~8GB RAM available; runs on a Mac M1+ or a Linux desktop with 16GB+.
  • Llama 3.3 70B / Qwen 2.5 32B — needs serious GPU; not for laptops.
# Quick local model setup with Ollama
# llama3.2:3b shown as example; `ollama list` only shows models already pulled,
# so check the Ollama model library for current Llama 4.x tags
ollama pull llama3.2:3b
# Then in openclaw.json:
{ "agents": { "defaults": { "model": "ollama/llama3.2:3b" } } }

Model failover

OpenClaw supports configuring multiple models, with failover when the primary fails. From the model failover docs:

{
  "agents": {
    "defaults": {
      "models": [
        "anthropic/claude-sonnet-4-6",
        "openai/gpt-5.5",
        "google/gemini-3.1-pro-preview"
      ]
    }
  }
}

If the primary fails (rate limit, outage, auth error), OpenClaw tries the next model in the list. Auth-profile rotation is also supported: you can configure multiple credentials per provider to draw on separate quotas in parallel.

Why this matters: model providers have outages. If your only model is OpenAI and OpenAI is down for 30 minutes, your agent is silent for 30 minutes. With failover, you degrade gracefully.
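The fall-through behaviour can be pictured with a small loop. This is a minimal sketch under stated assumptions, not OpenClaw's failover code: `ProviderError`, `complete_with_failover`, and `fake_call` are hypothetical names, and the simulated outage stands in for a real rate limit, outage, or auth error.

```python
from typing import Callable, List, Tuple


class ProviderError(Exception):
    """Stand-in for a rate limit, outage, or auth failure (hypothetical type)."""


def complete_with_failover(
    models: List[str],
    call_model: Callable[[str, str], str],
    prompt: str,
) -> Tuple[str, str]:
    """Try each model ref in order; return (model_used, reply) from the first success."""
    errors = []
    for ref in models:
        try:
            return ref, call_model(ref, prompt)
        except ProviderError as exc:
            # Record the failure and fall through to the next model in the list.
            errors.append(f"{ref}: {exc}")
    raise ProviderError("all models failed: " + "; ".join(errors))


def fake_call(ref: str, prompt: str) -> str:
    # Simulate an outage at the primary provider only.
    if ref.startswith("anthropic/"):
        raise ProviderError("simulated outage")
    return f"reply from {ref}"


models = [
    "anthropic/claude-sonnet-4-6",
    "openai/gpt-5.5",
    "google/gemini-3.1-pro-preview",
]
used, reply = complete_with_failover(models, fake_call, "hello")
print(used)  # openai/gpt-5.5
```

Only when every entry in the list fails does the caller see an error, which is the "degrade gracefully" property described above.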

Cost notes (rough, illustrative figures · 2026-05)

These are rough estimates, not measured runs. Anthropic and Google values were pulled from each provider’s public pricing page on 2026-05-15. OpenAI direct-API per-token prices aren’t quoted here because the pricing page is JavaScript-rendered and shifts frequently — check it live before budgeting. The “monthly usage” figures are back-of-envelope assumptions about a personal-assistant workload (~1000 messages, ~2000 tokens per exchange — combined input + output), not measurements from a real OpenClaw deployment.

API call cost per 1M tokens at last check (Anthropic pricing · OpenAI pricing · Google Gemini pricing):

| Model | Input | Output |
| --- | --- | --- |
| Claude Opus 4.7 | $5 | $25 |
| Claude Sonnet 4.6 | $3 | $15 |
| Claude Haiku 4.5 | $1 | $5 |
| Gemini 2.5 Pro (≤200K context) | $1.25 | $10 |
| Gemini 3.1 Pro Preview (≤200K context) | $2 | $12 |
| Gemini 3 Flash Preview | $0.50 | $3 |
| Gemini 3.1 Flash-Lite (stable since 7 May 2026) | $0.25 | $1.50 |
| Gemini 2.5 Flash | $0.30 | $2.50 |
| GPT-5.5 / GPT-5.4 / GPT-5-mini | not quoted; check vendor page | not quoted; check vendor page |

For GitHub Copilot users, the per-model request multiplier (not a per-token price) is what matters — those live at docs.github.com/en/copilot/concepts/billing/copilot-requests.

For a back-of-envelope “personal assistant” usage profile (~1000 messages/month, ~2000 tokens/message average) the estimated monthly spend is:

  • Claude Sonnet 4.6: ~$15–25/month (estimate)
  • Claude Haiku 4.5: ~$3–6/month (estimate)
  • Gemini 3.1 Flash-Lite: ~$0.50–1.50/month (estimate)

Heavy agentic workflows (long sessions, multiple tool calls per message) can be 5–10× this. Measure your own usage before you trust the estimate.
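To make the back-of-envelope arithmetic explicit, here is a small sketch that multiplies the usage profile by the list prices quoted above. The input/output split is not stated in this section, so `output_share` is an assumption you should replace with your own measured ratio; the function and variable names are hypothetical.

```python
# (input USD, output USD) per 1M tokens, copied from the price table in this section.
PRICES_PER_MTOK = {
    "Claude Sonnet 4.6": (3.00, 15.00),
    "Claude Haiku 4.5": (1.00, 5.00),
    "Gemini 3.1 Flash-Lite": (0.25, 1.50),
}


def monthly_cost(
    model: str,
    messages: int = 1000,
    tokens_per_message: int = 2000,
    output_share: float = 0.5,  # ASSUMPTION: fraction of tokens billed as output
) -> float:
    """Estimate monthly spend in USD for a combined input + output token budget."""
    total_tokens = messages * tokens_per_message
    output_tokens = total_tokens * output_share
    input_tokens = total_tokens - output_tokens
    price_in, price_out = PRICES_PER_MTOK[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


for name in PRICES_PER_MTOK:
    print(f"{name}: ${monthly_cost(name):.2f}/month at a 50/50 split")
```

With an even split this gives about $18 for Sonnet, $6 for Haiku, and $1.75 for Flash-Lite. Dropping `output_share` to 0.25 lowers those to about $12, $4, and $1.1, which shows how sensitive the estimate is to the output-heavy part of the bill.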

Cost-runaway pattern: see §6.3 pattern #3. Set provider-side spending alerts and hard caps before you tinker.

Auth profiles (per-agent)

Auth profiles live at ~/.openclaw/agents/<agentId>/agent/auth-profiles.json:

{
  "anthropic": { "apiKey": "sk-ant-..." },
  "openai":    { "apiKey": "sk-..." },
  "ollama":    { "host": "http://localhost:11434" }
}

Or via OAuth for OpenAI:

openclaw auth login openai
# opens a browser, signs in with ChatGPT/Codex subscription
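Before relying on a failover list, it can help to check that every provider it names has some entry in the auth profiles. The sketch below does that against the JSON shape shown above; `providers_missing_auth` is a hypothetical helper, and it assumes every credential (including OAuth logins) appears as a top-level key in that file, which may not hold for your setup.

```python
import json
from typing import Dict, List


def providers_missing_auth(models: List[str], profiles: Dict[str, dict]) -> List[str]:
    """Return providers referenced by `models` that have no entry in `profiles`."""
    missing: List[str] = []
    for ref in models:
        provider = ref.partition("/")[0]
        if provider not in profiles and provider not in missing:
            missing.append(provider)
    return missing


# Same shape as the auth-profiles.json example in this section.
profiles = json.loads("""
{
  "anthropic": { "apiKey": "sk-ant-..." },
  "openai":    { "apiKey": "sk-..." },
  "ollama":    { "host": "http://localhost:11434" }
}
""")
models = [
    "anthropic/claude-sonnet-4-6",
    "openai/gpt-5.5",
    "google/gemini-3.1-pro-preview",
]
print(providers_missing_auth(models, profiles))  # ['google']
```

Here the Gemini entry in the failover list would have no credentials to use, so the third fallback could never succeed.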

Things to try

  • Run the agent on Sonnet for a week, then Haiku for a week. Notice which tasks degrade. Adjust your default based on real workload, not first principles.
  • Configure failover Claude → GPT-5.5 → Gemini. Stress-test by temporarily revoking Claude’s key. Watch the agent gracefully fall through.
  • Try a local 3B model on a Pi. See §2.6 Raspberry Pi for the setup. Quality calibration is worthwhile even if you go back to hosted models afterwards.
  • Set spending alerts at $5 / $10 / $25 thresholds on your model provider account. Cheap insurance.

What we are NOT going to claim

Model quality rankings change with every major release. The 2026-05 picks above are best-effort at time of writing. Benchmarks for agentic (tool-using, multi-step) work are still maturing; published benchmarks usually measure single-turn tasks. Trust your own A/B tests over any single source.
