Claw field notebook
last updated 2026-05-14

Common patterns — tool use, structured outputs, caching

Four patterns that come up over and over building on the Claude API — the tool-use loop, structured JSON outputs, prompt caching for long system prompts, and message batches for bulk async work. Code in Python + TypeScript.

Pattern 1: Tool use

The most-used pattern. You give Claude a list of tools, each described by a typed JSON schema. Claude decides when to call them; you execute the calls and feed the results back. The loop continues until Claude gives a final text response.

The shape

import anthropic

client = anthropic.Anthropic()

tools = [
    {
        "name": "get_weather",
        "description": "Get current weather for a city.",
        "input_schema": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
]

messages = [{"role": "user", "content": "What's the weather in Auckland?"}]

while True:
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        tools=tools,
        messages=messages,
    )

    if response.stop_reason != "tool_use":
        # Claude is done (end_turn) or stopped for another reason such as
        # max_tokens; print whatever text came back and leave the loop
        print("".join(b.text for b in response.content if b.type == "text"))
        break

    # Claude wants one or more tools; answer every tool_use block in this turn
    tool_results = []
    for block in response.content:
        if block.type == "tool_use":
            # run_tool is your own dispatcher, not part of the SDK
            result = run_tool(block.name, block.input)
            tool_results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": str(result),
            })

    # Append Claude's request AND the results, then loop
    messages.append({"role": "assistant", "content": response.content})
    messages.append({"role": "user", "content": tool_results})

Why it loops

Claude doesn’t run tools — you do. After it asks for get_weather, you execute it (call your weather API, etc.), then send the result back as a tool_result. Claude reads the result and decides what’s next: another tool call, or a final answer.
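The loop above leaves run_tool undefined. Here is a minimal sketch of one way to write it, assuming a hypothetical get_weather function that returns canned data rather than calling a real weather API. The TOOL_HANDLERS registry and the return shape are illustrative choices, not part of the SDK.

```python
from typing import Any, Callable


def get_weather(city: str, unit: str = "celsius") -> dict[str, Any]:
    """Hypothetical stand-in: returns canned data instead of calling a weather API."""
    temp_c = 18.0
    temp = temp_c if unit == "celsius" else temp_c * 9 / 5 + 32
    return {"city": city, "temperature": round(temp, 1), "unit": unit}


# Map each tool name from the `tools` list to the function that implements it.
TOOL_HANDLERS: dict[str, Callable[..., Any]] = {
    "get_weather": get_weather,
}


def run_tool(name: str, tool_input: dict[str, Any]) -> Any:
    """Dispatch a tool_use request to local code."""
    handler = TOOL_HANDLERS.get(name)
    if handler is None:
        # Return the problem as data so it can be sent back in a tool_result
        return {"error": f"unknown tool: {name}"}
    return handler(**tool_input)
```

Keeping the registry keyed by tool name means adding a tool is one schema entry plus one dictionary entry.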

Why this matters

This is the foundation of agentic work. Every “agent that does things” — Claude Code, Cursor’s agent mode, an MCP server — is built on this loop. Once you understand it, every agent surface looks like the same protocol with different UIs around it.

Gotchas

  • Always include the assistant’s tool_use response in the messages of your next request, immediately before the tool_result. Without it, the API cannot match the result to a call and rejects the request.
  • tool_result content must be a string (or a list of content blocks). If your tool returns a complex object, json.dumps() it first.
  • Multiple tool calls per turn. Claude can return more than one tool_use block. Loop over all of them.
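The last two gotchas can be handled by one helper. This is a sketch, not SDK code: build_tool_results and the run_tool callback are hypothetical names, and the blocks are assumed to expose .type, .id, .name and .input the way the loop above reads them. The is_error flag is how a tool_result reports a failed call back to Claude.

```python
import json
from types import SimpleNamespace
from typing import Any, Callable


def build_tool_results(content: list[Any], run_tool: Callable[[str, dict], Any]) -> list[dict]:
    """Return one tool_result block per tool_use block in an assistant turn."""
    results = []
    for block in content:
        if block.type != "tool_use":
            continue  # skip text blocks
        try:
            output = run_tool(block.name, block.input)
            # Strings pass through; anything else is serialised to JSON
            payload = output if isinstance(output, str) else json.dumps(output)
            results.append({"type": "tool_result", "tool_use_id": block.id, "content": payload})
        except Exception as exc:
            # Report the failure to Claude instead of crashing the loop
            results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": f"{type(exc).__name__}: {exc}",
                "is_error": True,
            })
    return results


# Stand-in blocks for illustration; a real response.content exposes the same attributes
demo_content = [
    SimpleNamespace(type="text", text="Checking both cities."),
    SimpleNamespace(type="tool_use", id="toolu_1", name="get_weather", input={"city": "Auckland"}),
    SimpleNamespace(type="tool_use", id="toolu_2", name="get_weather", input={"city": "Wellington"}),
]
demo_results = build_tool_results(demo_content, lambda name, args: {"city": args["city"], "temp_c": 18})
```

The returned list becomes the content of the next user message, so every tool_use in the turn gets its answer in a single request.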

Pattern 2: Structured outputs

When you need the model’s response in a specific JSON shape — for downstream parsing, storing in a DB, feeding into another API.

The naive way (sometimes works)

import json

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=500,
    messages=[{
        "role": "user",
        "content": "Return JSON: {\"sentiment\": \"positive\"|\"negative\", \"score\": 0..1}. Text: 'This product is amazing.'"
    }],
)

# Raises json.JSONDecodeError if the reply is anything other than bare JSON
data = json.loads(response.content[0].text)

This works most of the time. It fails when Claude adds prose around the JSON (“Here’s the result:”), uses different field names, or wraps it in markdown code fences.
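If you are stuck with the naive approach, a tolerant parser covers the prose and code-fence failures, though it cannot fix renamed fields. This is a minimal sketch: extract_json is a hypothetical helper, and it assumes the first "{" in the reply opens the JSON object you want.

```python
import json
from typing import Any


def extract_json(text: str) -> Any:
    """Parse the first JSON object found in a model reply."""
    start = text.find("{")
    if start == -1:
        raise ValueError("no JSON object found in reply")
    # raw_decode parses one JSON value beginning at `start` and ignores
    # whatever follows it (closing code fence, trailing prose)
    obj, _end = json.JSONDecoder().raw_decode(text, start)
    return obj
```

This is a patch over the problem, not a fix for it: a brace in the surrounding prose still breaks it.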

The reliable way: structured outputs via output_config

The Claude API has a dedicated structured-outputs surface (different shape from the OpenAI API’s response_format if you’re coming from there). Set output_config.format with a json_schema:

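response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=500,
    output_config={
        "format": {
            "type": "json_schema",
            # The schema itself goes under the "schema" key
            "schema": {
                "type": "object",
                "properties": {
                    "sentiment": {"type": "string", "enum": ["positive", "negative", "neutral"]},
                    # Range is stated in the description: numeric bounds such as
                    # minimum/maximum are not among the supported schema keywords
                    "score": {"type": "number", "description": "Score between 0 and 1"},
                },
                "required": ["sentiment", "score"],
                # Object schemas must disallow extra properties
                "additionalProperties": False,
            },
        },
    },
    messages=[{"role": "user", "content": "..."}],
)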

The API constrains the output to match the schema and validates before returning. You should still defensively handle errors and the occasional refusal — “constrained” isn’t “infallible” — but the structural-correctness problem largely goes away.
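Defensive handling might look like the following. This is a sketch under assumptions: parse_structured and StructuredOutputError are hypothetical names, the JSON is assumed to arrive in the first text block, and the refusal and max_tokens stop reasons are treated as the two cases where the body may not be complete, schema-valid JSON.

```python
import json
from types import SimpleNamespace
from typing import Any


class StructuredOutputError(RuntimeError):
    """Raised when a response cannot be trusted to contain schema-valid JSON."""


def parse_structured(response: Any) -> dict:
    if response.stop_reason == "refusal":
        raise StructuredOutputError("model declined to answer")
    if response.stop_reason == "max_tokens":
        # Output was cut off, so the JSON is probably incomplete
        raise StructuredOutputError("output truncated; raise max_tokens")
    text = next((b.text for b in response.content if b.type == "text"), None)
    if text is None:
        raise StructuredOutputError("no text block in response")
    try:
        return json.loads(text)
    except json.JSONDecodeError as exc:
        raise StructuredOutputError(f"invalid JSON: {exc}") from exc


# Stand-in response for illustration only; a real one comes from client.messages.create
demo_response = SimpleNamespace(
    stop_reason="end_turn",
    content=[SimpleNamespace(type="text", text='{"sentiment": "positive", "score": 0.93}')],
)
parsed = parse_structured(demo_response)
```

Raising a single exception type lets callers retry or log in one place rather than checking three separate failure modes.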

This is the right default for production. Spend the 5 minutes writing the schema; save the parsing headaches forever.

⚠️ If you’re cribbing from OpenAI examples, the parameter is response_format over there. On Claude’s API it’s output_config with a nested format. Don’t mix them up — response_format is not a Claude API parameter and the call will fail.

Pattern 3: Prompt caching

If your system prompt is large (long instructions, big knowledge dump, complex tool definitions) AND you call the API many times with the same prefix, prompt caching makes the input ~10× cheaper on subsequent calls.

How to mark a cache breakpoint

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=500,
    system=[
        {
            "type": "text",
            "text": "You are a customer support agent. Here are 50KB of FAQ docs: ...",
            "cache_control": {"type": "ephemeral"},
        },
    ],
    messages=[{"role": "user", "content": "How do I cancel my subscription?"}],
)

The system prompt section with cache_control is hashed and cached server-side. The next call within the cache lifetime (~5 minutes by default) reuses the cache; only the new tokens (the user message) are billed at the full input rate.
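To confirm that a call actually hit the cache, check the response's usage object: cache_creation_input_tokens counts tokens written to the cache and cache_read_input_tokens counts tokens served from it. This is a small sketch; describe_cache is a hypothetical helper, and the demo objects stand in for response.usage.

```python
from types import SimpleNamespace
from typing import Any


def describe_cache(usage: Any) -> str:
    """Summarise cache activity for one response."""
    # Either field may be missing or None, so fall back to 0
    written = getattr(usage, "cache_creation_input_tokens", 0) or 0
    read = getattr(usage, "cache_read_input_tokens", 0) or 0
    if read:
        return f"cache hit: {read} tokens read"
    if written:
        return f"cache write: {written} tokens stored"
    return "no cache activity"


# Stand-ins for response.usage on a first and a second call (token counts are made up)
first_call = SimpleNamespace(input_tokens=12, cache_creation_input_tokens=14000, cache_read_input_tokens=0)
second_call = SimpleNamespace(input_tokens=15, cache_creation_input_tokens=0, cache_read_input_tokens=14000)
summaries = [describe_cache(first_call), describe_cache(second_call)]
```

Logging this on every call is a cheap way to notice when a prompt change has silently invalidated the cached prefix.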

When it pays off

  • Long system prompt + many calls (chatbot with 20KB of instructions, called thousands of times)
  • Tool definitions that don’t change (cache the tool array)
  • RAG with stable context (the same 10K of docs injected ahead of 50 different questions)

When it doesn’t

  • Conversations where the system prompt changes per call
  • Single-shot scripts that make one request

Cost math

Cache writes cost roughly 1.25× the base input price (5-minute default lifetime). Cache reads cost roughly 0.1× the base input price (about 10%). Caching therefore pays for itself on the second call that hits the same cache: two uncached calls cost 2× the prefix price, while one write plus one read costs about 1.35×. Every additional read within the cache lifetime widens the gap.
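The break-even arithmetic can be worked through in code. The multipliers are the rough figures quoted above, and costs are expressed in units of one uncached pass over the prefix, so no real prices are assumed. The function names are illustrative.

```python
WRITE_MULTIPLIER = 1.25  # first call pays the cache-write premium
READ_MULTIPLIER = 0.10   # later calls within the lifetime pay the read rate


def prefix_cost_uncached(calls: int) -> float:
    """Every call pays the full input price for the prefix."""
    return float(calls)


def prefix_cost_cached(calls: int) -> float:
    """One write followed by (calls - 1) reads, all within the cache lifetime."""
    if calls <= 0:
        return 0.0
    return WRITE_MULTIPLIER + READ_MULTIPLIER * (calls - 1)


for n in (1, 2, 10, 100):
    print(n, prefix_cost_uncached(n), round(prefix_cost_cached(n), 2))
```

With one call, caching costs more (1.25 against 1.0). With two it is already cheaper (1.35 against 2.0), and with 100 calls the cached prefix costs about 11.15 units against 100.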

Prompt-cache pricing changes occasionally. Verify the current write/read multipliers on anthropic.com/pricing before banking on the math.

Pattern 4: Message batches (async bulk work)

For when you have many prompts to run and they’re not time-sensitive.

The flow

import time

batch = client.messages.batches.create(
    requests=[
        {"custom_id": "ticket_1", "params": {"model": "claude-sonnet-4-6", "max_tokens": 200, "messages": [{"role": "user", "content": "Classify: ..."}]}},
        {"custom_id": "ticket_2", "params": {"model": "claude-sonnet-4-6", "max_tokens": 200, "messages": [{"role": "user", "content": "Classify: ..."}]}},
        # ... up to 100,000 requests or 256 MB per batch (whichever you hit first)
    ],
)

# Poll for completion (typically minutes to a few hours)
while True:
    status = client.messages.batches.retrieve(batch.id)
    if status.processing_status == "ended":
        break
    time.sleep(60)

# Stream the results (available for 29 days after creation)
for result in client.messages.batches.results(batch.id):
    if result.result.type == "succeeded":
        print(result.custom_id, result.result.message.content[0].text)
    else:
        # "errored", "canceled" or "expired": there is no message to read
        print(result.custom_id, result.result.type)
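In practice the requests list is generated rather than typed out. This sketch assumes a hypothetical dict of ticket texts keyed by ID. build_requests and the prompt wording are illustrative, and the request shape mirrors the call above.

```python
def build_requests(tickets: dict[str, str], model: str = "claude-sonnet-4-6") -> list[dict]:
    """Turn {ticket_id: text} into batch request entries with unique custom_ids."""
    return [
        {
            "custom_id": ticket_id,
            "params": {
                "model": model,
                "max_tokens": 200,
                "messages": [{"role": "user", "content": f"Classify: {text}"}],
            },
        }
        for ticket_id, text in tickets.items()
    ]


# Made-up tickets for illustration
demo_requests = build_requests({
    "ticket_1": "My invoice is wrong",
    "ticket_2": "App crashes on login",
})
```

Using your own record ID as custom_id matters because results are not guaranteed to come back in request order; custom_id is how you match each result to its input.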

Why use batches

Batch calls are billed at ~50% off the non-batch rate. For bulk classification, summarisation, or any non-interactive work, this is a real cost lever.

Tradeoffs

  • Slower. Batches can take minutes to hours to complete.
  • No streaming. You wait for the whole thing.
  • No mid-batch reaction. Can’t adjust based on early results.

If your workload tolerates latency, the savings are significant.
