Claw field notebook
last updated 2026-05-15 edit on GitHub colophon
Google / Gemini API / GAPI.4 · 3 min read

Gemini API — common patterns

Multi-turn chat, streaming, system instructions (a separate parameter, not the first message), Files API for inputs over 100 MB, context caching for cost savings on repeated context, structured output with JSON schemas, and safety settings.

What this page covers#

Seven patterns you’ll reach for in almost every project:

  1. Multi-turn chat (the contents list shape)
  2. Streaming responses
  3. System instructions (a separate parameter)
  4. The Files API (inputs over 100 MB, PDFs over 50 MB)
  5. Context caching (cost savings for repeated large context)
  6. Structured output (JSON with constrained decoding)
  7. Safety settings

Plus a brief note on the Live API for real-time voice/video.

1. Multi-turn chat#

The simplest path uses the chats helper:

from google import genai

client = genai.Client()
chat = client.chats.create(model="gemini-2.5-flash")

print(chat.send_message("My name is Sush.").text)
print(chat.send_message("What's my name?").text)

Under the hood, chats accumulates contents and resends them each turn. For more control, build the contents list yourself:

from google.genai import types

contents = [
    types.Content(role="user",  parts=[types.Part(text="Hi")]),
    types.Content(role="model", parts=[types.Part(text="Hello!")]),
    types.Content(role="user",  parts=[types.Part(text="What did I just say?")]),
]
response = client.models.generate_content(
    model="gemini-2.5-flash", contents=contents
)

role is "user" or "model" (not "assistant" like OpenAI). System messages don’t go in contents — see pattern 3.

2. Streaming responses#

for chunk in client.models.generate_content_stream(
    model="gemini-2.5-flash",
    contents="Tell me a story in 300 words."
):
    print(chunk.text, end="", flush=True)

Underlying transport is Server-Sent Events for the REST API. The SDK gives you a Python iterator (or async iterator). For real-time voice / video, you want the Live API instead — see the bottom of this page.

3. System instructions#

System instructions go in a dedicated parameter, not as the first message in contents:

from google.genai import types

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Why is the sky blue?",
    config=types.GenerateContentConfig(
        system_instruction="Respond in exactly one sentence.",
        temperature=0.3,
        max_output_tokens=100,
    ),
)

Why this matters:

  • It persists across turns in a chat without taking a slot in contents
  • It’s not subject to user-injected prompt mutation (cleaner separation)
  • Switching system instructions mid-conversation is a single config change

This contrasts with OpenAI’s pattern (system message as the first item in messages); migrating from OpenAI requires lifting the system message out into the config.

4. Files API (over 100 MB / PDFs over 50 MB)#

Inline base64 encoding is fine for small files. For large media, use the Files API:

client = genai.Client()

# Upload once
file = client.files.upload(file="big-presentation.pdf")
# file.uri is now a gs:// reference Gemini can read

response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents=[file, "Summarise the key arguments."],
)
print(response.text)

Key facts:

  • Files persist for 48 hours then auto-delete; or call client.files.delete(file.name)
  • Per-file max: 2 GB (general Files API limit); 50 MB threshold for inline PDFs; 100 MB total request size threshold
  • Supported types: PDFs, images, audio, video, generic documents

For repeated reuse of a large input, also see context caching (next).

5. Context caching (cost optimisation for repeated context)#

If you’re feeding the same large input (a 200K-token codebase, a long PDF, a 1M-token book) into many prompts, cache it once:

from google.genai import types

cache = client.caches.create(
    model="gemini-2.5-flash",
    config=types.CreateCachedContentConfig(
        contents=[file],          # the big context
        system_instruction="You are a helpful book reviewer.",
        ttl="3600s",              # 1 hour
    ),
)

# Subsequent calls reference the cache
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="What's the main argument of chapter 3?",
    config=types.GenerateContentConfig(cached_content=cache.name),
)

Pricing for gemini-2.5-flash (as of mid-2026):

  • Cache storage: $1.00 per 1M tokens per hour
  • Cache read: $0.03 per 1M tokens (text/image/video) / $0.10 per 1M tokens (audio)

For workloads that re-read 100K+ tokens of context per request, caching can drop your bill by an order of magnitude.

6. Structured output (JSON with guaranteed schema)#

Set response_mime_type and response_json_schema. Gemini uses constrained decoding — the docs describe this as returning schema-valid JSON (rather than best-effort prompt-based JSON).

Python with Pydantic (cleanest)#

from google.genai import types
from pydantic import BaseModel
from typing import List, Optional

class Ingredient(BaseModel):
    name: str
    quantity: str

class Recipe(BaseModel):
    recipe_name: str
    prep_time_minutes: Optional[int]
    ingredients: List[Ingredient]
    instructions: List[str]

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Extract this recipe text: ...",
    config=types.GenerateContentConfig(
        response_mime_type="application/json",
        response_json_schema=Recipe.model_json_schema(),
    ),
)
recipe = Recipe.model_validate_json(response.text)

TypeScript with Zod#

import { GoogleGenAI } from "@google/genai";
import { z } from "zod";
import { zodToJsonSchema } from "zod-to-json-schema";

const recipeSchema = z.object({
  recipe_name: z.string(),
  ingredients: z.array(z.object({ name: z.string(), quantity: z.string() })),
  instructions: z.array(z.string()),
});

const response = await ai.models.generateContent({
  model: "gemini-2.5-flash",
  contents: "Extract this recipe: ...",
  config: {
    responseFormat: {
      text: {
        mimeType: "application/json",
        schema: zodToJsonSchema(recipeSchema),
      },
    },
  },
});
const recipe = recipeSchema.parse(JSON.parse(response.text));

Without a schema (just JSON mode)#

config=types.GenerateContentConfig(response_mime_type="application/json")

Output is JSON; structure is up to the model. Useful when the shape varies per request.

7. Safety settings#

Four adjustable categories:

  • HARM_CATEGORY_HARASSMENT
  • HARM_CATEGORY_HATE_SPEECH
  • HARM_CATEGORY_SEXUALLY_EXPLICIT
  • HARM_CATEGORY_DANGEROUS_CONTENT

Five block thresholds:

API valueAI Studio labelBlocks
OFFOffNothing
BLOCK_NONEBlock noneAlways show
BLOCK_ONLY_HIGHBlock fewOnly HIGH probability
BLOCK_MEDIUM_AND_ABOVEBlock someMEDIUM + HIGH
BLOCK_LOW_AND_ABOVEBlock mostLOW + MEDIUM + HIGH
from google.genai import types

config = types.GenerateContentConfig(
    safety_settings=[
        types.SafetySetting(
            category=types.HarmCategory.HARM_CATEGORY_HARASSMENT,
            threshold=types.HarmBlockThreshold.BLOCK_ONLY_HIGH,
        ),
    ],
)

Key default: for Gemini 2.5 and 3 models, the adjustable filters default to OFF. Built-in non-adjustable blocks for child-safety harms remain on at all times. Filtering is probability-based, not severity-based — adjust if you find the defaults too lenient or too strict for your domain.

Live API (real-time voice + video)#

For low-latency conversational apps (think Gemini app’s voice mode), the Live API uses a WebSocket connection rather than HTTP:

  • Input: 16-bit PCM audio at 16 kHz, JPEG images at ≤1 fps, text
  • Output: 16-bit PCM audio at 24 kHz, text
  • Supports interruption (“barge-in”) mid-response
  • Function calling and Google Search grounding work over Live too
  • 70+ languages

Models: gemini-3.1-flash-live-preview and gemini-2.5-flash-native-audio-preview-12-2025 are the current Live API models. (An earlier gemini-live-2.5-flash-preview was shut down in December 2025.) Test in AI Studio’s Realtime streaming tab before wiring it into your app.

What’s next#

Sources