Gemini API — common patterns

What this page covers#

Seven patterns you’ll reach for in almost every project:

Multi-turn chat (the contents list shape)
Streaming responses
System instructions (a separate parameter)
The Files API (inputs over 100 MB, PDFs over 50 MB)
Context caching (cost savings for repeated large context)
Structured output (JSON with constrained decoding)
Safety settings

Plus a brief note on the Live API for real-time voice/video.

1. Multi-turn chat#

The simplest path uses the chats helper:

from google import genai

client = genai.Client()
chat = client.chats.create(model="gemini-2.5-flash")

print(chat.send_message("My name is Sush.").text)
print(chat.send_message("What's my name?").text)

Under the hood, chats accumulates contents and resends them each turn. For more control, build the contents list yourself:

from google.genai import types

contents = [
    types.Content(role="user",  parts=[types.Part(text="Hi")]),
    types.Content(role="model", parts=[types.Part(text="Hello!")]),
    types.Content(role="user",  parts=[types.Part(text="What did I just say?")]),
]
response = client.models.generate_content(
    model="gemini-2.5-flash", contents=contents
)

role is "user" or "model" (not "assistant" like OpenAI). System messages don’t go in contents — see pattern 3.

2. Streaming responses#

for chunk in client.models.generate_content_stream(
    model="gemini-2.5-flash",
    contents="Tell me a story in 300 words."
):
    print(chunk.text, end="", flush=True)

Underlying transport is Server-Sent Events for the REST API. The SDK gives you a Python iterator (or async iterator). For real-time voice / video, you want the Live API instead — see the bottom of this page.

3. System instructions#

System instructions go in a dedicated parameter, not as the first message in contents:

from google.genai import types

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Why is the sky blue?",
    config=types.GenerateContentConfig(
        system_instruction="Respond in exactly one sentence.",
        temperature=0.3,
        max_output_tokens=100,
    ),
)

Why this matters:

It persists across turns in a chat without taking a slot in contents
It’s not subject to user-injected prompt mutation (cleaner separation)
Switching system instructions mid-conversation is a single config change

This contrasts with OpenAI’s pattern (system message as the first item in messages); migrating from OpenAI requires lifting the system message out into the config.

4. Files API (over 100 MB / PDFs over 50 MB)#

Inline base64 encoding is fine for small files. For large media, use the Files API:

client = genai.Client()

# Upload once
file = client.files.upload(file="big-presentation.pdf")
# file.uri is now a gs:// reference Gemini can read

response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents=[file, "Summarise the key arguments."],
)
print(response.text)

Key facts:

Files persist for 48 hours then auto-delete; or call client.files.delete(file.name)
Per-file max: 2 GB (general Files API limit); 50 MB threshold for inline PDFs; 100 MB total request size threshold
Supported types: PDFs, images, audio, video, generic documents

For repeated reuse of a large input, also see context caching (next).

5. Context caching (cost optimisation for repeated context)#

If you’re feeding the same large input (a 200K-token codebase, a long PDF, a 1M-token book) into many prompts, cache it once:

from google.genai import types

cache = client.caches.create(
    model="gemini-2.5-flash",
    config=types.CreateCachedContentConfig(
        contents=[file],          # the big context
        system_instruction="You are a helpful book reviewer.",
        ttl="3600s",              # 1 hour
    ),
)

# Subsequent calls reference the cache
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="What's the main argument of chapter 3?",
    config=types.GenerateContentConfig(cached_content=cache.name),
)

Pricing for gemini-2.5-flash (as of mid-2026):

Cache storage: $1.00 per 1M tokens per hour
Cache read: $0.03 per 1M tokens (text/image/video) / $0.10 per 1M tokens (audio)

For workloads that re-read 100K+ tokens of context per request, caching can drop your bill by an order of magnitude.

6. Structured output (JSON with guaranteed schema)#

Set response_mime_type and response_json_schema. Gemini uses constrained decoding — the docs describe this as returning schema-valid JSON (rather than best-effort prompt-based JSON).

Python with Pydantic (cleanest)#

from google.genai import types
from pydantic import BaseModel
from typing import List, Optional

class Ingredient(BaseModel):
    name: str
    quantity: str

class Recipe(BaseModel):
    recipe_name: str
    prep_time_minutes: Optional[int]
    ingredients: List[Ingredient]
    instructions: List[str]

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Extract this recipe text: ...",
    config=types.GenerateContentConfig(
        response_mime_type="application/json",
        response_json_schema=Recipe.model_json_schema(),
    ),
)
recipe = Recipe.model_validate_json(response.text)

TypeScript with Zod#

import { GoogleGenAI } from "@google/genai";
import { z } from "zod";
import { zodToJsonSchema } from "zod-to-json-schema";

const recipeSchema = z.object({
  recipe_name: z.string(),
  ingredients: z.array(z.object({ name: z.string(), quantity: z.string() })),
  instructions: z.array(z.string()),
});

const response = await ai.models.generateContent({
  model: "gemini-2.5-flash",
  contents: "Extract this recipe: ...",
  config: {
    responseFormat: {
      text: {
        mimeType: "application/json",
        schema: zodToJsonSchema(recipeSchema),
      },
    },
  },
});
const recipe = recipeSchema.parse(JSON.parse(response.text));

Without a schema (just JSON mode)#

config=types.GenerateContentConfig(response_mime_type="application/json")

Output is JSON; structure is up to the model. Useful when the shape varies per request.

7. Safety settings#

Four adjustable categories:

HARM_CATEGORY_HARASSMENT
HARM_CATEGORY_HATE_SPEECH
HARM_CATEGORY_SEXUALLY_EXPLICIT
HARM_CATEGORY_DANGEROUS_CONTENT

Five block thresholds:

API value	AI Studio label	Blocks
`OFF`	Off	Nothing
`BLOCK_NONE`	Block none	Always show
`BLOCK_ONLY_HIGH`	Block few	Only HIGH probability
`BLOCK_MEDIUM_AND_ABOVE`	Block some	MEDIUM + HIGH
`BLOCK_LOW_AND_ABOVE`	Block most	LOW + MEDIUM + HIGH

from google.genai import types

config = types.GenerateContentConfig(
    safety_settings=[
        types.SafetySetting(
            category=types.HarmCategory.HARM_CATEGORY_HARASSMENT,
            threshold=types.HarmBlockThreshold.BLOCK_ONLY_HIGH,
        ),
    ],
)

Key default: for Gemini 2.5 and 3 models, the adjustable filters default to OFF. Built-in non-adjustable blocks for child-safety harms remain on at all times. Filtering is probability-based, not severity-based — adjust if you find the defaults too lenient or too strict for your domain.

Live API (real-time voice + video)#

For low-latency conversational apps (think Gemini app’s voice mode), the Live API uses a WebSocket connection rather than HTTP:

Input: 16-bit PCM audio at 16 kHz, JPEG images at ≤1 fps, text
Output: 16-bit PCM audio at 24 kHz, text
Supports interruption (“barge-in”) mid-response
Function calling and Google Search grounding work over Live too
70+ languages

Models: gemini-3.1-flash-live-preview and gemini-2.5-flash-native-audio-preview-12-2025 are the current Live API models. (An earlier gemini-live-2.5-flash-preview was shut down in December 2025.) Test in AI Studio’s Realtime streaming tab before wiring it into your app.

What’s next#

§GAPI.5 Tool use — function calling, Google Search grounding, code execution
§GAPI.3 Models — pick the right model for your pattern
§GAPI.2 Getting started — if you skipped here directly

`⌘` + `K` · `/`	open search
`j`	next entry (within section)
`k`	previous entry
`g` `h`	go to home
`g` `m`	go to methodology
`?`	show this help
`esc`	close any modal