What Gemini 2.0 Multimodal Gives You
Most LLM APIs make you choose: text OR image OR audio. Gemini 2.0 Flash accepts all three in a single request — and returns grounded, cross-modal answers.
That matters when you're building apps that need to reason across modalities: "What did the speaker say, and does the slide behind them match?" or "Transcribe this call recording and flag anything that contradicts the attached contract."
You'll learn:
- How to send images, audio files, and video to Gemini 2.0 in one API call
- When to use inline base64 vs the File API for large media
- How to stream multimodal responses for real-time UX
Time: 20 min | Difficulty: Intermediate
How Gemini 2.0 Multimodal Works
Gemini 2.0 Flash is a natively multimodal model — it was trained on interleaved text, image, audio, and video tokens. You don't call separate endpoints or chain models. One /generateContent request accepts a parts array containing any mix of:
User message
├── text part → "Describe what's happening"
├── image part → JPEG / PNG / WebP / GIF (inline or File API)
├── audio part → MP3 / WAV / FLAC / OGG (File API recommended)
└── video part → MP4 / MOV / AVI (File API required)
The model attends to all parts together. There is no separate fusion step you manage.
Context window: Gemini 2.0 Flash supports 1M tokens. A 1-hour video sampled at 1fps ≈ 3,600 frames ≈ ~950k tokens, which nearly fills the window on its own. Stay under 20 min of video (~315k tokens) for comfortable headroom alongside your text prompt.
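You can budget this before uploading with a back-of-envelope estimator. The per-second rate below is an approximation of how Gemini tokenizes video (sampled frames plus the audio track), not the API's exact accounting:

```python
VIDEO_TOKENS_PER_SEC = 263  # approx: frames sampled at 1fps + audio track

def estimate_video_tokens(duration_sec: float) -> int:
    """Rough token count for a video input."""
    return int(duration_sec * VIDEO_TOKENS_PER_SEC)

def fits_in_context(video_sec: float, prompt_tokens: int = 2_000,
                    context_limit: int = 1_000_000) -> bool:
    """Check that a video leaves headroom for the text prompt in the 1M window."""
    return estimate_video_tokens(video_sec) + prompt_tokens < context_limit

print(estimate_video_tokens(20 * 60))  # → 315600
```

Run this against your media before an upload to avoid burning a request on a prompt that won't fit.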
Setup
pip install google-genai
import os
from google import genai
from google.genai import types
client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])
MODEL = "gemini-2.0-flash" # gemini-2.0-pro available for higher quality
Get your API key at aistudio.google.com. The free tier covers 15 requests/min — enough for development.
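The 15 requests/min ceiling is easy to hit from a dev loop. Here is a minimal retry sketch with exponential backoff; the exact rate-limit exception class varies by SDK version, so it is left as a parameter rather than hard-coded:

```python
import time
from functools import wraps

def with_backoff(max_retries: int = 3, base_delay: float = 2.0,
                 retry_on: type[Exception] = Exception):
    """Retry a callable with exponential backoff. Narrow `retry_on` to
    your SDK's rate-limit error class in real code."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries + 1):
                try:
                    return fn(*args, **kwargs)
                except retry_on:
                    if attempt == max_retries:
                        raise  # out of retries: surface the error
                    time.sleep(base_delay * 2 ** attempt)
        return wrapper
    return decorator
```

Wrap your `generate_content` calls with it during development so transient 429s don't kill a batch run.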
Solution
Step 1: Send an Image Inline
Images can go inline as base64 as long as the total request stays under the 20MB size limit. No upload step needed.
import base64
from pathlib import Path
def encode_image(path: str) -> str:
# base64-encode so the API can receive it in a JSON payload
return base64.b64encode(Path(path).read_bytes()).decode("utf-8")
response = client.models.generate_content(
model=MODEL,
contents=[
types.Content(parts=[
types.Part(text="What's wrong with the UI in this screenshot? Be specific."),
types.Part(
inline_data=types.Blob(
mime_type="image/png",
data=encode_image("screenshot.png"),
)
),
])
],
)
print(response.text)
Expected output: A specific critique of the UI, referencing actual elements in the image.
If it fails:
- `400 Invalid MIME type` → Gemini 2.0 accepts `image/png`, `image/jpeg`, `image/webp`, `image/gif`. Convert other formats first.
- `413 Request too large` → Image exceeds the inline limit. Use the File API (Step 3).
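You can catch the MIME problem locally before spending a request. A small fail-fast helper (the function name and strictness are my own choices, not part of the SDK):

```python
import mimetypes

SUPPORTED_IMAGE_TYPES = {"image/png", "image/jpeg", "image/webp", "image/gif"}

def image_mime_type(path: str) -> str:
    """Guess the MIME type from the file extension and fail fast if
    Gemini won't accept it, instead of waiting for a 400 from the API."""
    mime, _ = mimetypes.guess_type(path)
    if mime not in SUPPORTED_IMAGE_TYPES:
        raise ValueError(f"{path}: unsupported image type {mime!r}; "
                         "convert to PNG/JPEG/WebP/GIF first")
    return mime
```

Use its return value as the `mime_type` for the `Blob` so the extension and the declared type can't drift apart.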
Step 2: Send Audio Inline
Audio under 20MB can go inline. Gemini 2.0 supports transcription, speaker diarization, and semantic analysis in one shot.
response = client.models.generate_content(
model=MODEL,
contents=[
types.Content(parts=[
types.Part(text="Transcribe this recording. Then summarize the three main topics."),
types.Part(
inline_data=types.Blob(
mime_type="audio/mp3",
data=encode_image("meeting_clip.mp3"), # same base64 helper works
)
),
])
],
)
print(response.text)
Supported audio formats: audio/mp3, audio/wav, audio/flac, audio/ogg, audio/aac, audio/webm
Rate limit note: Audio tokens count toward context. Audio is tokenized at roughly 32 tokens/sec, so a 10-minute MP3 ≈ ~19k tokens. Keep clips under 30 min for single requests.
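Choosing between inline and the File API can be automated from the file size. A sketch; the 4/3 factor accounts for base64 inflation against the 20MB request cap:

```python
INLINE_LIMIT_BYTES = 20 * 1024 * 1024  # ~20MB total request size cap

def transport_for(size_bytes: int) -> str:
    """Pick inline base64 vs the File API for a media file of this size.
    Base64 inflates payloads by ~33%, so compare the encoded size."""
    encoded_size = size_bytes * 4 / 3
    return "inline" if encoded_size < INLINE_LIMIT_BYTES else "file_api"

# e.g. transport_for(os.path.getsize("meeting_clip.mp3"))
```

This keeps the routing decision in one place instead of scattering size checks across call sites.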
Step 3: Upload Large Files with the File API
Files over 20MB — or any video — must go through the File API. Files are stored for 48 hours; you reuse the URI across requests.
import time
def upload_file(path: str, mime_type: str):
"""Upload and wait until the file is ACTIVE before using it."""
file = client.files.upload(
path=path,
config=types.UploadFileConfig(mime_type=mime_type),
)
# File processing is async — poll until state is ACTIVE
while file.state == "PROCESSING":
time.sleep(2)
file = client.files.get(name=file.name)
if file.state == "FAILED":
raise RuntimeError(f"File processing failed: {file.name}")
return file
# Upload a video once, reuse the URI in multiple prompts
video_file = upload_file("demo.mp4", "video/mp4")
response = client.models.generate_content(
model=MODEL,
contents=[
types.Content(parts=[
types.Part(text="List every UI component shown in this demo video with timestamps."),
types.Part(
file_data=types.FileData(
mime_type=video_file.mime_type,
file_uri=video_file.uri,
)
),
])
],
)
print(response.text)
Expected output: A timestamped list like [0:03] Navigation bar — [0:12] Modal dialog — [0:31] Data table
If it fails:
- `File state: FAILED` → Corrupted file or unsupported codec. Re-encode with `ffmpeg -i input.mp4 -c:v libx264 output.mp4`.
- `File not found` → Files expire after 48 hours. Re-upload.
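One refinement: the fixed 2-second poll in `upload_file` above can spin forever if a file never leaves PROCESSING. A deadline-bounded variant is safer; here `get_state` is a stand-in for a call like `client.files.get(name=...).state`:

```python
import time

def wait_until_active(get_state, timeout_sec: float = 120, poll_sec: float = 2.0):
    """Poll `get_state()` until it returns "ACTIVE", with a hard deadline
    so a stuck upload can't hang the request path forever."""
    deadline = time.monotonic() + timeout_sec
    while time.monotonic() < deadline:
        state = get_state()
        if state == "ACTIVE":
            return
        if state == "FAILED":
            raise RuntimeError("file processing failed")
        time.sleep(poll_sec)
    raise TimeoutError("file never became ACTIVE")
```

Because it takes a callable, you can unit-test the loop without touching the network.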
Step 4: Combine Modalities in One Request
This is where Gemini 2.0 earns its keep. Mix image, audio, and text parts freely.
# Scenario: check if a sales call recording matches the product slide deck
slide_image = upload_file("product_slide.png", "image/png")
call_audio = upload_file("sales_call.mp3", "audio/mp3")
response = client.models.generate_content(
model=MODEL,
contents=[
types.Content(parts=[
types.Part(
text=(
"Listen to the sales call. Look at the product slide. "
"Identify any claims the rep made that are NOT supported by the slide. "
"Format as a bullet list."
)
),
types.Part(
file_data=types.FileData(
mime_type=slide_image.mime_type,
file_uri=slide_image.uri,
)
),
types.Part(
file_data=types.FileData(
mime_type=call_audio.mime_type,
file_uri=call_audio.uri,
)
),
])
],
)
print(response.text)
No orchestration between models. No intermediate outputs to parse. One call returns the cross-modal analysis.
Step 5: Stream the Response
For long video analysis, streaming prevents UI timeouts and lets you show progress.
# generate_content_stream yields partial responses as a generator
for chunk in client.models.generate_content_stream(
model=MODEL,
contents=[
types.Content(parts=[
types.Part(text="Give a detailed chapter breakdown of this lecture video."),
types.Part(
file_data=types.FileData(
mime_type=video_file.mime_type,
file_uri=video_file.uri,
)
),
])
],
):
print(chunk.text, end="", flush=True)
Chunks arrive as the model generates — typically within 1–2 seconds for the first token on video inputs.
Verification
Run this end-to-end test with a small local image:
import urllib.request, base64
# Grab a public test image
urllib.request.urlretrieve(
"https://upload.wikimedia.org/wikipedia/commons/thumb/4/47/PNG_transparency_demonstration_1.png/280px-PNG_transparency_demonstration_1.png",
"test.png"
)
response = client.models.generate_content(
model=MODEL,
contents=[
types.Content(parts=[
types.Part(text="Describe this image in one sentence."),
types.Part(
inline_data=types.Blob(
mime_type="image/png",
data=base64.b64encode(open("test.png", "rb").read()).decode(),
)
),
])
],
)
assert len(response.text) > 10, "Got empty response"
print("✅ Multimodal API working:", response.text)
You should see: ✅ Multimodal API working: [description of the image]
Token Cost Reference
| Input type | Approx tokens | Notes |
|---|---|---|
| 1 image | 258 tokens | Flat for images up to 384×384; larger images are tiled at 258 tokens per tile |
| 1 min audio | ~1,900 tokens | ~32 tokens/sec |
| 1 min video (default 1fps) | ~15,800 tokens | ~263 tokens/sec, sampled frames plus audio track |
Gemini 2.0 Flash pricing as of March 2026: $0.075 / 1M input tokens. A 10-min video analysis ≈ ~160k tokens ≈ ~$0.012 per call.
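A tiny helper makes it easy to sanity-check spend before batching calls; the rate is a parameter since pricing changes:

```python
def estimate_input_cost(tokens: int, usd_per_million: float = 0.075) -> float:
    """Input-side cost only; output tokens are billed at a separate rate."""
    return tokens / 1_000_000 * usd_per_million

# A ~160k-token video prompt at $0.075 per 1M input tokens:
print(f"${estimate_input_cost(160_000):.4f}")  # → $0.0120
```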
What You Learned
- Gemini 2.0 Flash accepts image, audio, and video in a single `parts` array — no multi-step pipeline needed
- Use inline base64 for files under 20MB; use the File API for video and larger assets
- The File API is async — always poll for `ACTIVE` state before sending the file URI in a prompt
- Streaming (`generate_content_stream`) is essential for video inputs longer than ~2 minutes
Limitation: Gemini 2.0 doesn't support real-time audio streaming in the standard REST API — that's the Live API (separate endpoint, WebSocket-based). Use the approach above for pre-recorded media.
Tested on google-genai 1.7.0, Python 3.12, Gemini 2.0 Flash (gemini-2.0-flash)