Gemini 2.0 Multimodal API: Image, Audio and Video in One Call

Use Gemini 2.0 Flash to process images, audio, and video in a single API call. Includes Python code, file upload, and streaming examples.

What Gemini 2.0 Multimodal Gives You

Most LLM APIs make you choose: text OR image OR audio. Gemini 2.0 Flash accepts all three in a single request — and returns grounded, cross-modal answers.

That matters when you're building apps that need to reason across modalities: "What did the speaker say, and does the slide behind them match?" or "Transcribe this call recording and flag anything that contradicts the attached contract."

You'll learn:

  • How to send images, audio files, and video to Gemini 2.0 in one API call
  • When to use inline base64 vs the File API for large media
  • How to stream multimodal responses for real-time UX

Time: 20 min | Difficulty: Intermediate


How Gemini 2.0 Multimodal Works

Gemini 2.0 Flash is a natively multimodal model — it was trained on interleaved text, image, audio, and video tokens. You don't call separate endpoints or chain models. One /generateContent request accepts a parts array containing any mix of:

User message
├── text part       → "Describe what's happening"
├── image part      → JPEG / PNG / WebP / GIF (inline or File API)
├── audio part      → MP3 / WAV / FLAC / OGG (File API recommended)
└── video part      → MP4 / MOV / AVI (File API required)

The model attends to all parts together. There is no separate fusion step you manage.

Context window: Gemini 2.0 Flash supports 1M tokens. A 1-hour video sampled at 1fps ≈ 3,600 frames ≈ ~1M tokens (about 258 tokens per frame, plus the audio track). Stay under 20 min of video for comfortable headroom alongside your text prompt.
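
The arithmetic above can be sketched as a quick budgeting helper. The per-unit figures are assumptions based on Gemini's published guidance (roughly 258 tokens per sampled frame and 32 tokens per second of audio); check the current docs before relying on them.

```python
# Rough input-token estimate for a video prompt, using assumed per-unit costs.
FRAME_TOKENS = 258            # per sampled frame (1 fps by default)
AUDIO_TOKENS_PER_SECOND = 32  # per second of the audio track

def estimate_video_tokens(duration_sec: int, fps: float = 1.0) -> int:
    """Approximate input tokens for a video of the given length."""
    frames = int(duration_sec * fps)
    return frames * FRAME_TOKENS + duration_sec * AUDIO_TOKENS_PER_SECOND

print(estimate_video_tokens(20 * 60))   # a 20-minute video → 348000
```

At that rate a 20-minute clip consumes roughly a third of the 1M-token window, which is where the headroom advice above comes from.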


Setup

Install the SDK:

pip install google-genai

Then configure the client:

import os
from google import genai
from google.genai import types

client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])
MODEL = "gemini-2.0-flash"  # swap in a Pro-tier model if you need higher quality

Get your API key at aistudio.google.com. The free tier covers 15 requests/min — enough for development.


Solution

Step 1: Send an Image Inline

Small images can go inline as raw bytes; the SDK base64-encodes them into the JSON payload. The whole request (media plus prompt) must stay under 20MB. No upload step needed.

from pathlib import Path

def encode_image(path: str) -> bytes:
    # return raw bytes; the SDK base64-encodes them into the JSON payload
    return Path(path).read_bytes()

response = client.models.generate_content(
    model=MODEL,
    contents=[
        types.Content(parts=[
            types.Part(text="What's wrong with the UI in this screenshot? Be specific."),
            types.Part(
                inline_data=types.Blob(
                    mime_type="image/png",
                    data=encode_image("screenshot.png"),
                )
            ),
        ])
    ],
)

print(response.text)

Expected output: A specific critique of the UI, referencing actual elements in the image.

If it fails:

  • 400 Invalid MIME type → Gemini 2.0 accepts image/png, image/jpeg, image/webp, image/gif. Convert other formats first.
  • 413 Request too large → Image exceeds inline limit. Use the File API (Step 3).
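
A small guard helps route media to the right path automatically. A sketch, assuming the 20MB cap applies to the whole request, so leave headroom for the prompt and JSON overhead:

```python
import os

INLINE_LIMIT_BYTES = 20 * 1024 * 1024  # documented cap on total request size

def fits_inline(media_bytes: int, headroom_bytes: int = 1_000_000) -> bool:
    """True if media of this size can go inline with room to spare."""
    return media_bytes + headroom_bytes < INLINE_LIMIT_BYTES

# usage: fits_inline(os.path.getsize("screenshot.png"))
```

Anything that fails the check goes through the File API in Step 3 instead.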

Step 2: Send Audio Inline

Audio under 20MB can go inline. Gemini 2.0 supports transcription, speaker diarization, and semantic analysis in one shot.

response = client.models.generate_content(
    model=MODEL,
    contents=[
        types.Content(parts=[
            types.Part(text="Transcribe this recording. Then summarize the three main topics."),
            types.Part(
                inline_data=types.Blob(
                    mime_type="audio/mp3",
                    data=encode_image("meeting_clip.mp3"),  # the same helper works for audio
                )
            ),
        ])
    ],
)

print(response.text)

Supported audio formats: audio/mp3, audio/wav, audio/flac, audio/ogg, audio/aac, audio/webm

Rate limit note: Audio tokens count toward context at roughly 32 tokens per second, so a 10-minute MP3 ≈ 19k tokens regardless of sample rate. Keep clips under 30 min for single requests.


Step 3: Upload Large Files with the File API

Files over 20MB — or any video — must go through the File API. Files are stored for 48 hours; you reuse the URI across requests.

import time

def upload_file(path: str, mime_type: str):
    """Upload and wait until the file is ACTIVE before using it."""
    file = client.files.upload(
        file=path,  # note: the parameter is `file`, not `path`
        config=types.UploadFileConfig(mime_type=mime_type),
    )

    # File processing is async; poll until the state leaves PROCESSING
    while file.state == types.FileState.PROCESSING:
        time.sleep(2)
        file = client.files.get(name=file.name)

    if file.state == types.FileState.FAILED:
        raise RuntimeError(f"File processing failed: {file.name}")

    return file

# Upload a video once, reuse the URI in multiple prompts
video_file = upload_file("demo.mp4", "video/mp4")

response = client.models.generate_content(
    model=MODEL,
    contents=[
        types.Content(parts=[
            types.Part(text="List every UI component shown in this demo video with timestamps."),
            types.Part(
                file_data=types.FileData(
                    mime_type=video_file.mime_type,
                    file_uri=video_file.uri,
                )
            ),
        ])
    ],
)

print(response.text)

Expected output: A timestamped list like [0:03] Navigation bar — [0:12] Modal dialog — [0:31] Data table

If it fails:

  • File state: FAILED → Corrupted file or unsupported codec. Re-encode with ffmpeg -i input.mp4 -c:v libx264 output.mp4
  • File not found → Files expire after 48 hours. Re-upload.
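
Since uploads expire after 48 hours, a small cache can hand back a fresh file automatically. A sketch: upload_fn is injected so the logic stays testable without network access, and in practice it would be the upload_file helper above.

```python
import time

FILE_TTL_SECONDS = 48 * 3600   # documented File API retention window

class UploadCache:
    """Reuse uploaded files until shortly before they expire."""

    def __init__(self, upload_fn, ttl: int = FILE_TTL_SECONDS, margin: int = 3600):
        self._upload = upload_fn        # e.g. the upload_file helper above
        self._max_age = ttl - margin    # refresh an hour early, to be safe
        self._entries = {}              # path -> (file, uploaded_at)

    def get(self, path: str, mime_type: str):
        entry = self._entries.get(path)
        if entry and time.time() - entry[1] < self._max_age:
            return entry[0]             # still fresh: reuse the upload
        file = self._upload(path, mime_type)
        self._entries[path] = (file, time.time())
        return file
```

With this in place, repeated prompts against the same video cost one upload instead of one per call.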

Step 4: Combine Modalities in One Request

This is where Gemini 2.0 earns its keep. Mix image, audio, and text parts freely.

# Scenario: check if a sales call recording matches the product slide deck
slide_image = upload_file("product_slide.png", "image/png")
call_audio = upload_file("sales_call.mp3", "audio/mp3")

response = client.models.generate_content(
    model=MODEL,
    contents=[
        types.Content(parts=[
            types.Part(
                text=(
                    "Listen to the sales call. Look at the product slide. "
                    "Identify any claims the rep made that are NOT supported by the slide. "
                    "Format as a bullet list."
                )
            ),
            types.Part(
                file_data=types.FileData(
                    mime_type=slide_image.mime_type,
                    file_uri=slide_image.uri,
                )
            ),
            types.Part(
                file_data=types.FileData(
                    mime_type=call_audio.mime_type,
                    file_uri=call_audio.uri,
                )
            ),
        ])
    ],
)

print(response.text)

No orchestration between models. No intermediate outputs to parse. One call returns the cross-modal analysis.
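
If downstream code needs to consume that bullet list, a small parser keeps the glue honest. (google-genai can also constrain output to JSON via GenerateContentConfig's response_mime_type and response_schema; the regex below is the low-tech version.)

```python
import re

# Matches "-", "*", or "•" bullets, since the prompt asked for a bullet list.
_BULLET = re.compile(r"^\s*[-*•]\s+(.*\S)\s*$")

def parse_bullets(text: str) -> list[str]:
    """Extract bullet items from the model's free-text answer."""
    return [m.group(1) for line in text.splitlines() if (m := _BULLET.match(line))]
```

Feed it response.text to get one string per unsupported claim.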


Step 5: Stream the Response

For long video analysis, streaming prevents UI timeouts and lets you show progress.

# generate_content_stream yields partial responses as the model produces them
for chunk in client.models.generate_content_stream(
    model=MODEL,
    contents=[
        types.Content(parts=[
            types.Part(text="Give a detailed chapter breakdown of this lecture video."),
            types.Part(
                file_data=types.FileData(
                    mime_type=video_file.mime_type,
                    file_uri=video_file.uri,
                )
            ),
        ])
    ],
):
    print(chunk.text, end="", flush=True)

Chunks arrive as the model generates — typically within 1–2 seconds for the first token on video inputs.


Verification

Run this end-to-end test with a small local image:

import urllib.request

# Grab a public test image
urllib.request.urlretrieve(
    "https://upload.wikimedia.org/wikipedia/commons/thumb/4/47/PNG_transparency_demonstration_1.png/280px-PNG_transparency_demonstration_1.png",
    "test.png"
)

response = client.models.generate_content(
    model=MODEL,
    contents=[
        types.Content(parts=[
            types.Part(text="Describe this image in one sentence."),
            types.Part(
                inline_data=types.Blob(
                    mime_type="image/png",
                    data=open("test.png", "rb").read(),  # raw bytes; the SDK handles base64
                )
            ),
        ])
    ],
)

assert len(response.text) > 10, "Got empty response"
print("✅ Multimodal API working:", response.text)

You should see: ✅ Multimodal API working: [description of the image]


Token Cost Reference

Input type             Approx tokens    Notes
1 image (≤ 384px)      258              Larger images are tiled; 258 tokens per 768×768 crop
1 min audio            ~1,920           32 tokens per second, regardless of sample rate
1 min video (1fps)     ~17,400          258 tokens per frame plus the audio track

Gemini 2.0 Flash pricing as of March 2026: $0.075 / 1M input tokens. A 10-min video analysis ≈ 174k tokens ≈ ~$0.013 per call.
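
For budgeting, the arithmetic is one line. The price constant is the figure quoted above, so verify it against current pricing before depending on it; for ground truth on token counts, client.models.count_tokens accepts the same contents as generate_content.

```python
PRICE_PER_M_INPUT = 0.075   # $ per 1M input tokens, per the note above; verify before budgeting

def call_cost_usd(input_tokens: int, price_per_m: float = PRICE_PER_M_INPUT) -> float:
    """Approximate input cost of one call, in dollars."""
    return input_tokens / 1_000_000 * price_per_m

print(call_cost_usd(174_000))   # ~10 minutes of video
```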


What You Learned

  • Gemini 2.0 Flash accepts image, audio, and video in a single parts array — no multi-step pipeline needed
  • Use inline bytes for requests under 20MB total; use the File API for video and larger assets
  • The File API is async — always poll for ACTIVE state before sending the file URI in a prompt
  • Streaming (generate_content_stream) is essential for video inputs longer than ~2 minutes

Limitation: Gemini 2.0 doesn't support real-time audio streaming in the standard REST API — that's the Live API (separate endpoint, WebSocket-based). Use the approach above for pre-recorded media.

Tested on google-genai 1.7.0, Python 3.12, Gemini 2.0 Flash (gemini-2.0-flash)