Gemini 2.0 Multimodal API: Image, Audio and Video in One Call

Use Gemini 2.0 Flash to process images, audio, and video in a single API call. Includes Python code, file upload, and streaming examples.

What Gemini 2.0 Multimodal Gives You

Most LLM APIs make you choose: text OR image OR audio. Gemini 2.0 Flash accepts all three in a single request — and returns grounded, cross-modal answers.

That matters when you're building apps that need to reason across modalities: "What did the speaker say, and does the slide behind them match?" or "Transcribe this call recording and flag anything that contradicts the attached contract."

You'll learn:

  • How to send images, audio files, and video to Gemini 2.0 in one API call
  • When to use inline base64 vs the File API for large media
  • How to stream multimodal responses for real-time UX

Time: 20 min | Difficulty: Intermediate


How Gemini 2.0 Multimodal Works

Gemini 2.0 Flash is a natively multimodal model — it was trained on interleaved text, image, audio, and video tokens. You don't call separate endpoints or chain models. One /generateContent request accepts a parts array containing any mix of:

User message
├── text part       → "Describe what's happening"
├── image part      → JPEG / PNG / WebP / GIF (inline or File API)
├── audio part      → MP3 / WAV / FLAC / OGG (File API recommended)
└── video part      → MP4 / MOV / AVI (File API required)

The model attends to all parts together. There is no separate fusion step you manage.

Context window: Gemini 2.0 Flash supports 1M tokens. A 1-hour video sampled at 1fps ≈ 3,600 frames ≈ ~1M tokens (about 258 tokens per frame, plus the audio track). Stay under 20 min of video for comfortable headroom alongside your text prompt.
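
The arithmetic above can be sketched as a quick budgeting helper. The per-unit figures are assumptions based on Gemini's published guidance (roughly 258 tokens per sampled frame and 32 tokens per second of audio); check the current docs before relying on them.

```python
# Rough input-token estimate for a video prompt, using assumed per-unit costs.
FRAME_TOKENS = 258            # per sampled frame (1 fps by default)
AUDIO_TOKENS_PER_SECOND = 32  # per second of the audio track

def estimate_video_tokens(duration_sec: int, fps: float = 1.0) -> int:
    """Approximate input tokens for a video of the given length."""
    frames = int(duration_sec * fps)
    return frames * FRAME_TOKENS + duration_sec * AUDIO_TOKENS_PER_SECOND

print(estimate_video_tokens(20 * 60))   # a 20-minute video → 348000
```

At that rate a 20-minute clip consumes roughly a third of the 1M-token window, which is where the headroom advice above comes from.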


Setup

Install the SDK:

pip install google-genai

Then configure the client:

import os
from google import genai
from google.genai import types

client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])
MODEL = "gemini-2.0-flash"  # swap in a Pro-tier model if you need higher quality

Get your API key at aistudio.google.com. The free tier covers 15 requests/min — enough for development.


Solution

Step 1: Send an Image Inline

Small images can go inline as raw bytes; the SDK base64-encodes them into the JSON payload. The whole request (media plus prompt) must stay under 20MB. No upload step needed.

from pathlib import Path

def encode_image(path: str) -> bytes:
    # return raw bytes; the SDK base64-encodes them into the JSON payload
    return Path(path).read_bytes()

response = client.models.generate_content(
    model=MODEL,
    contents=[
        types.Content(parts=[
            types.Part(text="What's wrong with the UI in this screenshot? Be specific."),
            types.Part(
                inline_data=types.Blob(
                    mime_type="image/png",
                    data=encode_image("screenshot.png"),
                )
            ),
        ])
    ],
)

print(response.text)

Expected output: A specific critique of the UI, referencing actual elements in the image.

If it fails:

  • 400 Invalid MIME type → Gemini 2.0 accepts image/png, image/jpeg, image/webp, image/gif. Convert other formats first.
  • 413 Request too large → Image exceeds inline limit. Use the File API (Step 3).
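
A small guard helps route media to the right path automatically. A sketch, assuming the 20MB cap applies to the whole request, so leave headroom for the prompt and JSON overhead:

```python
import os

INLINE_LIMIT_BYTES = 20 * 1024 * 1024  # documented cap on total request size

def fits_inline(media_bytes: int, headroom_bytes: int = 1_000_000) -> bool:
    """True if media of this size can go inline with room to spare."""
    return media_bytes + headroom_bytes < INLINE_LIMIT_BYTES

# usage: fits_inline(os.path.getsize("screenshot.png"))
```

Anything that fails the check goes through the File API in Step 3 instead.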

Step 2: Send Audio Inline

Audio under 20MB can go inline. Gemini 2.0 supports transcription, speaker diarization, and semantic analysis in one shot.

response = client.models.generate_content(
    model=MODEL,
    contents=[
        types.Content(parts=[
            types.Part(text="Transcribe this recording. Then summarize the three main topics."),
            types.Part(
                inline_data=types.Blob(
                    mime_type="audio/mp3",
                    data=encode_image("meeting_clip.mp3"),  # the same helper works for audio
                )
            ),
        ])
    ],
)

print(response.text)

Supported audio formats: audio/mp3, audio/wav, audio/flac, audio/ogg, audio/aac, audio/webm

Rate limit note: Audio tokens count toward context at roughly 32 tokens per second, so a 10-minute MP3 ≈ 19k tokens regardless of sample rate. Keep clips under 30 min for single requests.


Step 3: Upload Large Files with the File API

Files over 20MB — or any video — must go through the File API. Files are stored for 48 hours; you reuse the URI across requests.

import time

def upload_file(path: str, mime_type: str):
    """Upload and wait until the file is ACTIVE before using it."""
    file = client.files.upload(
        file=path,  # note: the parameter is `file`, not `path`
        config=types.UploadFileConfig(mime_type=mime_type),
    )

    # File processing is async; poll until the state leaves PROCESSING
    while file.state == types.FileState.PROCESSING:
        time.sleep(2)
        file = client.files.get(name=file.name)

    if file.state == types.FileState.FAILED:
        raise RuntimeError(f"File processing failed: {file.name}")

    return file

# Upload a video once, reuse the URI in multiple prompts
video_file = upload_file("demo.mp4", "video/mp4")

response = client.models.generate_content(
    model=MODEL,
    contents=[
        types.Content(parts=[
            types.Part(text="List every UI component shown in this demo video with timestamps."),
            types.Part(
                file_data=types.FileData(
                    mime_type=video_file.mime_type,
                    file_uri=video_file.uri,
                )
            ),
        ])
    ],
)

print(response.text)

Expected output: A timestamped list like [0:03] Navigation bar — [0:12] Modal dialog — [0:31] Data table

If it fails:

  • File state: FAILED → Corrupted file or unsupported codec. Re-encode with ffmpeg -i input.mp4 -c:v libx264 output.mp4
  • File not found → Files expire after 48 hours. Re-upload.
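
Since uploads expire after 48 hours, a small cache can hand back a fresh file automatically. A sketch: upload_fn is injected so the logic stays testable without network access, and in practice it would be the upload_file helper above.

```python
import time

FILE_TTL_SECONDS = 48 * 3600   # documented File API retention window

class UploadCache:
    """Reuse uploaded files until shortly before they expire."""

    def __init__(self, upload_fn, ttl: int = FILE_TTL_SECONDS, margin: int = 3600):
        self._upload = upload_fn        # e.g. the upload_file helper above
        self._max_age = ttl - margin    # refresh an hour early, to be safe
        self._entries = {}              # path -> (file, uploaded_at)

    def get(self, path: str, mime_type: str):
        entry = self._entries.get(path)
        if entry and time.time() - entry[1] < self._max_age:
            return entry[0]             # still fresh: reuse the upload
        file = self._upload(path, mime_type)
        self._entries[path] = (file, time.time())
        return file
```

With this in place, repeated prompts against the same video cost one upload instead of one per call.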

Step 4: Combine Modalities in One Request

This is where Gemini 2.0 earns its keep. Mix image, audio, and text parts freely.

# Scenario: check if a sales call recording matches the product slide deck
slide_image = upload_file("product_slide.png", "image/png")
call_audio = upload_file("sales_call.mp3", "audio/mp3")

response = client.models.generate_content(
    model=MODEL,
    contents=[
        types.Content(parts=[
            types.Part(
                text=(
                    "Listen to the sales call. Look at the product slide. "
                    "Identify any claims the rep made that are NOT supported by the slide. "
                    "Format as a bullet list."
                )
            ),
            types.Part(
                file_data=types.FileData(
                    mime_type=slide_image.mime_type,
                    file_uri=slide_image.uri,
                )
            ),
            types.Part(
                file_data=types.FileData(
                    mime_type=call_audio.mime_type,
                    file_uri=call_audio.uri,
                )
            ),
        ])
    ],
)

print(response.text)

No orchestration between models. No intermediate outputs to parse. One call returns the cross-modal analysis.
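
If downstream code needs to consume that bullet list, a small parser keeps the glue honest. (google-genai can also constrain output to JSON via GenerateContentConfig's response_mime_type and response_schema; the regex below is the low-tech version.)

```python
import re

# Matches "-", "*", or "•" bullets, since the prompt asked for a bullet list.
_BULLET = re.compile(r"^\s*[-*•]\s+(.*\S)\s*$")

def parse_bullets(text: str) -> list[str]:
    """Extract bullet items from the model's free-text answer."""
    return [m.group(1) for line in text.splitlines() if (m := _BULLET.match(line))]
```

Feed it response.text to get one string per unsupported claim.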


Step 5: Stream the Response

For long video analysis, streaming prevents UI timeouts and lets you show progress.

# generate_content_stream yields partial responses as the model produces them
for chunk in client.models.generate_content_stream(
    model=MODEL,
    contents=[
        types.Content(parts=[
            types.Part(text="Give a detailed chapter breakdown of this lecture video."),
            types.Part(
                file_data=types.FileData(
                    mime_type=video_file.mime_type,
                    file_uri=video_file.uri,
                )
            ),
        ])
    ],
):
    print(chunk.text, end="", flush=True)

Chunks arrive as the model generates — typically within 1–2 seconds for the first token on video inputs.


Verification

Run this end-to-end test with a small local image:

import urllib.request

# Grab a public test image
urllib.request.urlretrieve(
    "https://upload.wikimedia.org/wikipedia/commons/thumb/4/47/PNG_transparency_demonstration_1.png/280px-PNG_transparency_demonstration_1.png",
    "test.png"
)

response = client.models.generate_content(
    model=MODEL,
    contents=[
        types.Content(parts=[
            types.Part(text="Describe this image in one sentence."),
            types.Part(
                inline_data=types.Blob(
                    mime_type="image/png",
                    data=open("test.png", "rb").read(),  # raw bytes; the SDK handles base64
                )
            ),
        ])
    ],
)

assert len(response.text) > 10, "Got empty response"
print("✅ Multimodal API working:", response.text)

You should see: ✅ Multimodal API working: [description of the image]


Token Cost Reference

Input type             Approx tokens    Notes
1 image (≤ 384px)      258              Larger images are tiled; 258 tokens per 768×768 crop
1 min audio            ~1,920           32 tokens per second, regardless of sample rate
1 min video (1fps)     ~17,400          258 tokens per frame plus the audio track

Gemini 2.0 Flash pricing as of March 2026: $0.075 / 1M input tokens. A 10-min video analysis ≈ 174k tokens ≈ ~$0.013 per call.
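
For budgeting, the arithmetic is one line. The price constant is the figure quoted above, so verify it against current pricing before depending on it; for ground truth on token counts, client.models.count_tokens accepts the same contents as generate_content.

```python
PRICE_PER_M_INPUT = 0.075   # $ per 1M input tokens, per the note above; verify before budgeting

def call_cost_usd(input_tokens: int, price_per_m: float = PRICE_PER_M_INPUT) -> float:
    """Approximate input cost of one call, in dollars."""
    return input_tokens / 1_000_000 * price_per_m

print(call_cost_usd(174_000))   # ~10 minutes of video
```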


What You Learned

  • Gemini 2.0 Flash accepts image, audio, and video in a single parts array — no multi-step pipeline needed
  • Use inline bytes for requests under 20MB total; use the File API for video and larger assets
  • The File API is async — always poll for ACTIVE state before sending the file URI in a prompt
  • Streaming (generate_content_stream) is essential for video inputs longer than ~2 minutes

Limitation: Gemini 2.0 doesn't support real-time audio streaming in the standard REST API — that's the Live API (separate endpoint, WebSocket-based). Use the approach above for pre-recorded media.

Tested on google-genai 1.7.0, Python 3.12, Gemini 2.0 Flash (gemini-2.0-flash)