Problem: Your AI Video Looks Great Frame-by-Frame but Falls Apart in Motion
You generated video with a diffusion model and individual frames look sharp. But play it back and objects shimmer, faces morph between frames, and backgrounds pulse randomly. This is temporal inconsistency — the model treats each frame as an independent image.
You'll learn:
- Why AI video models produce flickering and how to diagnose the type
- How to apply temporal smoothing at the generation stage
- How to fix existing footage in post with frame-blending and optical flow
Time: 20 min | Level: Intermediate
Why This Happens
Diffusion models sample from noise independently per frame unless explicitly constrained. Even video-native models (Wan2.1, CogVideoX, Mochi) can lose coherence on long sequences or complex motion because their temporal attention windows are finite.
Common symptoms:
- Objects "breathe" or pulse in size/color when stationary
- Skin tones shift hue between adjacent frames
- Background textures regenerate differently each frame
- Hair and fine details are the worst offenders — high-frequency detail is hardest to lock
Two distinct failure modes: generative flickering (happens during inference) and compression flickering (introduced by codec encoding). They look similar but need different fixes.
Solution
Step 1: Diagnose Which Type of Flickering You Have
Before touching any settings, run a quick analysis to separate generative from codec artifacts.
# Extract frames from your output video
ffmpeg -i output.mp4 -vf fps=24 frames/frame_%04d.png
# Compute per-frame difference — high values = generative flicker
python3 - <<'EOF'
import cv2, glob, numpy as np
frames = sorted(glob.glob("frames/*.png"))
diffs = []
for i in range(1, len(frames)):
    a = cv2.imread(frames[i-1]).astype(float)
    b = cv2.imread(frames[i]).astype(float)
    diff = np.mean(np.abs(a - b))
    diffs.append((i, diff))
    print(f"Frame {i}: mean diff = {diff:.2f}")
avg = np.mean([d for _, d in diffs])
print(f"\nAverage inter-frame diff: {avg:.2f}")
print("Likely generative flicker" if avg > 12 else "Likely codec artifact")
EOF
Expected: An average inter-frame diff above ~12 means you have generative flicker at the source. Below ~12 with visible banding usually points to codec compression. Each case gets its own fix below.
If it fails:
- ModuleNotFoundError: run pip install opencv-python numpy
- No frames extracted: check that your fps value matches the video's actual framerate
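The threshold logic is easy to sanity-check without any video. The sketch below fabricates two frame pairs in plain numpy (no OpenCV needed): one pair with heavy per-frame noise standing in for generative flicker, and one near-static pair. It applies the same mean-absolute-diff metric and the same ~12 cutoff as the script above; the synthetic frames themselves are purely illustrative.

```python
import numpy as np

def mean_frame_diff(a: np.ndarray, b: np.ndarray) -> float:
    """Same metric as the analysis script: mean absolute pixel difference."""
    return float(np.mean(np.abs(a.astype(float) - b.astype(float))))

rng = np.random.default_rng(0)
base = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)

# "Generative flicker": the next frame is re-noised heavily around the base image
flicker = np.clip(base.astype(int) + rng.integers(-40, 41, base.shape), 0, 255)

# "Codec artifact": the next frame differs only by tiny quantization-like jitter
stable = np.clip(base.astype(int) + rng.integers(-2, 3, base.shape), 0, 255)

for label, frame in [("flicker", flicker), ("stable", stable)]:
    d = mean_frame_diff(base, frame)
    verdict = "generative flicker" if d > 12 else "codec artifact"
    print(f"{label}: diff={d:.2f} -> {verdict}")
```

The uniform ±40 noise lands well above the cutoff, the ±2 jitter well below it, which is exactly the separation the diagnosis relies on.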
Step 2: Fix Generative Flickering at the Source
If you still have access to your generation pipeline, add temporal guidance before re-rendering.
For Wan2.1 / CogVideoX (Diffusers):
import torch
from diffusers import CogVideoXPipeline
pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b",
    torch_dtype=torch.bfloat16,
).to("cuda")
# Increase guidance and reduce steps slightly — faster sampling = less drift
prompt = "your prompt here"
video = pipe(
    prompt=prompt,
    num_frames=49,
    guidance_scale=7.5,      # Higher = more prompt adherence, less frame drift
    num_inference_steps=40,  # Don't go under 35; temporal coherence degrades
    generator=torch.Generator("cuda").manual_seed(42),  # Lock seed for reproducibility
).frames[0]
For Stable Video Diffusion (SVD):
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image
pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",
    torch_dtype=torch.float16,
)
pipe.to("cuda")
image = load_image("your_keyframe.png").resize((1024, 576))
frames = pipe(
    image,
    num_frames=25,
    motion_bucket_id=100,    # Lower = less motion = more temporal stability
    noise_aug_strength=0.02, # Critical: keep below 0.05 for stability
    decode_chunk_size=8,     # Decode in chunks to avoid memory-induced drift
).frames[0]
Why noise_aug_strength matters: Values above 0.05 inject enough noise that each frame re-samples independently, breaking temporal coherence. This is the single most common misconfiguration.
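A toy numpy illustration of that effect — not the diffusion process itself, just independent noise injected around a shared conditioning image — shows how the correlation between two "frames" falls as the injected noise grows. The array sizes, seed, and strength values are all illustrative:

```python
import numpy as np

rng = np.random.default_rng(7)
cond = rng.random((64, 64))  # stand-in for the shared conditioning keyframe

# Two frames derived from the same conditioning image, each with its own
# injected noise -- the analogue of noise_aug_strength re-noising per sample
for strength in (0.02, 0.05, 0.2):
    f1 = cond + strength * rng.standard_normal(cond.shape)
    f2 = cond + strength * rng.standard_normal(cond.shape)
    corr = np.corrcoef(f1.ravel(), f2.ravel())[0, 1]
    print(f"strength={strength}: inter-frame correlation = {corr:.3f}")
```

At 0.02 the frames stay almost perfectly correlated; at 0.2 the independent noise dominates and the frames decorrelate — the same qualitative story as temporal coherence breaking down at high noise_aug_strength.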
Step 3: Post-Process Existing Footage with Optical Flow
If you can't re-generate, use RIFE or FILM for optical-flow-guided temporal smoothing on existing footage. This warps frames toward their neighbors based on estimated motion, filling in the flicker.
# rife-ncnn-vulkan ships as a prebuilt binary — download a release from its
# GitHub page and put it on your PATH
# Apply 2x temporal interpolation (adds in-between frames, reduces perceived flicker)
rife-ncnn-vulkan -i frames/ -o rife_out/ -m rife-v4.6
# Recombine to video
ffmpeg -framerate 48 -i rife_out/frame_%08d.png \
-c:v libx264 -crf 18 -pix_fmt yuv420p \
output_stabilized.mp4
For more aggressive smoothing using FILM (Frame Interpolation for Large Motion):
# FILM via TensorFlow Hub — best for slow motion or heavy flickering
import tensorflow as tf
import tensorflow_hub as hub
model = hub.load("https://tfhub.dev/google/film/1")
# Load two adjacent frames as float32 tensors normalized to [0, 1]
frame_a = tf.cast(tf.image.decode_png(tf.io.read_file("frames/frame_0010.png")), tf.float32) / 255.0
frame_b = tf.cast(tf.image.decode_png(tf.io.read_file("frames/frame_0011.png")), tf.float32) / 255.0
frame_a = tf.expand_dims(frame_a, 0)  # Add batch dim
frame_b = tf.expand_dims(frame_b, 0)
# time=0.5 interpolates exactly halfway between frames
result = model({"x0": frame_a, "x1": frame_b, "time": tf.constant([[0.5]])})
interpolated = result["image"][0].numpy()
If it fails:
- RIFE GPU error: Fall back to CPU with the -g -1 flag
- FILM out-of-memory: Resize input frames to 1280 px width max before processing
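If neither RIFE nor FILM is available, plain frame-blending — the simpler of the two post-processing approaches mentioned at the start — still tames mild flicker. Below is a minimal exponential-moving-average blend in pure numpy; the alpha value and synthetic frames are illustrative, and in practice you would load and save the frames with OpenCV or PIL.

```python
import numpy as np

def ema_smooth(frames: np.ndarray, alpha: float = 0.6) -> np.ndarray:
    """Blend each frame with the running average of its predecessors.

    frames: (N, H, W, C) array of pixel values.
    alpha:  weight of the current frame; lower = stronger smoothing
            (and more motion blur, so keep it >= 0.5 for normal motion).
    """
    out = np.empty_like(frames, dtype=float)
    out[0] = frames[0]
    for i in range(1, len(frames)):
        out[i] = alpha * frames[i] + (1 - alpha) * out[i - 1]
    return out

# Synthetic example: a static scene with per-frame flicker noise
rng = np.random.default_rng(1)
scene = np.full((10, 32, 32, 3), 128.0)
noisy = scene + rng.normal(0, 15, scene.shape)
smoothed = ema_smooth(noisy)

raw_flicker = np.mean(np.abs(np.diff(noisy, axis=0)))
ema_flicker = np.mean(np.abs(np.diff(smoothed, axis=0)))
print(f"inter-frame diff: raw={raw_flicker:.2f}, smoothed={ema_flicker:.2f}")
```

Unlike optical-flow methods, this blend is motion-blind: it will smear anything that actually moves, which is why it only suits stationary-object flicker.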
Step 4: Fix Codec Flickering with a Deband + Deflicker Pass
If your analysis showed low inter-frame diffs but visible banding, the encoder introduced it. Fix in FFmpeg without re-generating anything.
# deflicker filter averages luminance across neighboring frames
# deband removes compression banding in flat areas
ffmpeg -i output.mp4 \
-vf "deflicker=size=5:mode=am,deband=1thr=0.02:2thr=0.02:3thr=0.02:blur=false" \
-c:v libx264 -crf 16 -preset slow \
output_deband.mp4
Parameter guide:
- deflicker=size=5 — averages over 5 frames; increase to 9 for heavy flicker (adds motion blur)
- deband=blur=false — keeps sharp edges; set blur=true if you prefer smoother gradients
- crf 16 — high-quality re-encode; don't go above 20 or you re-introduce compression artifacts
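Conceptually, deflicker does not blend pixels between frames: it corrects each frame's overall brightness toward the average of a sliding window, so spatial detail is untouched. A simplified numpy sketch of that idea (the window handling and per-frame gain correction here are illustrative; the real filter works on luma and offers several averaging modes):

```python
import numpy as np

def deflicker_am(frames: np.ndarray, size: int = 5) -> np.ndarray:
    """Scale each frame so its mean brightness matches a sliding-window average.

    Mimics the spirit of FFmpeg's deflicker mode=am: brightness is stabilized
    globally per frame, so no motion blur is introduced.
    """
    means = frames.reshape(len(frames), -1).mean(axis=1)
    out = np.empty_like(frames, dtype=float)
    for i in range(len(frames)):
        lo = max(0, i - size // 2)
        hi = min(len(frames), i + size // 2 + 1)
        target = means[lo:hi].mean()           # window-average brightness
        gain = target / max(means[i], 1e-6)    # per-frame correction factor
        out[i] = np.clip(frames[i] * gain, 0, 255)
    return out

# Frames whose global brightness pulses frame to frame
pulse = np.array([100, 130, 95, 125, 105], dtype=float)
frames = pulse[:, None, None] * np.ones((5, 8, 8))
fixed = deflicker_am(frames)
print("brightness before:", frames.mean(axis=(1, 2)).round(1))
print("brightness after: ", fixed.mean(axis=(1, 2)).round(1))
```

This is also why deflicker fixes luminance pulsing but not texture regeneration: a global gain can't repair detail that differs spatially between frames.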
Verification
# Extract frames from the processed output (48 fps after 2x RIFE interpolation),
# then re-run the Step 1 diff script on them
ffmpeg -i output_stabilized.mp4 -vf fps=48 stabilized_frames/frame_%04d.png
# Visual sanity check — compare side by side at 1/4 speed
ffmpeg -i output.mp4 -i output_stabilized.mp4 \
-filter_complex "[0:v]setpts=4*PTS[a];[1:v]setpts=4*PTS[b];[a][b]hstack" \
comparison.mp4
You should see: Average inter-frame diff dropped by 30-60% on generative flicker. Codec banding should be gone in the deband output. The side-by-side will make improvement obvious.
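To put a number on "dropped by 30-60%", compare the two averages from the Step 1 script directly. The values below are made up for illustration:

```python
def diff_reduction(before: float, after: float) -> float:
    """Percent reduction in average inter-frame diff after processing."""
    return 100.0 * (before - after) / before

# Hypothetical averages from the Step 1 script, before and after stabilization
reduction = diff_reduction(before=18.4, after=9.7)
print(f"inter-frame diff reduced by {reduction:.1f}%")  # -> 47.3%
```

Anything inside the 30-60% band suggests the smoothing worked; a reduction near zero means the flicker was probably semantic rather than pixel-level (see the limitations below).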
What You Learned
- Inter-frame pixel diff analysis tells you whether flicker is generative or codec-level
- noise_aug_strength in SVD is the most common cause of bad temporal consistency — keep it under 0.05
- RIFE adds in-between frames using optical flow, reducing perceived flicker without re-generating
- FFmpeg's deflicker + deband filters are a fast, free fix for codec-introduced artifacts
Limitations:
- RIFE can introduce ghosting on fast motion — check output carefully
- FILM requires GPU; CPU fallback is very slow on full HD footage
- Temporal smoothing trades some sharpness for stability — it won't help if the model is producing fundamentally incoherent content (wrong object counts between frames, etc.)
When NOT to use this: If your generations have semantic errors between frames (a hand disappears entirely, a character's face identity shifts), post-processing won't fix it. You need to adjust prompts, reduce sequence length, or switch to a model with stronger temporal attention like Wan2.1-T2V-14B.
Tested on Wan2.1, CogVideoX-5b, SVD-XT. Python 3.11, PyTorch 2.5, FFmpeg 7.0, Ubuntu 24.04.