Problem: Your AI Video Looks Great Frame-by-Frame but Falls Apart in Motion
You generated video with a diffusion model and individual frames look sharp. But play it back and objects shimmer, faces morph between frames, and backgrounds pulse randomly. This is temporal inconsistency — the model treats each frame as an independent image.
You'll learn:
- Why AI video models produce flickering and how to diagnose the type
- How to apply temporal smoothing at the generation stage
- How to fix existing footage in post with frame-blending and optical flow
Time: 20 min | Level: Intermediate
Why This Happens
Diffusion models sample from noise independently per frame unless explicitly constrained. Even video-native models (Wan2.1, CogVideoX, Mochi) can lose coherence on long sequences or complex motion because their temporal attention windows are finite.
Common symptoms:
- Objects "breathe" or pulse in size/color when stationary
- Skin tones shift hue between adjacent frames
- Background textures regenerate differently each frame
- Hair and fine details are the worst offenders — high-frequency detail is hardest to lock
Two distinct failure modes: generative flickering (happens during inference) and compression flickering (introduced by codec encoding). They look similar but need different fixes.
Solution
Step 1: Diagnose Which Type of Flickering You Have
Before touching any settings, run a quick analysis to separate generative from codec artifacts.
# Extract frames from your output video
ffmpeg -i output.mp4 -vf fps=24 frames/frame_%04d.png
# Compute per-frame difference — high values = generative flicker
python3 - <<'EOF'
import cv2, glob, numpy as np
frames = sorted(glob.glob("frames/*.png"))
diffs = []
for i in range(1, len(frames)):
    a = cv2.imread(frames[i-1]).astype(float)
    b = cv2.imread(frames[i]).astype(float)
    diff = np.mean(np.abs(a - b))
    diffs.append((i, diff))
    print(f"Frame {i}: mean diff = {diff:.2f}")
avg = np.mean([d for _, d in diffs])
print(f"\nAverage inter-frame diff: {avg:.2f}")
print("Likely generative flicker" if avg > 12 else "Likely codec artifact")
EOF
Expected: An average inter-frame diff above ~12 means you have generative flicker at the source. Below ~12 with visible banding usually points to codec compression. Each case gets its own fix below.
If it fails:
- ModuleNotFoundError: run pip install opencv-python numpy
- No frames extracted: check that your fps value matches the video's actual framerate
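The threshold logic is easy to sanity-check without any video. The sketch below fabricates two frame pairs in plain numpy (no OpenCV needed): one pair with heavy per-frame noise standing in for generative flicker, and one near-static pair. It applies the same mean-absolute-diff metric and the same ~12 cutoff as the script above; the synthetic frames themselves are purely illustrative.

```python
import numpy as np

def mean_frame_diff(a: np.ndarray, b: np.ndarray) -> float:
    """Same metric as the analysis script: mean absolute pixel difference."""
    return float(np.mean(np.abs(a.astype(float) - b.astype(float))))

rng = np.random.default_rng(0)
base = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)

# "Generative flicker": the next frame is re-noised heavily around the base image
flicker = np.clip(base.astype(int) + rng.integers(-40, 41, base.shape), 0, 255)

# "Codec artifact": the next frame differs only by tiny quantization-like jitter
stable = np.clip(base.astype(int) + rng.integers(-2, 3, base.shape), 0, 255)

for label, frame in [("flicker", flicker), ("stable", stable)]:
    d = mean_frame_diff(base, frame)
    verdict = "generative flicker" if d > 12 else "codec artifact"
    print(f"{label}: diff={d:.2f} -> {verdict}")
```

The uniform ±40 noise lands well above the cutoff, the ±2 jitter well below it, which is exactly the separation the diagnosis relies on.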
Step 2: Fix Generative Flickering at the Source
If you still have access to your generation pipeline, add temporal guidance before re-rendering.
For Wan2.1 / CogVideoX (Diffusers):
import torch
from diffusers import CogVideoXPipeline
pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b",
    torch_dtype=torch.bfloat16,
).to("cuda")
# Increase guidance and reduce steps slightly — faster sampling = less drift
prompt = "your prompt here"
video = pipe(
    prompt=prompt,
    num_frames=49,
    guidance_scale=7.5,      # Higher = more prompt adherence, less frame drift
    num_inference_steps=40,  # Don't go under 35; temporal coherence degrades
    generator=torch.Generator("cuda").manual_seed(42),  # Lock seed for reproducibility
).frames[0]
For Stable Video Diffusion (SVD):
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image
pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",
    torch_dtype=torch.float16,
)
pipe.to("cuda")
image = load_image("your_keyframe.png").resize((1024, 576))
frames = pipe(
    image,
    num_frames=25,
    motion_bucket_id=100,    # Lower = less motion = more temporal stability
    noise_aug_strength=0.02, # Critical: keep below 0.05 for stability
    decode_chunk_size=8,     # Decode in chunks to avoid memory-induced drift
).frames[0]
Why noise_aug_strength matters: Values above 0.05 inject enough noise that each frame re-samples independently, breaking temporal coherence. This is the single most common misconfiguration.
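A toy numpy illustration of that effect — not the diffusion process itself, just independent noise injected around a shared conditioning image — shows how the correlation between two "frames" falls as the injected noise grows. The array sizes, seed, and strength values are all illustrative:

```python
import numpy as np

rng = np.random.default_rng(7)
cond = rng.random((64, 64))  # stand-in for the shared conditioning keyframe

# Two frames derived from the same conditioning image, each with its own
# injected noise -- the analogue of noise_aug_strength re-noising per sample
for strength in (0.02, 0.05, 0.2):
    f1 = cond + strength * rng.standard_normal(cond.shape)
    f2 = cond + strength * rng.standard_normal(cond.shape)
    corr = np.corrcoef(f1.ravel(), f2.ravel())[0, 1]
    print(f"strength={strength}: inter-frame correlation = {corr:.3f}")
```

At 0.02 the frames stay almost perfectly correlated; at 0.2 the independent noise dominates and the frames decorrelate — the same qualitative story as temporal coherence breaking down at high noise_aug_strength.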
Step 3: Post-Process Existing Footage with Optical Flow
If you can't re-generate, use RIFE or FILM for optical-flow-guided temporal smoothing on existing footage. This warps frames toward their neighbors based on estimated motion, filling in the flicker.
# rife-ncnn-vulkan ships as a prebuilt binary — download a release from its
# GitHub page and put it on your PATH
# Apply 2x temporal interpolation (adds in-between frames, reduces perceived flicker)
rife-ncnn-vulkan -i frames/ -o rife_out/ -m rife-v4.6
# Recombine to video
ffmpeg -framerate 48 -i rife_out/frame_%08d.png \
-c:v libx264 -crf 18 -pix_fmt yuv420p \
output_stabilized.mp4
For more aggressive smoothing using FILM (Frame Interpolation for Large Motion):
# FILM via TensorFlow Hub — best for slow motion or heavy flickering
import tensorflow as tf
import tensorflow_hub as hub
model = hub.load("https://tfhub.dev/google/film/1")
# Load two adjacent frames as float32 tensors normalized to [0, 1]
frame_a = tf.cast(tf.image.decode_png(tf.io.read_file("frames/frame_0010.png")), tf.float32) / 255.0
frame_b = tf.cast(tf.image.decode_png(tf.io.read_file("frames/frame_0011.png")), tf.float32) / 255.0
frame_a = tf.expand_dims(frame_a, 0)  # Add batch dim
frame_b = tf.expand_dims(frame_b, 0)
# time=0.5 interpolates exactly halfway between frames
result = model({"x0": frame_a, "x1": frame_b, "time": tf.constant([[0.5]])})
interpolated = result["image"][0].numpy()
If it fails:
- RIFE GPU error: Fall back to CPU with the -g -1 flag
- FILM out-of-memory: Resize input frames to 1280 px width max before processing
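If neither RIFE nor FILM is available, plain frame-blending — the simpler of the two post-processing approaches mentioned at the start — still tames mild flicker. Below is a minimal exponential-moving-average blend in pure numpy; the alpha value and synthetic frames are illustrative, and in practice you would load and save the frames with OpenCV or PIL.

```python
import numpy as np

def ema_smooth(frames: np.ndarray, alpha: float = 0.6) -> np.ndarray:
    """Blend each frame with the running average of its predecessors.

    frames: (N, H, W, C) array of pixel values.
    alpha:  weight of the current frame; lower = stronger smoothing
            (and more motion blur, so keep it >= 0.5 for normal motion).
    """
    out = np.empty_like(frames, dtype=float)
    out[0] = frames[0]
    for i in range(1, len(frames)):
        out[i] = alpha * frames[i] + (1 - alpha) * out[i - 1]
    return out

# Synthetic example: a static scene with per-frame flicker noise
rng = np.random.default_rng(1)
scene = np.full((10, 32, 32, 3), 128.0)
noisy = scene + rng.normal(0, 15, scene.shape)
smoothed = ema_smooth(noisy)

raw_flicker = np.mean(np.abs(np.diff(noisy, axis=0)))
ema_flicker = np.mean(np.abs(np.diff(smoothed, axis=0)))
print(f"inter-frame diff: raw={raw_flicker:.2f}, smoothed={ema_flicker:.2f}")
```

Unlike optical-flow methods, this blend is motion-blind: it will smear anything that actually moves, which is why it only suits stationary-object flicker.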
Step 4: Fix Codec Flickering with a Deband + Deflicker Pass
If your analysis showed low inter-frame diffs but visible banding, the encoder introduced it. Fix in FFmpeg without re-generating anything.
# deflicker filter averages luminance across neighboring frames
# deband removes compression banding in flat areas
ffmpeg -i output.mp4 \
-vf "deflicker=size=5:mode=am,deband=1thr=0.02:2thr=0.02:3thr=0.02:blur=false" \
-c:v libx264 -crf 16 -preset slow \
output_deband.mp4
Parameter guide:
- deflicker=size=5 — averages over 5 frames; increase to 9 for heavy flicker (adds motion blur)
- deband=blur=false — keeps sharp edges; set blur=true if you prefer smoother gradients
- crf 16 — high-quality re-encode; don't go above 20 or you re-introduce compression artifacts
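Conceptually, deflicker does not blend pixels between frames: it corrects each frame's overall brightness toward the average of a sliding window, so spatial detail is untouched. A simplified numpy sketch of that idea (the window handling and per-frame gain correction here are illustrative; the real filter works on luma and offers several averaging modes):

```python
import numpy as np

def deflicker_am(frames: np.ndarray, size: int = 5) -> np.ndarray:
    """Scale each frame so its mean brightness matches a sliding-window average.

    Mimics the spirit of FFmpeg's deflicker mode=am: brightness is stabilized
    globally per frame, so no motion blur is introduced.
    """
    means = frames.reshape(len(frames), -1).mean(axis=1)
    out = np.empty_like(frames, dtype=float)
    for i in range(len(frames)):
        lo = max(0, i - size // 2)
        hi = min(len(frames), i + size // 2 + 1)
        target = means[lo:hi].mean()           # window-average brightness
        gain = target / max(means[i], 1e-6)    # per-frame correction factor
        out[i] = np.clip(frames[i] * gain, 0, 255)
    return out

# Frames whose global brightness pulses frame to frame
pulse = np.array([100, 130, 95, 125, 105], dtype=float)
frames = pulse[:, None, None] * np.ones((5, 8, 8))
fixed = deflicker_am(frames)
print("brightness before:", frames.mean(axis=(1, 2)).round(1))
print("brightness after: ", fixed.mean(axis=(1, 2)).round(1))
```

This is also why deflicker fixes luminance pulsing but not texture regeneration: a global gain can't repair detail that differs spatially between frames.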
Verification
# Extract frames from the processed output (48 fps after 2x RIFE interpolation),
# then re-run the Step 1 diff script on them
ffmpeg -i output_stabilized.mp4 -vf fps=48 stabilized_frames/frame_%04d.png
# Visual sanity check — compare side by side at 1/4 speed
ffmpeg -i output.mp4 -i output_stabilized.mp4 \
-filter_complex "[0:v]setpts=4*PTS[a];[1:v]setpts=4*PTS[b];[a][b]hstack" \
comparison.mp4
You should see: Average inter-frame diff dropped by 30-60% on generative flicker. Codec banding should be gone in the deband output. The side-by-side will make improvement obvious.
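To put a number on "dropped by 30-60%", compare the two averages from the Step 1 script directly. The values below are made up for illustration:

```python
def diff_reduction(before: float, after: float) -> float:
    """Percent reduction in average inter-frame diff after processing."""
    return 100.0 * (before - after) / before

# Hypothetical averages from the Step 1 script, before and after stabilization
reduction = diff_reduction(before=18.4, after=9.7)
print(f"inter-frame diff reduced by {reduction:.1f}%")  # -> 47.3%
```

Anything inside the 30-60% band suggests the smoothing worked; a reduction near zero means the flicker was probably semantic rather than pixel-level (see the limitations below).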
What You Learned
- Inter-frame pixel diff analysis tells you whether flicker is generative or codec-level
- noise_aug_strength in SVD is the most common cause of bad temporal consistency — keep it under 0.05
- RIFE adds in-between frames using optical flow, reducing perceived flicker without re-generating
- FFmpeg's deflicker + deband filters are a fast, free fix for codec-introduced artifacts
Limitations:
- RIFE can introduce ghosting on fast motion — check output carefully
- FILM requires GPU; CPU fallback is very slow on full HD footage
- Temporal smoothing trades some sharpness for stability — it won't help if the model is producing fundamentally incoherent content (wrong object counts between frames, etc.)
When NOT to use this: If your generations have semantic errors between frames (a hand disappears entirely, a character's face identity shifts), post-processing won't fix it. You need to adjust prompts, reduce sequence length, or switch to a model with stronger temporal attention like Wan2.1-T2V-14B.
Tested on Wan2.1, CogVideoX-5b, SVD-XT. Python 3.11, PyTorch 2.5, FFmpeg 7.0, Ubuntu 24.04.