Control Character Poses in AI Video Generation with ControlNet

Use ControlNet for video to lock character poses, fix drift, and produce consistent motion across frames in AI-generated video.

Problem: Your AI Video Characters Won't Hold a Pose

You're generating AI video and your characters drift, flail, or morph between frames. The prompt says "standing confidently" but the output looks like a slow-motion collapse. ControlNet fixes this — but video adds complexity that image workflows don't.

You'll learn:

  • How ControlNet pose conditioning works in video pipelines
  • How to extract and apply pose sequences from reference video
  • How to avoid the most common failure modes (temporal flicker, limb swap, scale mismatch)

Time: 25 min | Level: Intermediate


Why This Happens

Diffusion models generate each frame with some independence. Without explicit structural guidance, the model treats pose as a soft suggestion from the text prompt — and prompt adherence degrades over time, especially past 2–3 seconds.

ControlNet solves this by injecting spatial conditioning directly into the U-Net at inference time. For video, you apply a ControlNet frame-by-frame using a pose sequence extracted from a reference clip or manually keyframed.

Common symptoms without ControlNet:

  • Character limbs shift position between frames even in short clips
  • Hands and feet are worst — they drift or disappear entirely
  • Prompt changes (e.g., "raise left arm") apply inconsistently across frames

Solution

Step 1: Choose Your ControlNet Mode

Three modes are used for video pose control. Pick based on your source material:

Mode            Use When
openpose        You have a reference video with a human performer
openpose_full   You need hand and face landmarks (slower)
dwpose          Higher accuracy on complex or fast motion

For most production work, start with dwpose — it handles partial occlusion better than original OpenPose.

Install dependencies:

pip install controlnet-aux diffusers accelerate torch torchvision
# DWPose in controlnet-aux runs on ONNX models, so it needs an ONNX runtime:
pip install onnxruntime-gpu

Step 2: Extract a Pose Sequence from Reference Video

import cv2
import numpy as np
from controlnet_aux import DWposeDetector
from pathlib import Path

detector = DWposeDetector()
detector = detector.to("cuda")

def extract_pose_sequence(video_path: str, output_dir: str) -> list[np.ndarray]:
    """
    Extract per-frame pose maps from a reference video.
    Returns list of pose map arrays (H, W, 3).
    """
    cap = cv2.VideoCapture(video_path)
    pose_frames = []
    output_path = Path(output_dir)
    output_path.mkdir(parents=True, exist_ok=True)

    frame_idx = 0
    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            break

        # BGR → RGB for detector
        frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        
        # Returns PIL Image of pose skeleton
        pose_image = detector(frame_rgb)
        pose_array = np.array(pose_image)
        pose_frames.append(pose_array)

        # Save for inspection
        cv2.imwrite(str(output_path / f"pose_{frame_idx:04d}.png"),
                    cv2.cvtColor(pose_array, cv2.COLOR_RGB2BGR))
        frame_idx += 1

    cap.release()
    print(f"Extracted {len(pose_frames)} pose frames to {output_dir}")
    return pose_frames

pose_sequence = extract_pose_sequence("reference_walk.mp4", "poses/")

Expected: A poses/ directory with one PNG per frame showing the skeleton overlay — colored lines on black background.
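Before moving on, you can sanity-check the extracted maps numerically. A valid DWPose map is mostly black, so the fraction of non-black pixels is typically in the low single-digit percent range. A minimal sketch (skeleton_coverage is a hypothetical helper, not part of controlnet-aux):

```python
import numpy as np

def skeleton_coverage(pose_map: np.ndarray) -> float:
    """Fraction of non-black pixels in an (H, W, 3) pose map."""
    total = pose_map.shape[0] * pose_map.shape[1]
    return np.count_nonzero(pose_map.sum(axis=2)) / float(total)

# A coverage of exactly 0.0 means the detector found no person at all.
```

Near-zero coverage on many frames is an early signal of the "No person detected" failure below.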

If it fails:

  • "No person detected": Check resolution — DWPose needs at least 256px on the short axis
  • CUDA OOM: Process in batches of 30 frames, or use detector.to("cpu")
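The batching workaround can be sketched as a plain index splitter (batch_indices is a hypothetical helper; feed each range of frames through the detector, then let the batch's activations be freed before the next one):

```python
def batch_indices(n_frames: int, batch_size: int = 30) -> list[range]:
    """Return contiguous index ranges covering frames 0..n_frames-1."""
    return [range(start, min(start + batch_size, n_frames))
            for start in range(0, n_frames, batch_size)]

# Example: 65 frames in batches of 30 -> three ranges, last one partial
batches = batch_indices(65, 30)
```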

Step 3: Run Controlled Video Generation

import torch
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
from PIL import Image

# Load ControlNet with OpenPose weights
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-openpose",
    torch_dtype=torch.float16
)

pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16
).to("cuda")

# Consistent generation across frames requires a fixed seed
generator = torch.Generator(device="cuda").manual_seed(42)

prompt = "a person walking through a forest, cinematic lighting, photorealistic"
negative_prompt = "blurry, deformed, extra limbs, low quality"

import os
os.makedirs("output", exist_ok=True)

output_frames = []
for i, pose_array in enumerate(pose_sequence):
    pose_image = Image.fromarray(pose_array)

    # Re-seed every frame: the pipeline consumes generator state, so seeding
    # once before the loop would give each frame different starting noise
    generator = torch.Generator(device="cuda").manual_seed(42)

    result = pipe(
        prompt=prompt,
        negative_prompt=negative_prompt,
        image=pose_image,
        num_inference_steps=20,
        guidance_scale=7.5,
        controlnet_conditioning_scale=1.0,  # 0.8–1.2 is the useful range
        generator=generator,  # fixed per-frame seed = stable appearance
    )

    frame = result.images[0]
    output_frames.append(frame)
    frame.save(f"output/frame_{i:04d}.png")
    print(f"Frame {i + 1}/{len(pose_sequence)}")

print("Generation complete.")

If characters still drift between frames:

  • Increase controlnet_conditioning_scale to 1.2–1.5 (trades creativity for rigidity)
  • Use IP-Adapter alongside ControlNet to lock character appearance separately from pose
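A related lever is the seed itself: re-seeding the generator before each frame reproduces identical starting latents, which is what keeps appearance stable. The property is easy to verify in isolation; in this sketch NumPy stands in for the CUDA torch.Generator, but the principle is the same:

```python
import numpy as np

def frame_noise(seed: int, shape=(4, 64, 64)) -> np.ndarray:
    """Fresh generator per frame: same seed always yields the same noise."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal(shape)

# Two "frames" seeded identically start from identical noise;
# reusing one generator across calls would not.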

Step 4: Assemble Frames into Video

import cv2
from pathlib import Path

def frames_to_video(frames_dir: str, output_path: str, fps: int = 24):
    frames = sorted(Path(frames_dir).glob("frame_*.png"))
    if not frames:
        raise ValueError(f"No frames found in {frames_dir}")

    sample = cv2.imread(str(frames[0]))
    h, w = sample.shape[:2]

    fourcc = cv2.VideoWriter_fourcc(*"mp4v")
    writer = cv2.VideoWriter(output_path, fourcc, fps, (w, h))

    for frame_path in frames:
        frame = cv2.imread(str(frame_path))
        writer.write(frame)

    writer.release()
    print(f"Video saved: {output_path} ({len(frames)} frames @ {fps}fps)")

frames_to_video("output/", "controlled_output.mp4", fps=24)

Expected: A playable .mp4 where the character holds the extracted pose throughout.


Verification

# Quick check: compare the first and last frame side by side
python - <<'EOF'
from pathlib import Path
from PIL import Image

frames = sorted(Path("output").glob("frame_*.png"))
first = Image.open(frames[0]).resize((512, 512))
last = Image.open(frames[-1]).resize((512, 512))

combined = Image.new("RGB", (1024, 512))
combined.paste(first, (0, 0))
combined.paste(last, (512, 0))
combined.save("comparison.png")
EOF

You should see: Character in similar pose at frame 0 and final frame. Major body landmarks (shoulders, hips, knees) should be within a few pixels of the pose map.


Handling Temporal Flicker

Frame-by-frame generation produces flicker — each frame is slightly different even with the same seed. Two practical fixes:
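Flicker is worth quantifying before you pick a fix. A rough score (a hypothetical metric, not a standard one) is the mean absolute pixel difference between consecutive frames; higher means more flicker:

```python
import numpy as np

def flicker_score(frames: list[np.ndarray]) -> float:
    """Mean absolute pixel difference between consecutive (H, W, 3) frames."""
    diffs = [np.mean(np.abs(a.astype(np.int16) - b.astype(np.int16)))
             for a, b in zip(frames, frames[1:])]
    return float(np.mean(diffs))
```

Comparing the score before and after a fix tells you whether interpolation actually helped or just blurred the motion.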

Option A: Frame interpolation (fast)

Run a frame-interpolation model such as RIFE or FILM over the rendered frames to smooth frame-to-frame differences (RIFE generally gives better quality on this kind of footage). Both are distributed as standalone repositories and prebuilt binaries, e.g. the rife-ncnn-vulkan releases, rather than as simple pip installs.

Option B: Use a video-native pipeline (better)

AnimateDiff and CogVideoX have built-in temporal attention that reduces flicker. Both support ControlNet conditioning:

# AnimateDiff with ControlNet
from diffusers import AnimateDiffControlNetPipeline, MotionAdapter

# The motion adapter loads separately and wraps an SD 1.5 base model
adapter = MotionAdapter.from_pretrained(
    "guoyww/animatediff-motion-adapter-v1-5-3",
    torch_dtype=torch.float16
)

pipe = AnimateDiffControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    motion_adapter=adapter,
    controlnet=controlnet,
    torch_dtype=torch.float16
).to("cuda")

# Pass all pose frames at once — temporal attention handles consistency
video = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    conditioning_frames=[Image.fromarray(p) for p in pose_sequence],
    num_frames=len(pose_sequence),
    generator=generator,
).frames[0]

AnimateDiff processes the full clip in a single forward pass, so temporal consistency is handled inside the model rather than bolted on after.


What You Learned

  • ControlNet injects pose structure frame-by-frame; it doesn't guarantee temporal consistency on its own
  • dwpose outperforms classic OpenPose on partial occlusion and fast motion
  • Fixed seed + high controlnet_conditioning_scale is the simplest path to stable appearance
  • For production quality, use AnimateDiff or CogVideoX instead of single-frame SD pipelines

Limitation: ControlNet pose maps encode skeleton position, not mesh or depth. Camera angle changes will still cause visible artifacts — use depth ControlNet in parallel if the camera moves.

When NOT to use this: If your reference performer and output character have very different body proportions, the skeleton transfer will look wrong. Consider using a reference performer closer in build, or use 3D rigging (Blender + ControlNet) instead.
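When proportions differ only mildly, a cheap mitigation is to rescale the extracted keypoints before rendering the pose map, assuming you have access to raw (x, y) keypoints rather than only the rendered image. A minimal sketch (rescale_skeleton is hypothetical; indices 11 and 12 assume a COCO-style layout for left/right hip, so adjust for your detector's format):

```python
import numpy as np

def rescale_skeleton(keypoints: np.ndarray, scale: float,
                     hip_idx: tuple = (11, 12)) -> np.ndarray:
    """Uniformly rescale (N, 2) keypoints about the hip midpoint."""
    center = keypoints[list(hip_idx)].mean(axis=0)
    return center + (keypoints - center) * scale
```

Scaling about the hips keeps the character anchored in place while shrinking or stretching limb lengths toward the target build.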


Tested on PyTorch 2.3, diffusers 0.27, CUDA 12.1 — Ubuntu 22.04 and Windows 11 WSL2