Control Character Poses in AI Video Generation with ControlNet

Use ControlNet for video to lock character poses, fix drift, and produce consistent motion across frames in AI-generated video.

Problem: Your AI Video Characters Won't Hold a Pose

You're generating AI video and your characters drift, flail, or morph between frames. The prompt says "standing confidently" but the output looks like a slow-motion collapse. ControlNet fixes this — but video adds complexity that image workflows don't.

You'll learn:

  • How ControlNet pose conditioning works in video pipelines
  • How to extract and apply pose sequences from reference video
  • How to avoid the most common failure modes (temporal flicker, limb swap, scale mismatch)

Time: 25 min | Level: Intermediate


Why This Happens

Diffusion models generate each frame with some independence. Without explicit structural guidance, the model treats pose as a soft suggestion from the text prompt — and prompt adherence degrades over time, especially past 2–3 seconds.

ControlNet solves this by injecting spatial conditioning directly into the U-Net at inference time. For video, you apply a ControlNet frame-by-frame using a pose sequence extracted from a reference clip or manually keyframed.

Common symptoms without ControlNet:

  • Character limbs shift position between frames even in short clips
  • Hands and feet are worst — they drift or disappear entirely
  • Prompt changes (e.g., "raise left arm") apply inconsistently across frames

Solution

Step 1: Choose Your ControlNet Mode

Three modes are used for video pose control. Pick based on your source material:

Mode            Use When
openpose        You have a reference video with a human performer
openpose_full   You need hand and face landmarks (slower)
dwpose          Higher accuracy on complex or fast motion

For most production work, start with dwpose — it handles partial occlusion better than original OpenPose.

Install dependencies:

pip install controlnet-aux diffusers accelerate torch torchvision
# DWPose in controlnet-aux runs on ONNX models, so it needs an ONNX runtime:
pip install onnxruntime-gpu

Step 2: Extract a Pose Sequence from Reference Video

import cv2
import numpy as np
from controlnet_aux import DWposeDetector
from pathlib import Path

detector = DWposeDetector()
detector = detector.to("cuda")

def extract_pose_sequence(video_path: str, output_dir: str) -> list[np.ndarray]:
    """
    Extract per-frame pose maps from a reference video.
    Returns list of pose map arrays (H, W, 3).
    """
    cap = cv2.VideoCapture(video_path)
    pose_frames = []
    output_path = Path(output_dir)
    output_path.mkdir(parents=True, exist_ok=True)

    frame_idx = 0
    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            break

        # BGR → RGB for detector
        frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        
        # Returns PIL Image of pose skeleton
        pose_image = detector(frame_rgb)
        pose_array = np.array(pose_image)
        pose_frames.append(pose_array)

        # Save for inspection
        cv2.imwrite(str(output_path / f"pose_{frame_idx:04d}.png"),
                    cv2.cvtColor(pose_array, cv2.COLOR_RGB2BGR))
        frame_idx += 1

    cap.release()
    print(f"Extracted {len(pose_frames)} pose frames to {output_dir}")
    return pose_frames

pose_sequence = extract_pose_sequence("reference_walk.mp4", "poses/")

Expected: A poses/ directory with one PNG per frame showing the skeleton overlay — colored lines on black background.
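Before moving on, you can sanity-check the extracted maps numerically. A valid DWPose map is mostly black, so the fraction of non-black pixels is typically in the low single-digit percent range. A minimal sketch (skeleton_coverage is a hypothetical helper, not part of controlnet-aux):

```python
import numpy as np

def skeleton_coverage(pose_map: np.ndarray) -> float:
    """Fraction of non-black pixels in an (H, W, 3) pose map."""
    total = pose_map.shape[0] * pose_map.shape[1]
    return np.count_nonzero(pose_map.sum(axis=2)) / float(total)

# A coverage of exactly 0.0 means the detector found no person at all.
```

Near-zero coverage on many frames is an early signal of the "No person detected" failure below.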

If it fails:

  • "No person detected": Check resolution — DWPose needs at least 256px on the short axis
  • CUDA OOM: Process in batches of 30 frames, or use detector.to("cpu")
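The batching workaround can be sketched as a plain index splitter (batch_indices is a hypothetical helper; feed each range of frames through the detector, then let the batch's activations be freed before the next one):

```python
def batch_indices(n_frames: int, batch_size: int = 30) -> list[range]:
    """Return contiguous index ranges covering frames 0..n_frames-1."""
    return [range(start, min(start + batch_size, n_frames))
            for start in range(0, n_frames, batch_size)]

# Example: 65 frames in batches of 30 -> three ranges, last one partial
batches = batch_indices(65, 30)
```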

Step 3: Run Controlled Video Generation

import torch
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
from PIL import Image

# Load ControlNet with OpenPose weights
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-openpose",
    torch_dtype=torch.float16
)

pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16
).to("cuda")

# Consistent generation across frames requires a fixed seed
generator = torch.Generator(device="cuda").manual_seed(42)

prompt = "a person walking through a forest, cinematic lighting, photorealistic"
negative_prompt = "blurry, deformed, extra limbs, low quality"

import os
os.makedirs("output", exist_ok=True)

output_frames = []
for i, pose_array in enumerate(pose_sequence):
    pose_image = Image.fromarray(pose_array)

    # Re-seed every frame: the pipeline consumes generator state, so seeding
    # once before the loop would give each frame different starting noise
    generator = torch.Generator(device="cuda").manual_seed(42)

    result = pipe(
        prompt=prompt,
        negative_prompt=negative_prompt,
        image=pose_image,
        num_inference_steps=20,
        guidance_scale=7.5,
        controlnet_conditioning_scale=1.0,  # 0.8–1.2 is the useful range
        generator=generator,  # fixed per-frame seed = stable appearance
    )

    frame = result.images[0]
    output_frames.append(frame)
    frame.save(f"output/frame_{i:04d}.png")
    print(f"Frame {i + 1}/{len(pose_sequence)}")

print("Generation complete.")

If characters still drift between frames:

  • Increase controlnet_conditioning_scale to 1.2–1.5 (trades creativity for rigidity)
  • Use IP-Adapter alongside ControlNet to lock character appearance separately from pose
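A related lever is the seed itself: re-seeding the generator before each frame reproduces identical starting latents, which is what keeps appearance stable. The property is easy to verify in isolation; in this sketch NumPy stands in for the CUDA torch.Generator, but the principle is the same:

```python
import numpy as np

def frame_noise(seed: int, shape=(4, 64, 64)) -> np.ndarray:
    """Fresh generator per frame: same seed always yields the same noise."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal(shape)

# Two "frames" seeded identically start from identical noise;
# reusing one generator across calls would not.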

Step 4: Assemble Frames into Video

import cv2
from pathlib import Path

def frames_to_video(frames_dir: str, output_path: str, fps: int = 24):
    frames = sorted(Path(frames_dir).glob("frame_*.png"))
    if not frames:
        raise ValueError(f"No frames found in {frames_dir}")

    sample = cv2.imread(str(frames[0]))
    h, w = sample.shape[:2]

    fourcc = cv2.VideoWriter_fourcc(*"mp4v")
    writer = cv2.VideoWriter(output_path, fourcc, fps, (w, h))

    for frame_path in frames:
        frame = cv2.imread(str(frame_path))
        writer.write(frame)

    writer.release()
    print(f"Video saved: {output_path} ({len(frames)} frames @ {fps}fps)")

frames_to_video("output/", "controlled_output.mp4", fps=24)

Expected: A playable .mp4 where the character holds the extracted pose throughout.


Verification

# Quick check: compare the first and last frame side by side
python - <<'EOF'
from pathlib import Path
from PIL import Image

frames = sorted(Path("output").glob("frame_*.png"))
first = Image.open(frames[0]).resize((512, 512))
last = Image.open(frames[-1]).resize((512, 512))

combined = Image.new("RGB", (1024, 512))
combined.paste(first, (0, 0))
combined.paste(last, (512, 0))
combined.save("comparison.png")
EOF

You should see: Character in similar pose at frame 0 and final frame. Major body landmarks (shoulders, hips, knees) should be within a few pixels of the pose map.


Handling Temporal Flicker

Frame-by-frame generation produces flicker — each frame is slightly different even with the same seed. Two practical fixes:
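Flicker is worth quantifying before you pick a fix. A rough score (a hypothetical metric, not a standard one) is the mean absolute pixel difference between consecutive frames; higher means more flicker:

```python
import numpy as np

def flicker_score(frames: list[np.ndarray]) -> float:
    """Mean absolute pixel difference between consecutive (H, W, 3) frames."""
    diffs = [np.mean(np.abs(a.astype(np.int16) - b.astype(np.int16)))
             for a, b in zip(frames, frames[1:])]
    return float(np.mean(diffs))
```

Comparing the score before and after a fix tells you whether interpolation actually helped or just blurred the motion.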

Option A: Frame interpolation (fast)

Run a frame-interpolation model such as RIFE or FILM over the rendered frames to smooth frame-to-frame differences (RIFE generally gives better quality on this kind of footage). Both are distributed as standalone repositories and prebuilt binaries, e.g. the rife-ncnn-vulkan releases, rather than as simple pip installs.

Option B: Use a video-native pipeline (better)

AnimateDiff and CogVideoX have built-in temporal attention that reduces flicker. Both support ControlNet conditioning:

# AnimateDiff with ControlNet
from diffusers import AnimateDiffControlNetPipeline, MotionAdapter

# The motion adapter loads separately and wraps an SD 1.5 base model
adapter = MotionAdapter.from_pretrained(
    "guoyww/animatediff-motion-adapter-v1-5-3",
    torch_dtype=torch.float16
)

pipe = AnimateDiffControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    motion_adapter=adapter,
    controlnet=controlnet,
    torch_dtype=torch.float16
).to("cuda")

# Pass all pose frames at once — temporal attention handles consistency
video = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    conditioning_frames=[Image.fromarray(p) for p in pose_sequence],
    num_frames=len(pose_sequence),
    generator=generator,
).frames[0]

AnimateDiff processes the full clip in a single forward pass, so temporal consistency is handled inside the model rather than bolted on after.


What You Learned

  • ControlNet injects pose structure frame-by-frame; it doesn't guarantee temporal consistency on its own
  • dwpose outperforms classic OpenPose on partial occlusion and fast motion
  • Fixed seed + high controlnet_conditioning_scale is the simplest path to stable appearance
  • For production quality, use AnimateDiff or CogVideoX instead of single-frame SD pipelines

Limitation: ControlNet pose maps encode skeleton position, not mesh or depth. Camera angle changes will still cause visible artifacts — use depth ControlNet in parallel if the camera moves.

When NOT to use this: If your reference performer and output character have very different body proportions, the skeleton transfer will look wrong. Consider using a reference performer closer in build, or use 3D rigging (Blender + ControlNet) instead.
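When proportions differ only mildly, a cheap mitigation is to rescale the extracted keypoints before rendering the pose map, assuming you have access to raw (x, y) keypoints rather than only the rendered image. A minimal sketch (rescale_skeleton is hypothetical; indices 11 and 12 assume a COCO-style layout for left/right hip, so adjust for your detector's format):

```python
import numpy as np

def rescale_skeleton(keypoints: np.ndarray, scale: float,
                     hip_idx: tuple = (11, 12)) -> np.ndarray:
    """Uniformly rescale (N, 2) keypoints about the hip midpoint."""
    center = keypoints[list(hip_idx)].mean(axis=0)
    return center + (keypoints - center) * scale
```

Scaling about the hips keeps the character anchored in place while shrinking or stretching limb lengths toward the target build.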


Tested on PyTorch 2.3, diffusers 0.27, CUDA 12.1 — Ubuntu 22.04 and Windows 11 WSL2