Accelerating FastVideo on AMD GPUs with TeaCache#

August 19, 2025 by Sopiko Kurdadze.

Video generation is entering a new era, powered by diffusion models that deliver photorealistic and temporally consistent results from text prompts. Models like Wan2.1 push the boundaries of what’s possible in AI-generated content, but to unlock their full potential, inference performance must scale with both model complexity and hardware capabilities.

This blog introduces FastVideo running on AMD Instinct™ GPUs with ROCm - a step toward efficient, low-latency video generation on AMD platforms. You’ll learn how to enable the TeaCache optimization, set up a reproducible single-GPU inference environment, and generate high-quality videos using the Wan2.1 model - all optimized for ROCm.

This work is part of a broader set of efforts to make video generation workflows on AMD GPUs faster, more flexible, and easier to adopt. If you’re also interested in fine-tuning Wan2.2 for custom domains, check out our guide on Wan2.2 Fine-Tuning: Tailoring an Advanced Video Generation Model. For a graphical, node-based approach to building video generation pipelines, see ComfyUI for Video Generation. And for extending video creation with editing, composition, and control capabilities, explore our work on All-in-One Video Editing with VACE. Together, these tools form a comprehensive toolkit for high-performance video generation and editing on AMD hardware.

FastVideo#

FastVideo introduces several optimization techniques aimed at accelerating inference and training of video generation models. Key optimization strategies for inference include TeaCache, Sliding Tile Attention, and Sage Attention.

FastVideo initially supported only the CUDA platform, but now also supports Apple’s MPS and CPU backends. We are introducing the first steps toward supporting ROCm as an additional accelerator platform, as detailed in our contribution to FastVideo PR#669. That contribution focused on the TeaCache optimization and single-GPU inference, which we expand on in this blog.

What is TeaCache?#

TeaCache, developed by ali-vilab, stands for Timestep Embedding Aware Cache. It is a training-free caching approach that estimates and leverages the fluctuating differences in model outputs across timesteps - thereby accelerating inference. TeaCache is effective across video, image, and audio diffusion models.

The core idea behind TeaCache is based on the observation that outputs of diffusion models tend to be similar between consecutive timesteps in the denoising loop. Previous caching methods, such as uniform caching strategies, do not account for variations in output differences between timesteps and therefore fail to maximize cache efficiency.

A more effective caching strategy would reuse cached outputs more frequently when the change between consecutive outputs is minimal. However, this difference cannot be known in advance, since measuring it would require computing the current output. To overcome this limitation, TeaCache exploits the prior that a model’s inputs and outputs are strongly correlated: the change in the timestep-embedding-modulated inputs, which is cheap to compute, serves as an estimate of the change in the outputs, and cached results are reused whenever the accumulated estimated change stays below a threshold.
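
To make this concrete, below is a minimal, illustrative sketch of the caching decision. This is not FastVideo’s actual implementation; the class, argument names, and the default threshold are assumptions made for illustration only.

# Illustrative sketch of the TeaCache idea (not FastVideo's actual code):
# accumulate a relative L1 distance between consecutive timestep-embedding-
# modulated inputs, and skip the expensive transformer forward pass by reusing
# a cached residual while the accumulated change stays below a threshold.
import torch


def rel_l1(curr: torch.Tensor, prev: torch.Tensor) -> float:
    """Relative L1 distance between consecutive modulated inputs."""
    return ((curr - prev).abs().mean() / prev.abs().mean()).item()


class TeaCacheSketch:
    def __init__(self, block, threshold: float = 0.1):
        self.block = block              # expensive transformer block(s) to cache around
        self.threshold = threshold      # input drift tolerated before recomputing
        self.accumulated = 0.0
        self.prev_modulated = None
        self.cached_residual = None     # output delta saved at the last full compute

    def forward(self, hidden_states, modulated_input):
        can_skip = False
        if self.prev_modulated is not None and self.cached_residual is not None:
            self.accumulated += rel_l1(modulated_input, self.prev_modulated)
            can_skip = self.accumulated < self.threshold
        self.prev_modulated = modulated_input

        if can_skip:
            # Cheap path: reuse the cached residual instead of running the block.
            return hidden_states + self.cached_residual

        # Full path: run the block, cache its residual, and reset the budget.
        out = self.block(hidden_states, modulated_input)
        self.cached_residual = out - hidden_states
        self.accumulated = 0.0
        return out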

Platform and Hardware Prerequisites#

To get started, you’ll need a system that satisfies the following:

  • An AMD Instinct™ GPU (or another ROCm-capable AMD GPU)

  • ROCm 6.3 or later installed on the host

  • Docker installed, with access to the GPU devices (/dev/kfd and /dev/dri)

Step-by-Step Setup for FastVideo on ROCm#

Follow the steps below to set up a complete, ROCm-enabled FastVideo inference environment.

1. Pull the Docker image#

Make sure your system is ROCm-ready:

rocm-smi

And then pull the base container:

docker pull rocm/pytorch-training:v25.6

This image comes pre-loaded with most libraries you’ll need for inference, including:

  • torch==2.8.0a0+git7d205b2

  • flash_attn==3.0.0.post1

  • transformers==4.46.3

2. Launch the Docker Container#

Launch the Docker container in detached mode and map the necessary directories:

docker run -d \
  --network=host \
  --device=/dev/kfd \
  --device=/dev/dri \
  --group-add=video \
  --ipc=host \
  --cap-add=SYS_PTRACE \
  --security-opt seccomp=unconfined \
  --privileged \
  --name fastvideo-tmp \
  -v $(pwd):/workspace/ \
  rocm/pytorch-training:v25.6 \
  tail -f /dev/null

Note: This command mounts the current directory $(pwd) to the /workspace directory in the container.

Enter the container:

docker exec -it fastvideo-tmp bash

To clean up later, use:

docker stop fastvideo-tmp
docker rm fastvideo-tmp

3. Install Dependencies#

FastVideo Framework#

git clone https://github.com/hao-ai-lab/FastVideo.git
cd FastVideo
git checkout 5452369749432b3b0d6d0f3fb5f8001e2ff95631 # Commit introducing ROCm support

Edit pyproject.toml to avoid version conflicts by including only the packages that are missing from the container. Replace the dependencies list with the following:

dependencies = [
    # Core Libraries
    "scipy==1.14.1", "six==1.16.0", "h5py==3.12.1",

    # Machine Learning & Transformers
    "timm==1.0.11", "peft>=0.15.0", "diffusers>=0.33.1",

    # Computer Vision & Image Processing
    "opencv-python==4.10.0.84", "pillow>=10.3.0", "imageio==2.36.0",
    "imageio-ffmpeg==0.5.1", "einops",

    # Experiment Tracking & Logging
    "wandb>=0.19.11", "loguru", "test-tube==0.7.5",

    # Miscellaneous Utilities
    "tqdm", "pytest", "PyYAML==6.0.1", "protobuf>=5.28.3",
    "gradio>=5.22.0", "moviepy==1.0.3", "flask",
    "flask_restful", "aiohttp", "huggingface_hub", "cloudpickle",
    # System & Monitoring Tools
    "gpustat", "watch", "remote-pdb",

    # Kernel & Packaging
    "wheel",

    # Training Dependencies
    "torchdata",
    "pyarrow",
    "datasets",
    "av",
]

Then install FastVideo in editable mode:

pip install -e .
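
As a quick sanity check before moving on, you can verify inside the container that the editable install imports cleanly and that PyTorch sees the GPU. On ROCm builds of PyTorch, HIP devices are exposed through the familiar torch.cuda API.

# Optional sanity check: confirm FastVideo resolves to the cloned checkout
# and that the ROCm build of PyTorch can see the GPU.
import fastvideo
import torch

print("fastvideo imported from:", fastvideo.__file__)
print("GPU visible:", torch.cuda.is_available())
if torch.cuda.is_available():
    # ROCm devices are reported through the torch.cuda interface.
    print("Device:", torch.cuda.get_device_name(0))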

Generate a Video with Wan2.1 and TeaCache#

Create the following script:

cat > generate_video.py << 'EOF'
from fastvideo import VideoGenerator

def main():
    generator = VideoGenerator.from_pretrained(
        "Wan-AI/Wan2.1-T2V-1.3B-Diffusers",
        num_gpus=1,
    )
    prompt = "Red panda playing in a snowy forest, surrounded by pine trees and falling snowflakes"
    video = generator.generate_video(
        prompt,
        return_frames=True,
        output_path="my_videos/",
        save_video=True,
        enable_teacache=True
    )

if __name__ == "__main__":
    main()
EOF

Run it:

python generate_video.py

Generation time with TeaCache enabled:

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:55<00:00,  1.11s/it]
INFO 07-31 12:45:25 [video_generator.py:310] Generated successfully in 72.17 seconds

Generation time without TeaCache:

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [01:41<00:00,  2.02s/it]
INFO 07-31 13:06:54 [video_generator.py:310] Generated successfully in 118.19 seconds

With TeaCache enabled, end-to-end generation time drops from 118.19 seconds to 72.17 seconds, roughly a 1.6× speedup. As the video comparison below shows, the clip generated with TeaCache has almost the same visual quality as the one generated without it, confirming that caching does not noticeably affect the output.
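
If you want to quantify the similarity yourself, a minimal sketch along these lines computes per-frame PSNR between the two generated clips using OpenCV, which is already listed in the dependencies above. This was not part of the original benchmark, and the file paths are hypothetical placeholders for your own outputs.

# Compare two generated clips frame by frame with PSNR (higher = more similar).
import cv2
import numpy as np

def read_frames(path):
    """Read all frames of a video as a list of HxWx3 uint8 arrays."""
    cap = cv2.VideoCapture(path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(frame)
    cap.release()
    return frames

def psnr(a, b):
    mse = np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(255.0 ** 2 / mse)

# Hypothetical paths: one clip generated with TeaCache, one without.
baseline = read_frames("my_videos/without_teacache.mp4")
teacache = read_frames("my_videos/with_teacache.mp4")

scores = [psnr(a, b) for a, b in zip(baseline, teacache)]
print(f"Mean PSNR over {len(scores)} frames: {np.mean(scores):.2f} dB")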

Supported Attention Backends#

FastVideo supports several attention backends on ROCm.

Flash Attention 2 and 3#

The ROCm-native ROCm/flash-attention build is already installed in the Docker image.

FASTVIDEO_ATTENTION_BACKEND=FLASH_ATTN python generate_video.py

Torch SDPA#

Torch Scaled Dot Product Attention (SDPA) is provided by the torch library that is already installed in the image.

FASTVIDEO_ATTENTION_BACKEND=TORCH_SDPA python generate_video.py
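
If you prefer to select the backend from Python rather than from the shell, you can set the same environment variable before FastVideo is initialized. This is a small sketch, assuming the variable is read when the generator is created.

import os

# Must be set before FastVideo initializes its attention layers.
os.environ["FASTVIDEO_ATTENTION_BACKEND"] = "TORCH_SDPA"  # or "FLASH_ATTN"

from fastvideo import VideoGenerator

generator = VideoGenerator.from_pretrained(
    "Wan-AI/Wan2.1-T2V-1.3B-Diffusers",
    num_gpus=1,
)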

Current Limitations#

⚠️ Not yet supported:

  • SLIDING_TILE_ATTN

  • SAGE_ATTN

Summary#

This blog demonstrated how to enable FastVideo inference on AMD Instinct™ GPUs, with a focus on integrating the TeaCache optimization for faster, more efficient video generation. We walked through setting up a fully functional inference environment using the official ROCm PyTorch Docker image, installing FastVideo with ROCm support, and running single-GPU inference with the Wan2.1 model. Along the way, you learned how to take advantage of TeaCache to reduce inference time and how to configure supported attention backends such as Flash Attention and Torch SDPA.

We are actively tracking emerging technologies and products in the video generation and editing domains, aiming to deliver an optimized and seamless user experience for video generation on AMD GPUs. Our focus is on ensuring ease of use and maximizing performance across video generation tasks, as exemplified by our recent blog posts on fine-tuning the Wan2.2 video generation model, ComfyUI for Video Generation, and All-in-One Video Editing with VACE. In parallel, we are developing additional playbooks covering model inference, model serving, and video generation workflow management.

Disclaimers#

Third-party content is licensed to you directly by the third party that owns the content and is not licensed to you by AMD. ALL LINKED THIRD-PARTY CONTENT IS PROVIDED “AS IS” WITHOUT A WARRANTY OF ANY KIND. USE OF SUCH THIRD-PARTY CONTENT IS DONE AT YOUR SOLE DISCRETION AND UNDER NO CIRCUMSTANCES WILL AMD BE LIABLE TO YOU FOR ANY THIRD-PARTY CONTENT. YOU ASSUME ALL RISK AND ARE SOLELY RESPONSIBLE FOR ANY DAMAGES THAT MAY ARISE FROM YOUR USE OF THIRD-PARTY CONTENT.