Mar 16, 2026
From 11 Minutes to 4 Minutes: End-to-End Acceleration for Wan Video Generation on NVIDIA H200 vs. AMD MI300X
1. Introduction

Wan is a diffusion-based generative model for high-quality video generation, producing detailed and temporally consistent outputs via iterative denoising. It shows strong performance in visual generation tasks, particularly in maintaining consistent motion and style across frames. However, Wan faces a major bottleneck in generation speed (often exceeding 10 minutes per video), which severely constrains its use in production environments. The goal of our study is to reduce end-to-end latency to the minute level through both algorithmic optimization (e.g., adaptive frame segmentation) and system-level enhancement (e.g., parallelism management). This article presents our efforts to improve each stage of the Wan video generation pipeline. Furthermore, we provide a performance study across two flagship accelerators, the NVIDIA H200 and the AMD Instinct MI300X, with analysis from an architecture perspective.

Key takeaways:
- End-to-end latency reduced by 34–40% on H200 with full-pipeline optimizations.
- Adaptive VRAM-aware clip scheduling reduces tail padding and improves GPU utilization.
- With SGLang, serving latency improves by ~2.3× on a single H200 for 20 denoising steps.

The generation process in Wan can be summarized as a two-stage pipeline:

Stage 1: Preprocessing. The input video and reference image are processed to extract keypoints, set up bounding boxes, obtain alignment information, and so on. This stage may use components such as YOLO and ViTPose.

Stage 2: Inference. Using the preprocessed inputs from Stage 1, the diffusion backbone (e.g., DiT) generates frames segment by segment using a sliding-window strategy, which concatenates overlapping segments (discussed below).

Preprocessing runs on the CPU and inference runs on the GPU, forming an execution pipeline. Hence, for a video of N segments with per-segment preprocessing time T_pre and inference time T_inf, one reasonable formulation of the end-to-end latency of Wan (assuming the two stages overlap across segments) is:

T_e2e = T_pre + (N − 1) · max(T_pre, T_inf) + T_inf

2. Methods

This section describes our methods.
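The two-stage pipeline latency can be sketched as follows. This is an illustrative model only: the function names and timings are ours, not part of Wan's codebase, and it assumes a simple two-stage pipeline where the CPU preprocesses segment i+1 while the GPU runs inference on segment i.

```python
# Illustrative latency model for a two-stage CPU/GPU pipeline over N segments.
# t_pre: per-segment preprocessing time (CPU); t_inf: per-segment inference time (GPU).

def latency_sequential(t_pre, t_inf, n_segments):
    """No overlap: every segment pays both stages back to back."""
    return n_segments * (t_pre + t_inf)

def latency_pipelined(t_pre, t_inf, n_segments):
    """Overlap stages across segments: after the pipeline fills, each
    segment only costs the slower of the two stages."""
    return t_pre + (n_segments - 1) * max(t_pre, t_inf) + t_inf
```

For example, with t_pre = 2, t_inf = 3, and 2 segments, the sequential latency is 10 while the pipelined latency is 8; the gap widens with more segments, which is why removing the CPU-side bottleneck (Section 2.2) matters.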
2.1 Adaptive Frame Segmentation

2.1.1 Problem Modeling

In long video generation tasks, segmenting long sequences into clips that the diffusion model can process is challenging. Traditional methods use a fixed stride for segmentation. This strategy does not consider GPU memory capacity, and it can waste edge frames when the overlap is too large.

2.1.2 Solution

To address these problems, we propose a new segmentation strategy that respects the GPU memory budget.

2.1.3 In-Depth Analysis

2.2 Removing Dependencies for Thread-Level Parallelism

2.2.1 Problem Modeling: Performance Bottleneck in Preprocessing

During preprocessing, each frame is processed by one CPU core. Although preprocessing can be parallelized across multiple cores, thread-management overhead can offset the performance gain. As a result, CPU preprocessing is slow and creates pipeline bubbles on the GPU.

2.2.2 Solution: Parallel Preprocessing

To address this problem, we relax the dependencies between frame-preprocessing tasks so that multiple frames can be processed concurrently on multiple cores. In particular, we create a thread pool whose size equals the number of CPU cores on the server; each thread preprocesses one frame. Parallel frame preprocessing is possible because the workload is typically implemented in C/C++ libraries (e.g., OpenCV) and is therefore not constrained by the Python Global Interpreter Lock (GIL). This CPU-side strategy also composes with multi-GPU setups, where FSDP and Ulysses parallelism accelerate Wan on the GPU side; we evaluate the feasibility of supporting multiple GPUs.

2.2.3 In-Depth Analysis

No limitation from the GIL.
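One way the VRAM-aware strategy of Section 2.1.2 could look in code is sketched below. All names and the per-frame memory model are hypothetical (ours, not the actual implementation): the idea is to pick the largest clip that fits in free VRAM, and to snap the final clip to the end of the sequence instead of padding it, so no edge frames are wasted.

```python
# Hypothetical sketch of VRAM-aware clip segmentation.

def max_clip_len(free_vram_bytes, bytes_per_frame, reserve=0.1):
    """Largest clip length the GPU can hold, keeping a safety reserve."""
    usable = free_vram_bytes * (1 - reserve)
    return max(1, int(usable // bytes_per_frame))

def segment(n_frames, clip_len, overlap):
    """Split [0, n_frames) into overlapping clips. Instead of padding the
    tail clip, shift it back so it ends exactly at the last frame."""
    stride = clip_len - overlap
    clips, start = [], 0
    while start + clip_len < n_frames:
        clips.append((start, start + clip_len))
        start += stride
    clips.append((max(0, n_frames - clip_len), n_frames))  # snap last clip to the end
    return clips
```

For instance, `segment(10, 4, 1)` yields `(0, 4)`, `(3, 7)`, `(6, 10)`: full coverage with one frame of overlap between neighbors and no tail padding, which is the behavior the fixed-stride baseline lacks.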
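The thread pool of Section 2.2.2 can be sketched with Python's standard `concurrent.futures`. The `preprocess_frame` body below is a placeholder for the real pose/alignment work; in practice the speedup depends on the underlying C/C++ kernels (OpenCV, NumPy) releasing the GIL.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def preprocess_frame(frame):
    # Placeholder for the real per-frame work (pose2d inference,
    # cropping, alignment, etc.), which runs in GIL-releasing C/C++ code.
    return frame * 2

def preprocess_video(frames):
    workers = os.cpu_count() or 1  # pool size = number of CPU cores
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves frame order, so downstream segmentation is unchanged.
        return list(pool.map(preprocess_frame, frames))
```

Using `map()` rather than per-frame `submit()` keeps results in frame order and amortizes thread-management overhead across the whole video, which is the overhead the fixed per-core scheme paid per frame.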
Although the GIL is notorious for limiting multi-threaded CPU performance, our profiling shows that the heavy preprocessing tasks (e.g., pose2d inference and image transformations) rely on underlying C/C++ kernels (such as NumPy and PyTorch operations). These operations release the GIL during preprocessing, allowing our thread pool to make full use of the CPU cores.

Thread-level parallelism. By preprocessing frames concurrently, we leverage thread-level parallelism to improve overall preprocessing throughput. This not only maximizes memory-bandwidth utilization but also shifts the optimization from latency-oriented to throughput-oriented. Since preprocessing is the performance bottleneck of the two-stage pipeline, reducing preprocessing time speeds up the whole video generation pipeline.

3. Evaluation

In the rest of this section, we refer to the pipeline with our optimization techniques as Yotta-Wan.

3.1 Results

3.1.1 End-to-End Acceleration

Table 1 shows the results collected on H200.

3.1.2 NVIDIA H200 vs. AMD MI300X

3.1.3 Acceleration Effects with SGLang

To further validate our solution, we evaluate our method with SGLang.

3.1.4 Discussion and Parameter Sensitivity

4. Conclusions

Through memory-aware dynamic segmentation and thread-level parallelism management, we reduce the inference time of the Wan model by over 30%. Empowered by high-end accelerators (H200/MI300X), our efforts take one step further toward high-quality video generation for real-time interaction.

Discussion 1: Adaptive VRAM-Aware Segmentation

Discussion 2: Bandwidth-Centric Hardware Benchmarking

When comparing the NVIDIA H200 and AMD MI300X, memory bandwidth, rather than raw TFLOPS alone, is often the deciding factor, especially for large-model workloads such as video generation.
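A back-of-envelope way to see why bandwidth dominates: for a memory-bound kernel, runtime is roughly bytes moved divided by achieved bandwidth. The sketch below uses the vendors' published peak HBM figures (about 4.8 TB/s for H200 and 5.3 TB/s for MI300X); the efficiency factor is an assumption, since real kernels rarely reach peak.

```python
# Roofline-style lower bound for memory-bound kernels.
# Peak bandwidths are the published HBM specs; 'efficiency' is an
# assumed fraction of peak actually achieved in practice.

PEAK_BW_TBPS = {"H200": 4.8, "MI300X": 5.3}

def memory_bound_ms(bytes_moved, gpu, efficiency=0.7):
    """Estimated kernel time (ms) if the kernel is purely bandwidth-bound."""
    bw_bytes_per_s = PEAK_BW_TBPS[gpu] * 1e12 * efficiency
    return bytes_moved / bw_bytes_per_s * 1e3
```

Under this model, a kernel that moves the same number of bytes finishes proportionally faster on MI300X than on H200 (5.3 vs. 4.8 TB/s), regardless of either chip's TFLOPS rating, which is the point of Discussion 2.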