March 3, 2025 by Da Li
From 11 Minutes to 4 Minutes: End-to-End Acceleration for Wan Video Generation on NVIDIA H200 vs. AMD MI300X
1. Introduction
Wan is a diffusion-based generative model for high-quality video generation, producing detailed and temporally consistent outputs via iterative denoising. It showcases strong performance in visual generation tasks, particularly in producing consistent motion and style across frames.
The Wan model faces a major bottleneck in generation speed—often exceeding 10 minutes—which severely constrains its use in production environments. The goal of our study is to reduce end-to-end latency to the minute level through both algorithmic optimization (e.g., adaptive frame segmentation) and system-level enhancement (e.g., parallelism management). This article presents our efforts to improve each stage of the Wan video generation pipeline. Furthermore, we provide a performance study across NVIDIA H200 and AMD Instinct MI300X (two flagship accelerators), giving performance analysis from an architecture perspective.
Key takeaways
- End-to-end latency reduced by 34–40% on H200 with full-pipeline optimizations.
- Adaptive VRAM-aware clip scheduling reduces tail padding and improves GPU utilization.
- CPU preprocessing becomes throughput-oriented via thread-level parallelism (heavy C/C++ kernels release the GIL).
- With SGLang, serving latency improves by ~2.3× on a single H200 for 20 denoising steps.
The generation process in Wan can be summarized as a two-stage pipeline:
- Stage 1: Preprocessing. The input video and reference image are processed to extract keypoints, set up bounding boxes, obtain alignment information, etc. This stage may use components such as YOLO and ViTPose.
- Stage 2: Inference. Using the preprocessed input from the first stage, the diffusion/generation backbone (e.g., DiT) generates frames segment by segment using a sliding-window strategy, which employs segment concatenation and overlapping (discussed below).
The preprocessing and inference run on CPU and GPU, respectively, forming an execution pipeline. Hence, the end-to-end latency of Wan can be formulated as:

$$T_{\text{e2e}} = T_{\text{pre}} + T_{\text{infer}}$$

where $T_{\text{pre}}$ and $T_{\text{infer}}$ are the preprocessing and inference latencies, respectively.
2. Methods
We describe our methods in this section.
2.1 Adaptive Frame Segmentation
2.1.1 Problem Modeling
The second stage in Wan utilizes a sliding-window strategy: Wan generates long images and videos by working on small overlapping windows instead of processing the entire canvas or timeline at once. It splits a large image (spatially) or a long video (temporally) into overlapping segments (or windows). A segment (also referred to as a clip/window) is the basic temporal unit used in long-video diffusion inference. Instead of generating all frames at once, the model processes the video in multiple overlapping segments of length $f$. The first window is generated normally. Each subsequent window is generated conditioned on the previously created region, using cached features and cross-attention to maintain continuity. The overlapping areas are blended so there are no seams or flicker. This process repeats until the full image or video is complete.
Assume that $F$ and $F_{\text{in}}$ are the number of frames to generate and the number of real frames in the input video, respectively. $F$ is often larger than $F_{\text{in}}$, resulting in computational redundancy due to concatenation-divisibility and stride constraints. We formulate the relationship between $F$ and $F_{\text{in}}$ as follows:

$$F = f + (n - 1)(f - o) \geq F_{\text{in}}$$

where $f$ and $o$ are the length of a single clip and the overlap between adjacent clips, respectively, defined in terms of the number of frames, and $n$ is the number of clips.
Problems: When $F_{\text{in}}$ cannot be perfectly divided by the stride ($s = f - o$), the input sequence is padded to the nearest divisible length, causing the generation of useless frames and a linear increase in computational overhead.
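To make this padding overhead concrete, the following is a minimal sketch (the helper name and the example numbers are illustrative, not the production code) of how a fixed-stride plan pads the tail of a frame sequence:

```python
import math

def fixed_stride_plan(num_input_frames: int, clip_len: int, overlap: int):
    """Fixed-stride segmentation: count the windows needed to cover the
    input, then measure how many padded (wasted) frames the tail adds."""
    stride = clip_len - overlap
    # Number of windows required to cover all input frames.
    num_clips = max(1, math.ceil((num_input_frames - clip_len) / stride) + 1)
    # Frames actually generated once the tail is padded to a full window.
    generated = clip_len + (num_clips - 1) * stride
    return num_clips, generated, generated - num_input_frames

# Example: 308 input frames with clip_len=77 and overlap=4.
clips, generated, wasted = fixed_stride_plan(308, 77, 4)
```

A short input also pays this tax: 60 input frames with `clip_len=77` still force one full 77-frame window, wasting 17 frames of computation.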
In fact, in long-video generation tasks, segmenting long sequences into clips that the diffusion model can process is challenging. Traditional methods often use a fixed stride for segmentation. This strategy does not consider GPU memory capacity, and can waste edge frames when the overlap is too large.
2.1.2 Solution
To address the above problems, we propose a new strategy respecting the constraint of GPU memory capacity.
Let $F$ be the new total number of target frames and $n$ be the number of segments. To ensure temporal continuity and maximize coverage, we establish the following constraint equation:

$$f + (n - 1)(f - o) = F$$

We must perform diffusion-model inference $n$ times. In practice, we limit the upper bound of $f$ by the total number of frames in the input video.
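A minimal sketch of this constrained search, assuming a fixed overlap and a VRAM-derived upper bound on the clip length (the function and parameter names are illustrative, not the actual implementation):

```python
def choose_clip_len(total_frames: int, overlap: int, max_clip_len: int):
    """Search for (clip_len, num_segments) with exact coverage:
    clip_len + (num_segments - 1) * (clip_len - overlap) == total_frames.
    Prefer the largest clip_len that fits in VRAM (fewest inference passes)."""
    for f in range(max_clip_len, overlap, -1):   # try the largest f first
        stride = f - overlap
        if (total_frames - f) % stride == 0:     # exact coverage, no padding
            n = (total_frames - f) // stride + 1
            return f, n
    return None

# Example with hypothetical numbers: cover 197 frames with overlap 1,
# assuming VRAM allows at most 100 frames per window.
best = choose_clip_len(197, 1, 100)
```

With these inputs the search settles on a 99-frame window covered in two passes, since 99 + (2 − 1) × 98 = 197 exactly.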
2.1.3 In-Depth Analysis
We implement the above strategy and add a knob `auto_set_lim` that allows users to enable it. When enabled, the system solves the optimization problem to find the optimal $f$ and $n$ for a given $F$. Our method has the following benefits.
- Maximization of VRAM utilization: By adaptively changing $f$ to maximize the utilization of GPU memory, we avoid the padding required by traditional methods and hence avoid wasting compute; the GPU's tensor cores also remain highly utilized at each time step.
- Temporal consistency: Our method employs a constant overlap $o$ across segments, keeping the video context smooth during frame concatenation.
2.2 Removing Dependency for Thread-Level Parallelism
2.2.1 Problem Modeling: Performance Bottleneck in Preprocessing
In the preprocessing stage (e.g., 2D pose estimation and frame extraction), there are dependencies between the processing of consecutive frames. For a video sequence containing $N$ frames, the preprocessing latency is a cumulative sum of single-frame processing times:

$$T_{\text{pre}} = \sum_{i=1}^{N} t_i$$

where $t_i$ is the processing time of frame $i$.
During preprocessing, each frame is processed by one CPU core. Although we can parallelize preprocessing by using multiple CPU cores, thread management overhead can offset the performance gain. Hence, preprocessing on CPU is slow, leading to pipeline bubbles on GPU.
2.2.2 Solution: Parallel Preprocessing
To address this problem, we relax the dependencies between frame-preprocessing tasks so that multiple frames can be processed at the same time on multiple cores. In particular, we create a thread pool whose size equals the number of CPU cores in the server. Each CPU core is assigned one thread, and each thread preprocesses one frame. Parallel frame preprocessing is possible because the workload is typically implemented in C/C++ libraries (e.g., OpenCV) and is therefore not constrained by the Python Global Interpreter Lock (GIL).
With the above solution, the preprocessing latency is formulated as follows:

$$T_{\text{pre}} = \left\lceil \frac{N}{C} \right\rceil \cdot \max_i t_i + t_{\text{mgmt}}$$

where $t_{\text{mgmt}}$ is the thread-management overhead, and $N$ and $C$ denote the number of frames and the number of CPU cores, respectively.
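The thread-pool scheme can be sketched as follows. Here `preprocess_frame` is a stand-in for the real per-frame work (the actual pipeline runs components such as pose estimation), and the NumPy call merely illustrates a C kernel that releases the GIL so threads can run in parallel:

```python
import concurrent.futures
import os
import numpy as np

def preprocess_frame(frame: np.ndarray) -> np.ndarray:
    # Stand-in for real per-frame work (pose estimation, resizing, ...).
    # NumPy executes in C and releases the GIL, so threads run in parallel.
    return (frame.astype(np.float32) / 255.0).mean(axis=-1)

def preprocess_video(frames):
    # One thread per CPU core; each thread handles one frame at a time.
    workers = os.cpu_count() or 1
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(preprocess_frame, frames))

frames = [np.zeros((720, 1280, 3), dtype=np.uint8) for _ in range(8)]
outputs = preprocess_video(frames)
```

`pool.map` preserves input order, so downstream stages can consume the results as an ordinary frame sequence.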
The above CPU preprocessing strategy also works with multiple GPUs, where the GPUs may use FSDP and Ulysses parallelism to accelerate Wan. We evaluate the feasibility of supporting multiple GPUs.
2.2.3 In-Depth Analysis
- No limitation from the GIL. Although the GIL is notorious for limiting multi-thread performance on CPU, our profiling shows that the heavy tasks in preprocessing (e.g., `pose2d` inference and image transformations) rely heavily on underlying C/C++ kernels (such as NumPy and PyTorch operations). These operations release the GIL during execution, allowing our thread pool to make full use of the CPU cores.
- Thread-level parallelism. By preprocessing frames concurrently, we leverage thread-level parallelism to improve the overall throughput of preprocessing. This not only maximizes the utilization of memory bandwidth, but also shifts our optimization from latency-oriented to throughput-oriented. Since preprocessing is the performance bottleneck of the two-stage pipeline, reducing preprocessing time speeds up the whole video-generation pipeline.
3. Evaluation
- Baseline: Vanilla Wan (with/without SGLang).
- Hardware A: GPU: NVIDIA H200 (1 or 2 cards, 141 GB HBM3e each); CPU: Intel(R) Xeon(R) Platinum 8460Y+ (18 or 36 cores).
- Hardware B: GPU: AMD Instinct MI300X (1 card, 192 GB HBM3); CPU: Intel(R) Xeon(R) Platinum 8568Y+ (20 cores).
- Software: We report denoising steps, `clip_len` ($f$), and key preprocessing flags (e.g., `replace_flag`, `retarget_flag`) inline in the tables for reproducibility.
We refer to the evaluation results using our optimization techniques as Yotta-Wan in the rest of this section.
3.1 Results
3.1.1 End-to-End Acceleration
Table 1 shows the results collected on H200.
Table 1: Evaluation results on H200
| Version | Hardware | FPS | clip_len ($f$) | Preprocess (s) | Inference (s) | Total (s) |
|---|---|---|---|---|---|---|
| Vanilla | 2x H200, 36 vCPU | 30 | - | 229 | 129 | 325 |
| Yotta-Wan | 2x H200, 36 vCPU | 30 | 153 | 79 | 133 | 212 (↓34.8%) |
| Vanilla | 1x H200, 18 vCPU | 30 | - | 229 | 232 | 684 |
| Yotta-Wan | 1x H200, 18 vCPU | 30 | 153 | 120 | 290 | 410 (↓40.1%) |
- Preprocessing stage: The time is reduced from 229s to 79s (↓65.5%), primarily due to thread-level parallelism for preprocessing and data loading.
- Inference stage: The time is reduced from 129s to 79s (↓38.7%).
- Total latency: Overall, the time is reduced by at least 34%.
3.1.2 NVIDIA H200 vs. AMD MI300X
We use LoRA (lightx2v/Wan2.1-I2V-14B-720P-StepDistill-CfgDistill-Lightx2v) to improve performance at 4 denoising steps. LoRA lets the model adapt to tasks without costly retraining, offering a "plug-and-play" solution for diverse application scenarios. To quantify the practicality of this approach, we also measure the latency incurred when loading the LoRA adapters. In all tables below, Denoising Steps denotes the number of iterative denoising steps used during diffusion inference, and clip_len corresponds to $f$.
Table 2: Evaluation results on H200 and MI300X
| Setup | Hardware | Preprocess | Inference | Total | clip_len ($f$) | Frames ($F$) | Denoising Steps |
|---|---|---|---|---|---|---|---|
| NVIDIA Vanilla | 1x H200, 18 vCPU | 94s (no replace) | 181s | 275s | 77 | 153 | 4 (no LoRA) |
| NVIDIA Yotta-Wan | 1x H200, 18 vCPU | 93s (no replace) | 287s | 380s | 77 | 153 | 4 (LoRA) |
| AMD Yotta-Wan (with LoRA) | 1x MI300X, 20 vCPU | 33s (no replace) | 215s | 248s | 77 | 153 | 4 (LoRA) |
| AMD Yotta-Wan (without LoRA) | 1x MI300X, 20 vCPU | 33s (no replace) | 220s | 253s | 77 | 153 | 4 (no LoRA) |
To clarify, no replace means we set replace_flag=False and retarget_flag=False during preprocessing. More specifically, it refers to extracting only background and pose videos without extra processing (e.g., mask, FLUX).
- VRAM capacity advantage: We observe that the AMD system has much shorter latency (33s vs. 94s) than the NVIDIA system in this evaluation. This benefit comes from the larger VRAM of the AMD MI300X (192 GB): the larger memory capacity can hold a larger $f$, thereby reducing the segment count and the total video-generation time.
3.1.3 Acceleration Effects with SGLang
To further validate our solution, we evaluate our method with SGLang.
Table 3: Evaluation with SGLang
| Version | Hardware | Preprocess | Inference | Total | clip_len ($f$) | Frames ($F$) | Denoising Steps |
|---|---|---|---|---|---|---|---|
| Vanilla | 1x H200, 18 vCPU | 224s | 472s | 696s | 77 | 155 | 20 |
| Vanilla | 1x H200, 18 vCPU | 220s | 297s | 517s | 77 | 229 | 4 (no LoRA) |
| Yotta-Wan | 1x H200, 18 vCPU | 88s | 209s | 297s | 99 | 197 | 20 |
| Yotta-Wan | 1x H200, 18 vCPU | 90s | 159s | 249s | 99 | 197 | 4 (no LoRA) |
| Yotta-Wan | 1x H200, 18 vCPU | 90s | 163s | 253s | 99 | 197 | 4 (LoRA) |
What SGLang Brings: A 2.3x Speedup for Wan

Integrating SGLang into the Yotta-Wan pipeline delivers a major leap in serving efficiency. Evaluating on a single H200 GPU, we have the following observations:
- Massive End-to-End Speedup: For a standard 20-denoising-step generation, SGLang slashes the total latency by 57% (from 696s down to just 297s). This translates to a ~2.3x overall speedup, drastically improving generation throughput.
- Optimized Across the Board: The acceleration applies to the entire pipeline. Preprocessing time drops by over 60% (224s to 88s), while the core inference latency is cut by more than half (472s to 209s).
- Near Zero-Cost LoRA Serving: SGLang handles dynamic weights exceptionally well. Running a 4-denoising-step generation with a LoRA adapter adds a negligible 4 seconds of total overhead (253s with LoRA vs. 249s without), making it highly practical for customized deployments.
3.1.4 Discussion and Parameter Sensitivity
We analyze the impact of $f$ (clip_len). We reduce it from 77 to 21 on a single H200:
Table 4: Sensitivity study on $f$ (clip_len)
| Inference Time | Hardware | clip_len ($f$) |
|---|---|---|
| 448s | 1x H200, 18 vCPU | 77 |
| 383s (↓14.5%) | 1x H200, 18 vCPU | 41 |
| 574s | 1x H200, 18 vCPU | 21 |
- Inference: The shortest inference time is 383s, a 14.5% reduction compared with the inference time at $f = 77$.
- Diminishing returns: Reducing $f$ too much (e.g., to 21) increases the inference time.
- Conclusion: $f$ is not "the smaller, the better." Excessive reduction leads to too many segments, redundant overlap computation, and increased kernel/scheduling overhead, which extends the inference time.
Further Video Length Evaluation: We evaluate performance using three methods: (1) the original fixed `clip_len` (`vanilla_set`); (2) adaptive segmentation (`auto_set`); and (3) adaptive segmentation with the VRAM-aware limit enabled (`auto_set_lim`).
Table 5: Video 1 Performance
| Method | Hardware | Preprocess | Inference | clip_len ($f$) | Frames (input/generated) |
|---|---|---|---|---|---|
| vanilla_set | 1x H200, 18 vCPU | 143s | 726s | 77 | 308/381 |
| auto_set | 1x H200, 18 vCPU | 140s | 626s | 153 | 308/381 |
| auto_set_lim | 1x H200, 18 vCPU | 140s | 556s | 97 | 308/289 |
| vanilla_set | 2x H200, 36 vCPU | 140s | 325s | 77 | 308/381 |
| auto_set | 2x H200, 36 vCPU | 141s | 339s | 153 | 308/305 |
| auto_set_lim | 2x H200, 36 vCPU | 140s | 268s | 102 | 308/289 |
Table 6: Video 2 Performance
| Method | Hardware | Preprocess | Inference | clip_len ($f$) | Frames (input/generated) |
|---|---|---|---|---|---|
| vanilla_set | 1x H200, 18 vCPU | 41s | 142s | 77 | 60/77 |
| auto_set_lim | 1x H200, 18 vCPU | 41s | 109s | 57 | 60/57 |
| vanilla_set | 2x H200, 36 vCPU | 41s | 68s | 77 | 60/77 |
| auto_set | 2x H200, 36 vCPU | 40s | 44s | 29 | 60/57 |
| auto_set_lim | 2x H200, 36 vCPU | 40s | 48s | 57 | 60/57 |
The above results confirm that $f$ should be neither too large nor too small, in line with the earlier discussion.
4. Conclusions
Through memory-aware dynamic segmentation and thread-level parallelism, we reduce the end-to-end generation time of the Wan model by over 30%. Powered by high-end accelerators (H200/MI300X), our work takes one step further toward high-quality video generation for real-time interaction.
Discussion 1: Adaptive VRAM-Aware Segmentation
In traditional long-video generation, fixed stride and window sizes are commonly used. While simple, this leads to tail padding (invalid computation) or context loss when processing non-divisible frame counts. We refer to this as a Constrained Discrete Optimization Problem:
- Mathematical Modeling: Find a parameter set $(f, o, n)$ such that the total number of covered frames strictly equals the user input $F$, while satisfying VRAM constraints.
- Core Constraint: $f + (n - 1)(f - o) = F$.
- Variables:
  - $f$: window length of a single inference pass (constrained by VRAM capacity).
  - $o$: minimum number of overlapping frames required for temporal continuity (Temporal Overlap Constraint).
  - $n$: non-negative integer representing the number of sliding windows.
- Effect: Maximizes computational density and eliminates pipeline bubbles to keep Tensor Cores saturated.
Technical Consideration: The influence of $o$ on temporal consistency is critical. Specifically, how the latent representations in the overlap region are fused is a key factor for performance.
Discussion 2: Bandwidth-Centric Hardware Benchmarking
When comparing NVIDIA H200 and AMD MI300X, memory bandwidth is often the deciding factor rather than just TFLOPS, especially for large model tasks like video generation.
- H200 HBM3e Advantage: With 141 GB of HBM3e and 4.8 TB/s of bandwidth, the H200 alleviates the memory wall faced by Wan.
- FP8 Quantization: The H200's native FP8 support (4th-gen Tensor Cores) allows halving the KV-cache memory footprint. This enables a larger $f$ under the same VRAM budget, directly increasing throughput.