March 3, 2025 by Da Li
From 11 Minutes to 4 Minutes: End-to-End Acceleration for Wan Video Generation on NVIDIA H200 vs. AMD MI300X
1. Introduction
Wan is a diffusion-based generative model for high-quality video generation, producing detailed and temporally consistent outputs via iterative denoising. It showcases strong performance in visual generation tasks, particularly in producing consistent motion and style across frames.
The Wan model faces a major bottleneck in generation speed—often exceeding 10 minutes—which severely constrains its use in production environments. The goal of our study is to reduce end-to-end latency to the minute level through both algorithmic optimization (e.g., adaptive frame segmentation) and system-level enhancement (e.g., parallelism management). This article presents our efforts to improve each stage of the Wan video generation pipeline. Furthermore, we provide a performance study across NVIDIA H200 and AMD Instinct MI300X (two flagship accelerators), giving performance analysis from an architecture perspective.
Key takeaways
- End-to-end latency reduced by 34–40% on H200 with full-pipeline optimizations.
- Adaptive VRAM-aware clip scheduling reduces tail padding and improves GPU utilization.
- CPU preprocessing becomes throughput-oriented via thread-level parallelism (heavy C/C++ kernels release the GIL).
- With SGLang, serving latency improves by ~2.3× on a single H200 for 20 denoising steps.
The generation process in Wan can be summarized as a two-stage pipeline:
- Stage 1: Preprocessing. The input video and reference image are processed to extract keypoints, set up bounding boxes, obtain alignment information, etc. This stage may use components such as YOLO and ViTPose.
- Stage 2: Inference. Using the preprocessed input from the first stage, the diffusion/generation backbone (e.g., DiT) generates frames segment by segment using a sliding-window strategy, which employs segment concatenation and overlapping (discussed below).
The preprocessing and inference run on CPU and GPU, respectively, forming an execution pipeline. Hence, the end-to-end latency of Wan can be formulated as:

$$T_{\text{e2e}} = T_{\text{pre}} + T_{\text{infer}}$$

where $T_{\text{pre}}$ and $T_{\text{infer}}$ are the preprocessing and inference latencies, respectively.
2. Methods
We describe our methods in this section.
2.1 Adaptive Frame Segmentation
2.1.1 Problem Modeling
The second stage in Wan utilizes a sliding-window strategy: Wan generates long images and videos by working on small overlapping windows instead of processing the entire canvas or timeline at once. It splits a large image (spatially) or a long video (temporally) into overlapping segments (or windows). A segment (also referred to as a clip/window) is the basic temporal unit used in long-video diffusion inference. Instead of generating all frames at once, the model processes the video in multiple overlapping segments of length $f$. The first window is generated normally. Each subsequent window is generated conditioned on the previously created region, using cached features and cross-attention to maintain continuity. The overlapping areas are blended so there are no seams or flicker. This process repeats until the full image or video is complete.
Assume that $F$ and $F_{\text{in}}$ are the number of frames to generate and the number of real frames in the input video, respectively. $F$ is often larger than $F_{\text{in}}$, resulting in computational redundancy due to concatenation-divisibility and stride constraints. We formulate the relationship between $F$ and $F_{\text{in}}$ as follows:

$$F = f + (n - 1)(f - o) \geq F_{\text{in}}$$

where $f$ and $o$ are the length of a single clip and the overlap between adjacent clips, respectively, defined in terms of the number of frames, and $n$ is the number of clips.
Problems: When $F_{\text{in}}$ cannot be perfectly divided by the stride ($s = f - o$), the input sequence is padded to the nearest divisible length, causing the generation of useless frames and a linear increase in computational overhead.
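To make this padding overhead concrete, the following is a minimal sketch (the helper name and the example numbers are illustrative, not the production code) of how a fixed-stride plan pads the tail of a frame sequence:

```python
import math

def fixed_stride_plan(num_input_frames: int, clip_len: int, overlap: int):
    """Fixed-stride segmentation: count the windows needed to cover the
    input, then measure how many padded (wasted) frames the tail adds."""
    stride = clip_len - overlap
    # Number of windows required to cover all input frames.
    num_clips = max(1, math.ceil((num_input_frames - clip_len) / stride) + 1)
    # Frames actually generated once the tail is padded to a full window.
    generated = clip_len + (num_clips - 1) * stride
    return num_clips, generated, generated - num_input_frames

# Example: 308 input frames with clip_len=77 and overlap=4.
clips, generated, wasted = fixed_stride_plan(308, 77, 4)
```

A short input also pays this tax: 60 input frames with `clip_len=77` still force one full 77-frame window, wasting 17 frames of computation.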
In fact, in long-video generation tasks, segmenting long sequences into clips that the diffusion model can process is challenging. Traditional methods often use a fixed stride for segmentation. This strategy does not consider GPU memory capacity, and can waste edge frames when the overlap is too large.
2.1.2 Solution
To address the above problems, we propose a new strategy respecting the constraint of GPU memory capacity.
Let $F$ be the new total number of target frames and $n$ be the number of segments. To ensure temporal continuity and maximize coverage, we establish the following constraint equation:

$$f + (n - 1)(f - o) = F$$

We must perform diffusion-model inference $n$ times. In practice, we limit the upper bound of $f$ by the total number of frames in the input video.
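A minimal sketch of this constrained search, assuming a fixed overlap and a VRAM-derived upper bound on the clip length (the function and parameter names are illustrative, not the actual implementation):

```python
def choose_clip_len(total_frames: int, overlap: int, max_clip_len: int):
    """Search for (clip_len, num_segments) with exact coverage:
    clip_len + (num_segments - 1) * (clip_len - overlap) == total_frames.
    Prefer the largest clip_len that fits in VRAM (fewest inference passes)."""
    for f in range(max_clip_len, overlap, -1):   # try the largest f first
        stride = f - overlap
        if (total_frames - f) % stride == 0:     # exact coverage, no padding
            n = (total_frames - f) // stride + 1
            return f, n
    return None

# Example with hypothetical numbers: cover 197 frames with overlap 1,
# assuming VRAM allows at most 100 frames per window.
best = choose_clip_len(197, 1, 100)
```

With these inputs the search settles on a 99-frame window covered in two passes, since 99 + (2 − 1) × 98 = 197 exactly.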
2.1.3 In-Depth Analysis
We implement the above strategy and add a knob `auto_set_lim` that allows users to enable it. When enabled, the system solves the optimization problem to find the optimal $f$ and $n$ for a given $F$. Our method has the following benefits.
- Maximization of VRAM utilization: By adaptively changing $f$ to maximize the utilization of GPU memory, we avoid the padding required by traditional methods and hence avoid wasting compute; the GPU's tensor cores also remain highly utilized at each time step.
- Temporal consistency: Our method employs a constant overlap $o$ across segments, keeping the video context smooth during frame concatenation.
2.2 Removing Dependency for Thread-Level Parallelism
2.2.1 Problem Modeling: Performance Bottleneck in Preprocessing
In the preprocessing stage (e.g., 2D pose estimation and frame extraction), there are dependencies between the processing of consecutive frames. For a video sequence containing $N$ frames, the preprocessing latency is a cumulative sum of single-frame processing times:

$$T_{\text{pre}} = \sum_{i=1}^{N} t_i$$

where $t_i$ is the processing time of frame $i$.
During preprocessing, each frame is processed by one CPU core. Although we can parallelize preprocessing by using multiple CPU cores, thread management overhead can offset the performance gain. Hence, preprocessing on CPU is slow, leading to pipeline bubbles on GPU.
2.2.2 Solution: Parallel Preprocessing
To address this problem, we relax the dependencies between frame-preprocessing tasks so that multiple frames can be processed at the same time on multiple cores. In particular, we create a thread pool whose size equals the number of CPU cores in the server. Each CPU core is assigned one thread, and each thread preprocesses one frame. Parallel frame preprocessing is possible because the workload is typically implemented in C/C++ libraries (e.g., OpenCV) and is therefore not constrained by the Python Global Interpreter Lock (GIL).
With the above solution, the preprocessing latency is formulated as follows:

$$T_{\text{pre}} = \left\lceil \frac{N}{C} \right\rceil \cdot \max_i t_i + t_{\text{mgmt}}$$

where $t_{\text{mgmt}}$ is the thread-management overhead, and $N$ and $C$ denote the number of frames and the number of CPU cores, respectively.
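The thread-pool scheme can be sketched as follows. Here `preprocess_frame` is a stand-in for the real per-frame work (the actual pipeline runs components such as pose estimation), and the NumPy call merely illustrates a C kernel that releases the GIL so threads can run in parallel:

```python
import concurrent.futures
import os
import numpy as np

def preprocess_frame(frame: np.ndarray) -> np.ndarray:
    # Stand-in for real per-frame work (pose estimation, resizing, ...).
    # NumPy executes in C and releases the GIL, so threads run in parallel.
    return (frame.astype(np.float32) / 255.0).mean(axis=-1)

def preprocess_video(frames):
    # One thread per CPU core; each thread handles one frame at a time.
    workers = os.cpu_count() or 1
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(preprocess_frame, frames))

frames = [np.zeros((720, 1280, 3), dtype=np.uint8) for _ in range(8)]
outputs = preprocess_video(frames)
```

`pool.map` preserves input order, so downstream stages can consume the results as an ordinary frame sequence.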
The above CPU preprocessing strategy also works with multiple GPUs, where the GPUs may use FSDP and Ulysses parallelism to accelerate Wan. We evaluate the feasibility of supporting multiple GPUs.
2.2.3 In-Depth Analysis
- No limitation from the GIL. Although the GIL is notorious for limiting multi-thread performance on CPU, our profiling shows that the heavy tasks in preprocessing (e.g., `pose2d` inference and image transformations) rely heavily on underlying C/C++ kernels (such as NumPy and PyTorch operations). These operations release the GIL during execution, allowing our thread pool to make full use of the CPU cores.
- Thread-level parallelism. By preprocessing frames concurrently, we leverage thread-level parallelism to improve the overall throughput of preprocessing. This not only maximizes the utilization of memory bandwidth, but also shifts our optimization from latency-oriented to throughput-oriented. Since preprocessing is the performance bottleneck of the two-stage pipeline, reducing preprocessing time speeds up the whole video-generation pipeline.
3. Evaluation
- Baseline: Vanilla Wan (with/without SGLang).
- Hardware A: GPU: NVIDIA H200 (1 or 2 cards, 141 GB HBM3e each); CPU: Intel(R) Xeon(R) Platinum 8460Y+ (18 or 36 cores).
- Hardware B: GPU: AMD Instinct MI300X (1 card, 192 GB HBM3); CPU: Intel(R) Xeon(R) Platinum 8568Y+ (20 cores).
- Software: We report denoising steps, `clip_len` ($f$), and key preprocessing flags (e.g., `replace_flag`, `retarget_flag`) inline in the tables for reproducibility.
We refer to the evaluation results using our optimization techniques as Yotta-Wan in the rest of this section.
3.1 Results
3.1.1 End-to-End Acceleration
Table 1 shows the results collected on H200.
Table 1: Evaluation results on H200
| Version | Hardware | FPS | clip_len ($f$) | Preprocess (s) | Inference (s) | Total (s) |
|---|---|---|---|---|---|---|
| Vanilla | 2x H200, 36 vCPU | 30 | - | 229 | 129 | 325 |
| Yotta-Wan | 2x H200, 36 vCPU | 30 | 153 | 79 | 133 | 212 (↓34.8%) |
| Vanilla | 1x H200, 18 vCPU | 30 | - | 229 | 232 | 684 |
| Yotta-Wan | 1x H200, 18 vCPU | 30 | 153 | 120 | 290 | 410 (↓40.1%) |
- Preprocessing stage: The time is reduced from 229s to 79s (↓65.5%), primarily due to thread-level parallelism for preprocessing and data loading.
- Inference stage: The time is reduced from 129s to 79s (↓38.7%).
- Total latency: Overall, the time is reduced by at least 34%.
3.1.2 NVIDIA H200 vs. AMD MI300X
We use LoRA (lightx2v/Wan2.1-I2V-14B-720P-StepDistill-CfgDistill-Lightx2v) to improve performance at 4 denoising steps. LoRA lets the model adapt to tasks without costly retraining, offering a "plug-and-play" solution for diverse application scenarios. To quantify the practicality of this approach, we also measure the latency incurred when loading the LoRA adapters. In all tables below, Denoising Steps denotes the number of iterative denoising steps used during diffusion inference, and clip_len corresponds to $f$.
Table 2: Evaluation results on H200 and MI300X
| Setup | Hardware | Preprocess | Inference | Total | clip_len ($f$) | Frames ($F$) | Denoising Steps |
|---|---|---|---|---|---|---|---|
| NVIDIA Vanilla | 1x H200, 18 vCPU | 94s (no replace) | 181s | 275s | 77 | 153 | 4 (no LoRA) |
| NVIDIA Yotta-Wan | 1x H200, 18 vCPU | 93s (no replace) | 287s | 380s | 77 | 153 | 4 (LoRA) |
| AMD Yotta-Wan (with LoRA) | 1x MI300X, 20 vCPU | 33s (no replace) | 215s | 248s | 77 | 153 | 4 (LoRA) |
| AMD Yotta-Wan (without LoRA) | 1x MI300X, 20 vCPU | 33s (no replace) | 220s | 253s | 77 | 153 | 4 (no LoRA) |
To clarify, no replace means we set replace_flag=False and retarget_flag=False during preprocessing. More specifically, it refers to extracting only background and pose videos without extra processing (e.g., mask, FLUX).
- VRAM capacity advantage: We observe that the AMD system has much shorter latency (33s vs. 94s) than the NVIDIA system in this evaluation. This benefit comes from the larger VRAM of the AMD MI300X (192 GB): the larger memory capacity can hold a larger $f$, thereby reducing the segment count and the total video-generation time.
3.1.3 Acceleration Effects with SGLang
To further validate our solution, we evaluate our method with SGLang.
Table 3: Evaluation with SGLang
| Version | Hardware | Preprocess | Inference | Total | clip_len ($f$) | Frames ($F$) | Denoising Steps |
|---|---|---|---|---|---|---|---|
| Vanilla | 1x H200, 18 vCPU | 224s | 472s | 696s | 77 | 155 | 20 |
| Vanilla | 1x H200, 18 vCPU | 220s | 297s | 517s | 77 | 229 | 4 (no LoRA) |
| Yotta-Wan | 1x H200, 18 vCPU | 88s | 209s | 297s | 99 | 197 | 20 |
| Yotta-Wan | 1x H200, 18 vCPU | 90s | 159s | 249s | 99 | 197 | 4 (no LoRA) |
| Yotta-Wan | 1x H200, 18 vCPU | 90s | 163s | 253s | 99 | 197 | 4 (LoRA) |
What SGLang Brings: A 2.3x Speedup for Wan

Integrating SGLang into the Yotta-Wan pipeline delivers a major leap in serving efficiency. Evaluating on a single H200 GPU, we have the following observations:
- Massive End-to-End Speedup: For a standard 20-denoising-step generation, SGLang slashes the total latency by 57% (from 696s down to just 297s). This translates to a ~2.3x overall speedup, drastically improving generation throughput.
- Optimized Across the Board: The acceleration applies to the entire pipeline. Preprocessing time drops by over 60% (224s to 88s), while the core inference latency is cut by more than half (472s to 209s).
- Near Zero-Cost LoRA Serving: SGLang handles dynamic weights exceptionally well. Running a 4-denoising-step generation with a LoRA adapter adds a negligible 4 seconds of total overhead (253s with LoRA vs. 249s without), making it highly practical for customized deployments.
3.1.4 Discussion and Parameter Sensitivity
We analyze the impact of $f$ (clip_len). We reduce it from 77 to 21 on a single H200:
Table 4: Sensitivity study on $f$ (clip_len)
| Inference Time | Hardware | clip_len ($f$) |
|---|---|---|
| 448s | 1x H200, 18 vCPU | 77 |
| 383s (↓14.5%) | 1x H200, 18 vCPU | 41 |
| 574s | 1x H200, 18 vCPU | 21 |
- Inference: The shortest inference time is 383s, a 14.5% reduction compared with the inference time at $f = 77$.
- Diminishing returns: Reducing $f$ too much (e.g., to 21) increases the inference time.
- Conclusion: $f$ is not "the smaller, the better." Excessive reduction leads to too many segments, redundant overlap computation, and increased kernel/scheduling overhead, which extends the inference time.
Further Video Length Evaluation: We evaluate performance using three methods: (1) the original fixed `clip_len` (`vanilla_set`); (2) adaptive segmentation (`auto_set`); and (3) adaptive segmentation with the VRAM-aware limit enabled (`auto_set_lim`).
Table 5: Video 1 Performance
| Method | Hardware | Preprocess | Inference | clip_len ($f$) | Frames (input/generated) |
|---|---|---|---|---|---|
| vanilla_set | 1x H200, 18 vCPU | 143s | 726s | 77 | 308/381 |
| auto_set | 1x H200, 18 vCPU | 140s | 626s | 153 | 308/381 |
| auto_set_lim | 1x H200, 18 vCPU | 140s | 556s | 97 | 308/289 |
| vanilla_set | 2x H200, 36 vCPU | 140s | 325s | 77 | 308/381 |
| auto_set | 2x H200, 36 vCPU | 141s | 339s | 153 | 308/305 |
| auto_set_lim | 2x H200, 36 vCPU | 140s | 268s | 102 | 308/289 |
Table 6: Video 2 Performance
| Method | Hardware | Preprocess | Inference | clip_len ($f$) | Frames (input/generated) |
|---|---|---|---|---|---|
| vanilla_set | 1x H200, 18 vCPU | 41s | 142s | 77 | 60/77 |
| auto_set_lim | 1x H200, 18 vCPU | 41s | 109s | 57 | 60/57 |
| vanilla_set | 2x H200, 36 vCPU | 41s | 68s | 77 | 60/77 |
| auto_set | 2x H200, 36 vCPU | 40s | 44s | 29 | 60/57 |
| auto_set_lim | 2x H200, 36 vCPU | 40s | 48s | 57 | 60/57 |
The above results confirm that $f$ should be neither too large nor too small, in line with the earlier discussion.
4. Conclusions
Through memory-aware dynamic segmentation and thread-level parallelism, we reduce the end-to-end generation time of the Wan model by over 30%. Powered by high-end accelerators (H200/MI300X), our work takes one step further toward high-quality video generation for real-time interaction.
Discussion 1: Adaptive VRAM-Aware Segmentation
In traditional long-video generation, fixed stride and window sizes are commonly used. While simple, this leads to tail padding (invalid computation) or context loss when processing non-divisible frame counts. We refer to this as a Constrained Discrete Optimization Problem:
- Mathematical Modeling: Find a parameter set $(f, o, n)$ such that the total number of covered frames strictly equals the user input $F$, while satisfying VRAM constraints.
- Core Constraint: $f + (n - 1)(f - o) = F$.
- Variables:
  - $f$: window length of a single inference pass (constrained by VRAM capacity).
  - $o$: minimum number of overlapping frames required for temporal continuity (Temporal Overlap Constraint).
  - $n$: non-negative integer representing the number of sliding windows.
- Effect: Maximizes computational density and eliminates pipeline bubbles to keep Tensor Cores saturated.
Technical Consideration: The influence of $o$ on temporal consistency is critical. Specifically, how the latent representations in the overlap region are fused is a key factor for performance.
Discussion 2: Bandwidth-Centric Hardware Benchmarking
When comparing NVIDIA H200 and AMD MI300X, memory bandwidth is often the deciding factor rather than just TFLOPS, especially for large model tasks like video generation.
- H200 HBM3e Advantage: With 141 GB of HBM3e and 4.8 TB/s of bandwidth, the H200 alleviates the memory wall faced by Wan.
- FP8 Quantization: The H200's native FP8 support (4th-gen Tensor Cores) allows halving the KV-cache memory footprint. This enables a larger $f$ under the same VRAM budget, directly increasing throughput.