Kernel-level inference optimization across
NVIDIA, AMD, and AWS Trainium. Backed by published research,
open-source tools, and production deployments — no hardware lock-in.
2.3×
SGLang Speedup on NVIDIA H200
4.3×
Inference Throughput on AMD GPUs (RL Rollout)
2.09×
NeuronMM Speedup on AWS Trainium

Multi-Silicon Optimization
One Team. Every Chip.
Most providers optimize for one architecture. Yotta has published peer-reviewed research and production-grade kernels across all three major AI silicon platforms.
H200 / H100 / A100
Industry-Standard, Fully Optimized
Yotta's inference stack is battle-tested on NVIDIA's flagship GPUs. From H100 to H200, we deliver maximum throughput through SGLang integration, FP8 quantization, and adaptive VRAM-aware scheduling.
SGLang runtime integration for 2.3× end-to-end speedup
Adaptive frame segmentation maximizing Tensor Core utilization
Near zero-cost LoRA serving (+4s overhead)
FP8 quantization support on H200 4th Gen Tensor Cores
Read: Wan2.x Video Generation (NVIDIA vs. AMD)
2.3×
Wan2.x Video Speedup (SGLang)
65.5%
Preprocessing Reduction
40.1%
Total Latency Reduction

Pioneering Work
SGLang Meets AWS Trainium
Yotta Labs pioneered the integration of SGLang, the leading LLM serving framework, with AWS Trainium hardware. The integration unlocks Trainium's cost efficiency (up to 30–40% better price-performance than NVIDIA H100 EC2 instances, according to AWS's published benchmarks) without sacrificing the developer experience of the SGLang ecosystem.
2.3× End-to-End Speedup
SGLang on H200: 696s → 297s for 20-step video generation
Near Zero-Cost LoRA Serving
+4s overhead to load adapters; near-zero throughput hit on long prompts
NeuronMM Open Source
Custom Trainium matmul kernel available at github.com/PASAUCMerced/NeuronMM

Technical Research
Research That Ships to Production
Our team publishes technical reports on every major optimization. Read
the research, then deploy it on Yotta's platform.
Research

Mar 16, 2026
From 11 Minutes to 4 Minutes: End-to-End Acceleration for Wan Video Generation on NVIDIA H200 vs. AMD MI300X
1. Introduction
Wan is a diffusion-based generative model for high-quality video generation, producing detailed and temporally consistent outputs via iterative denoising. It performs strongly on visual generation tasks, particularly in producing consistent motion and style across frames. Its major bottleneck is generation speed, which often exceeds 10 minutes and severely constrains its use in production environments. The goal of our study is to reduce end-to-end latency to the minute level through both algorithmic optimization (e.g., adaptive frame segmentation) and system-level enhancement (e.g., parallelism management). This article presents our efforts to improve each stage of the Wan video generation pipeline. We also provide a performance study across two flagship accelerators, NVIDIA H200 and AMD Instinct MI300X, with analysis from an architecture perspective.
Key takeaways
End-to-end latency reduced by 34–40% on H200 with full-pipeline optimizations.
Adaptive VRAM-aware clip scheduling reduces tail padding and improves GPU utilization.
With SGLang, serving latency improves by ~2.3× on a single H200 for 20 denoising steps.
The generation process in Wan can be summarized as a two-stage pipeline:
Stage 1: Preprocessing. The input video and reference image are processed to extract keypoints, set up bounding boxes, obtain alignment information, and so on. This stage may use components like YOLO and ViTPose.
Stage 2: Inference. Using the preprocessed input from the first stage, the diffusion/generation backbone (e.g., DiT) generates frames segment by segment using a sliding-window strategy. This strategy employs segment concatenation and an overlapping method (discussed below).
Preprocessing and inference run on CPU and GPU, respectively, creating an execution pipeline. Hence, the end-to-end latency of Wan can be formulated as:
2. Methods
We describe our methods in this section.
2.1 Adaptive Frame Segmentation
2.1.1 Problem Modeling
In long video generation tasks, segmenting long sequences into clips that the diffusion model can process is challenging. Traditional methods often use a fixed stride for segmentation. This strategy does not consider GPU memory capacity and can waste edge frames when the overlap is too large.
2.1.2 Solution
To address the above problems, we propose a new strategy that respects the constraint of GPU memory capacity.
2.1.3 In-Depth Analysis
2.2 Removing Dependency for Thread-Level Parallelism
2.2.1 Problem Modeling: Performance Bottleneck in Preprocessing
During preprocessing, each frame is processed by one CPU core. Although we can parallelize preprocessing by using multiple CPU cores, thread management overhead can offset the performance gain. Hence, preprocessing on CPU is slow, leading to pipeline bubbles on GPU.
2.2.2 Solution: Parallel Preprocessing
To address the above problem, we relax dependencies between frame-preprocessing tasks so that we can process multiple frames at the same time with multiple cores. In particular, we create a thread pool whose size equals the number of CPU cores in the server. Each CPU core is assigned one thread, and each thread preprocesses one frame. Parallel frame preprocessing is possible because the workload is typically implemented by C/C++ libraries (e.g., OpenCV) and is therefore not constrained by the Python Global Interpreter Lock (GIL). The same CPU preprocessing strategy also works with multiple GPUs, which may use FSDP and Ulysses parallelism to accelerate Wan. We evaluate the feasibility of supporting multiple GPUs.
2.2.3 In-Depth Analysis
No Limitation of GIL.
Although the GIL is notorious for limiting multi-threaded performance on CPU, our profiling shows that the heavy tasks in preprocessing (e.g., pose2d inference and image transformations) rely heavily on underlying C/C++ kernels (such as NumPy and PyTorch operations). These operations release the GIL while they run, allowing our thread pool to make the best use of CPU cores.
Thread-level parallelism.
By concurrently preprocessing frames, we leverage thread-level parallelism to improve the overall throughput of preprocessing. This not only maximizes memory bandwidth utilization, but also shifts our optimization from latency-oriented to throughput-oriented. Since preprocessing is the performance bottleneck of the two-stage pipeline, reducing preprocessing time speeds up the whole video generation pipeline.
3. Evaluation
We refer to the evaluation results using our optimization techniques as Yotta-Wan in the rest of this section.
3.1 Results
3.1.1 End-to-End Acceleration
Table 1 shows the results collected on H200.
3.1.2 NVIDIA H200 vs. AMD MI300X
3.1.3 Acceleration Effects with SGLang
To further validate our solution, we evaluate our method with SGLang.
3.1.4 Discussion and Parameter Sensitivity
4. Conclusions
Through memory-aware dynamic segmentation and thread-level parallelism management, we reduce the inference time of the Wan model by over 30%. Empowered by high-end computing power (H200/MI300X), our efforts take one step further toward high-quality video generation for real-time interaction.
Discussion 1: Adaptive VRAM-Aware Segmentation
Discussion 2: Bandwidth-Centric Hardware Benchmarking
When comparing NVIDIA H200 and AMD MI300X, memory bandwidth is often the deciding factor rather than just TFLOPS, especially for large model tasks like video generation.
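The thread-pool preprocessing described in Section 2.2.2 can be sketched as follows. This is a minimal illustration, not the Yotta-Wan implementation: `preprocess_frame` and `preprocess_frames` are hypothetical names, and the per-frame work is a pure-Python stand-in. A real pipeline would call GIL-releasing C/C++ libraries such as OpenCV at that point, which is what lets the threads run in parallel.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def preprocess_frame(frame):
    """Hypothetical stand-in for per-frame work (pose estimation, cropping, resizing).

    This toy version inverts 8-bit pixel values in pure Python, so it does NOT
    release the GIL; it only demonstrates the control flow.
    """
    return [255 - px for px in frame]


def preprocess_frames(frames, max_workers=None):
    """Preprocess independent frames concurrently, preserving input order."""
    # Pool size defaults to the number of CPU cores, as described in the text.
    workers = max_workers or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in submission order, so frame order is kept.
        return list(pool.map(preprocess_frame, frames))
```

Calling `preprocess_frames(frames)` returns one result per input frame in the original order, so the downstream GPU stage can consume clips without re-sorting.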
Nvidia GPU
AMD GPU
Research

Feb 06, 2026
Orchestrating AI Across Multi-Silicon, Multi-Cloud, and Heterogeneous Clusters
Modern AI workloads are no longer confined to a single GPU or cloud. Today’s models are trained and deployed across NVIDIA H100s, AMD MI300s, Google TPUs, AWS Trainium, and emerging accelerators, running across AWS, GCP, Azure, and private data centers. This heterogeneity offers unprecedented potential for performance and cost efficiency, but it also introduces a new class of operational complexity that traditional cloud orchestration platforms were never designed to handle. Without a unified systems layer, teams face fragmented tooling, inefficient utilization, and fragile production pipelines.
At Yotta Labs, we are solving the dual challenge of multi-silicon and multi-cloud orchestration. Our mission: to provide a unified AI operating system that abstracts hardware and cloud differences while maximizing efficiency, scalability, and reliability.
The New Reality: Heterogeneous AI at Scale
Distributing AI workloads across multi-silicon and multi-cloud environments presents several challenges:
Diverse runtimes and APIs: Each accelerator has unique kernel implementations, memory hierarchies, and interconnects.
Fragmented resources: Idle GPUs or underutilized cloud instances inflate costs.
Non-portable optimizations: Operator fusion or kernel tuning optimized for one accelerator rarely works on another.
Distributed synchronization overhead: Gradient updates, pipeline parallelism, and sharded datasets across clouds introduce latency and bandwidth constraints.
Without a unified orchestration layer, engineering teams spend more time managing infrastructure than improving models.
How Yotta Labs Orchestrates Multi-Silicon, Multi-Cloud Workloads
The Yotta Labs AI OS abstracts complexity while delivering peak performance. It provides cluster-level scheduling, device-aware memory management, and cross-cloud optimization, all through unified APIs. Here’s how it works in practice:
1. Intelligent Workload Scheduling
Yotta continuously profiles workloads across:
Compute intensity
Memory requirements
Communication patterns
Network latency
Using this telemetry, the scheduler dynamically maps tasks to optimal devices and regions.
Example
A hybrid model with attention-heavy layers and large convolutional layers is split across NVIDIA H100s for matrix-heavy operations and AMD MI300s for tensor-intensive operations, maximizing throughput. The scheduler continuously monitors GPU utilization and latency, moving tasks between devices and clouds to avoid bottlenecks.
Key capabilities:
Hardware-aware placement
Cross-region load balancing
Utilization-driven reallocation
2. Memory-Aware Optimization
Memory fragmentation and limited GPU RAM are major constraints for large-scale models. Yotta implements device-aware memory management:
Hierarchical allocation
Dynamic tensor placement
Intelligent offloading
Cross-device memory pooling
Example
During multi-cloud distributed training, embeddings and intermediate tensors are offloaded to high-bandwidth caching nodes when local GPU memory approaches saturation, reducing out-of-memory errors. Operator fusion and memory reuse strategies ensure that each accelerator runs at maximum memory efficiency, even when executing heterogeneous workloads.
3. Throughput Balancing Across Clusters
Throughput is maximized by balancing workloads across GPUs, accelerators, and clouds. Yotta optimizes batch sizes, pipeline depth, device assignments, and communication topology to maintain globally optimal performance.
Example
A hybrid inference workload spreads requests across GCP TPUs and AWS H100 nodes. Latency-sensitive requests are routed to low-latency regions, while batch-processing jobs use idle capacity in lower-cost clouds. The system automatically adjusts batch sizes, device assignments, and communication patterns to maintain steady throughput without overloading any single resource.
4. Cross-Cloud Synchronization and Reliability
Distributed AI systems must tolerate partial failures, network variance, and cloud outages. The OS manages gradient synchronization, pipeline parallelism, and checkpointing across clouds with minimal latency impact.
Gradients are compressed and asynchronously synchronized across regions to reduce network overhead.
Checkpoints are stored redundantly in cloud-agnostic storage to ensure resiliency against outages.
This enables faster recovery, reduced training disruption, and predictable production behavior.
What This Enables for AI Teams: Seamless Multi-Silicon, Multi-Cloud AI
With Yotta Labs, AI teams gain system-level leverage. They can:
Train and deploy models that span GPUs, TPUs, and emerging accelerators without rewriting code.
Optimize memory usage dynamically, scaling model size without increasing cost.
Balance workloads across clouds and accelerators for maximum performance and efficiency.
Adopt new hardware or cloud platforms immediately, without manual profiling and retuning.
From Infrastructure Management to Model Innovation
The next decade of AI will be defined by:
Heterogeneous accelerators
Distributed execution
Elastic scheduling
Continuous hardware evolution
Winning teams will not be those with the most GPUs, but those with the most efficient orchestration layer for multi-silicon, multi-cloud workloads. Yotta Labs is building that layer. By abstracting hardware complexity and embedding optimization into the control plane, we make infrastructure invisible, so teams can focus on building models, products, and breakthroughs. The future of AI is heterogeneous, distributed, and dynamic. Yotta makes it operable.
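The article says gradients are compressed before cross-region synchronization but does not name a method. As one illustration, the sketch below uses top-k sparsification, a common compression technique chosen here as an assumption and not confirmed as what Yotta uses. The function names `topk_compress` and `decompress` are hypothetical, and the asynchronous transport is omitted.

```python
def topk_compress(grad, k):
    """Keep the k largest-magnitude gradient entries.

    Returns (sparse, residual): `sparse` maps index -> value for the kept
    entries (what would be sent over the network), and `residual` holds the
    dropped entries so they can be added to the next step's gradient.
    """
    if k <= 0:
        return {}, list(grad)
    # Indices sorted by descending magnitude; ties keep their original order.
    order = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)
    keep = set(order[:k])
    sparse = {i: grad[i] for i in sorted(keep)}
    residual = [0.0 if i in keep else g for i, g in enumerate(grad)]
    return sparse, residual


def decompress(sparse, size):
    """Rebuild a dense gradient of length `size` from the sparse payload."""
    dense = [0.0] * size
    for i, value in sparse.items():
        dense[i] = value
    return dense
```

Only the `sparse` payload crosses the network; carrying `residual` forward locally is the usual way to avoid permanently discarding small updates.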
Distributed Inference
Decentralized AI
Research

Nov 12, 2025
NeuronMM: High-Performance Matrix Multiplication for LLM Inference on AWS Trainium
Enabling high performance for AI workloads on heterogeneous hardware is one of the major missions at Yotta Labs. Yotta Labs has explored various AI accelerators (such as NVIDIA GPUs, AMD GPUs, and AWS Trainium) to optimize performance and reduce production costs. Recently, our chief scientist Dong Li, leading a team of researchers, made significant breakthroughs in building high-performance matrix multiplication (matmul) for LLM inference on Trainium. Evaluating with nine datasets and four recent LLMs, we show that NeuronMM substantially outperforms the state-of-the-art matmul implemented by AWS on Trainium: at the matmul kernel level, NeuronMM achieves an average 1.35× speedup (up to 2.22×), which translates to an average 1.66× speedup (up to 2.49×) for end-to-end LLM inference. The code is released at https://github.com/PASAUCMerced/NeuronMM.
Featured
AWS Trainium

Why Yotta
Why Yotta for Inference?
The difference between a GPU cloud and an AI infrastructure partner
Kernel-Level Expertise
Our team writes custom kernels, not just wrappers. From NeuronMM on Trainium to fused GEMM kernels on AMD, we optimize at the hardware instruction level.
True Multi-Silicon
Unlike providers locked to NVIDIA, Yotta has published research and production-grade optimizations across NVIDIA, AMD, and AWS Trainium — giving you hardware flexibility without performance compromise.
Research-Backed, Production-Ready
Led by Chief Scientist Dong Li and backed by NSF funding, our optimizations are peer-reviewed, open-sourced, and deployed in real production workloads.
Orchestration-First Philosophy
Performance is determined by orchestration strategy, not just raw hardware. Yotta's DeOS platform routes inference workloads to the optimal silicon based on cost, latency, and availability.
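To make the routing idea concrete, here is a minimal sketch of choosing a backend by cost, latency, and availability. It is an illustration under stated assumptions, not the DeOS implementation: the `Backend` fields, the `pick_backend` function, and the cheapest-within-budget policy are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Backend:
    name: str               # hypothetical backend label
    cost_per_hour: float    # illustrative price, not a real quote
    p50_latency_ms: float   # illustrative median latency
    available: bool


def pick_backend(backends: Sequence[Backend],
                 latency_budget_ms: float) -> Optional[Backend]:
    """Return the cheapest available backend that meets the latency budget."""
    eligible = [b for b in backends
                if b.available and b.p50_latency_ms <= latency_budget_ms]
    if not eligible:
        return None
    # Cheapest first; break cost ties by lower latency.
    return min(eligible, key=lambda b: (b.cost_per_hour, b.p50_latency_ms))
```

With a tight latency budget the policy falls back to faster, pricier hardware; with a looser budget it picks the cheapest option that still qualifies, and it returns `None` when nothing fits.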
Bring us your model, your hardware, and your latency budget.
We'll show you a deployment that hits it — and what it costs.