
FEATURED POSTS
News

Jan 16, 2026
Yotta Labs Welcomes Jack Dongarra: A Signal for the Next Era of AI Infrastructure
Dr. Jack Dongarra, 2021 ACM A.M. Turing Award recipient and architect of modern performance benchmarking, has joined Yotta Labs as a Technical & Strategic Advisor. As AI infrastructure reaches a new inflection point, Yotta Labs is applying decades of hard-won HPC lessons to build an intelligent orchestration layer for scalable, interoperable GPU systems.
News

Jan 05, 2026
Academic Research Credit Support Program Launch
Artificial intelligence research is advancing at an unprecedented pace — yet access to scalable, reliable compute remains one of the biggest constraints facing researchers today. Across universities, research labs, and independent research communities, ambitious ideas are often slowed by limited GPU availability, high infrastructure costs, and rigid cloud environments not designed for experimentation. Researchers are forced to make tradeoffs: smaller models, fewer experiments, or long wait times for shared resources. At Yotta Labs, we believe infrastructure should enable discovery — not stand in its way. Today, we’re excited to announce the launch of the Yotta Labs Academic Research Support Program, an initiative designed to provide researchers with access to modern, production-grade AI infrastructure, backed by dedicated support and flexible pricing.
Research

Nov 12, 2025
NeuronMM: High-Performance Matrix Multiplication for LLM Inference on AWS Trainium
Enabling high performance for AI workloads on heterogeneous hardware is one of the major missions at Yotta Labs. We have explored various AI accelerators (including NVIDIA GPUs, AMD GPUs, and AWS Trainium) to optimize performance and reduce production costs. Recently, our chief scientist Dong Li, leading a team of researchers, made significant breakthroughs in building high-performance matrix multiplication (matmul) for LLM inference on Trainium. Evaluated on nine datasets and four recent LLMs, NeuronMM largely outperforms the state-of-the-art matmul implemented by AWS on Trainium: at the matmul-kernel level, NeuronMM achieves an average 1.35× speedup (up to 2.22×), which translates to an average 1.66× speedup (up to 2.49×) for end-to-end LLM inference. The code is released at https://github.com/PASAUCMerced/NeuronMM.
Research

Oct 23, 2025
Optimizing Distributed Inference Kernels for the AMD Developer Challenge 2025: All-to-All, GEMM-ReduceScatter, and AllGather-GEMM
This technical report presents our optimization work for the AMD Developer Challenge 2025: Distributed Inference Kernels, where we develop high-performance implementations of three critical distributed GPU kernels for single-node 8× AMD MI300X configurations. We optimize All-to-All communication for Mixture-of-Experts (MoE) models, GEMM-ReduceScatter, and AllGather-GEMM kernels through fine-grained per-token synchronization, kernel fusion, and hardware-aware optimizations that leverage the MI300X's eight-XCD architecture. These optimizations deliver significant performance improvements through communication-computation overlap, reduced memory allocations, and ROCm-specific tuning, providing practical insights for developers working with distributed kernels on AMD GPUs.
Research

Oct 13, 2025
Performance Optimization for Reinforcement Learning on AMD GPUs
This blog presents our performance-optimization and parameter-tuning methodology for Reinforcement Learning (RL) workloads using the Verl framework on AMD’s MI300X GPU platform. By capitalizing on the MI300X’s 192 GB of unified memory per GPU, we tune the parallelism strategy to minimize inter-GPU communication across the three phases of GRPO; we also explore performance under various parallelism configurations and reveal the nontrivial relationship between parallelism degree and performance.
News
Inference

Apr 09, 2026
Common Bottlenecks in LLM Inference at Scale (And How to Fix Them)
Scaling LLM inference is harder than it looks. This guide breaks down the most common bottlenecks teams face in production and how to fix them to improve performance, throughput, and cost.
Distributed Inference
GPU Pods
Inference

Apr 08, 2026
OpenClaw Alternatives: What Developers Are Actually Using Instead
OpenClaw helped push autonomous AI agents into the mainstream, but it’s not the only option. This guide breaks down the most relevant OpenClaw alternatives in 2026 and how they differ in real-world usage.
OpenClaw
Distributed Inference
Inference

Apr 07, 2026
How to Optimize LLM Inference for Throughput and Cost (Real Production Strategies)
Running LLMs in production is expensive and complex. This guide breaks down how teams actually optimize inference systems for higher throughput and lower cost, from batching and GPU selection to scaling strategies.
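As a toy illustration of the batching idea the teaser above mentions (a sketch only; the batch size is a made-up parameter, and real inference servers use continuous, token-level batching rather than this simple request grouping):

```python
from collections import deque

def drain_batches(queue, max_batch_size):
    """Group pending requests into batches of at most max_batch_size.

    Batching amortizes per-forward-pass overhead: one GPU launch
    serves many requests instead of one. Illustrative sketch only.
    """
    batches = []
    while queue:
        take = min(max_batch_size, len(queue))
        batches.append([queue.popleft() for _ in range(take)])
    return batches

pending = deque(f"req-{i}" for i in range(10))
batches = drain_batches(pending, max_batch_size=4)
# 10 pending requests become batches of sizes 4, 4, and 2.
```

In practice the batch limit is bounded by GPU memory (KV-cache footprint), which is why batch size, model size, and GPU selection get tuned together.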
Cost Optimization
Distributed Inference
Inference

Apr 06, 2026
How LLM Inference Systems Actually Run in Production (Architecture Explained)
Most teams understand LLMs at a high level, but production inference systems are far more complex. This guide breaks down how real-world LLM inference works, from request handling to GPU execution and scaling across infrastructure.
Distributed Inference
Cost Optimization
Inference

Apr 03, 2026
Sora vs Runway vs Pika vs Kling: Which AI Video Model Is Best in 2026?
AI video is evolving fast, with models like Sora, Runway, Pika, and Kling leading the space. Here’s how they compare and how teams choose the right model for their use case.
Cost Optimization
Distributed Inference
Inference

Apr 03, 2026
Best Sora Alternatives in 2026 (And How to Avoid Getting Locked Into One Model)
Sora introduced a new level of AI video generation, but relying on a single model creates risk. Here are the best Sora alternatives and how teams build flexible video systems across models.
Cost Optimization
Distributed Inference
Inference

Apr 02, 2026
How to Use Multiple AI Models in One Application (Without Vendor Lock-In)
Modern AI applications don’t rely on a single model. Learn how teams use multiple AI models in one application to optimize cost, performance, and flexibility without increasing complexity.
Distributed Inference
Cost Optimization
Inference

Apr 01, 2026
OpenAI-Compatible APIs: How to Switch Models Without Changing Your Code
Switching AI models shouldn’t mean rebuilding your integration. This guide breaks down how OpenAI-compatible APIs let you use the same code while accessing multiple models, reducing friction and giving you more flexibility.
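The pattern described above can be sketched in a few lines: with an OpenAI-compatible endpoint, only the base URL, API key, and model name change, while the request shape stays the same. The endpoints and model names below are placeholders for illustration, not specific recommendations:

```python
def chat_request(base_url, model, messages):
    """Build an OpenAI-style chat-completions request.

    Compatible providers accept the same JSON schema, so switching
    models means changing only base_url and model; the calling code
    is untouched. Endpoints below are hypothetical examples.
    """
    return {
        "url": f"{base_url}/chat/completions",
        "json": {"model": model, "messages": messages},
    }

msgs = [{"role": "user", "content": "Hello"}]
a = chat_request("https://api.openai.com/v1", "gpt-4o-mini", msgs)
b = chat_request("https://api.example-provider.com/v1", "open-model-7b", msgs)
# The two request bodies differ only in the model field.
```

The same idea applies when using an SDK: client libraries that speak the OpenAI protocol typically let you point the same code at a different provider by overriding the base URL and API key.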
Cost Optimization
Distributed Inference
Inference

Apr 01, 2026
Best OpenAI API Alternatives in 2026 (Free, Open-Source, and Multi-Model Options)
Developers are exploring OpenAI alternatives to reduce costs, avoid vendor lock-in, and gain more flexibility. This guide breaks down what to look for and the best options in 2026.
Cost Optimization
Distributed Inference
Deep dives into multi-silicon AI optimization, infrastructure architecture, and the science behind Yotta's performance breakthroughs.