Efficient ML & Distributed GPU Orchestration
We publish practical research on workload optimization and orchestration across heterogeneous GPUs. Find papers, reproducible benchmarks, grants, and media coverage.
PUBLICATIONS
Peer-reviewed papers and preprints on efficient training, model offloading, inference latency, and GPU scheduling.
Highly efficient training and inference of billion-scale AI models on affordable GPUs
ZeRO-Offload and Sentinel for transformers
DyNN-Offload for Mixture-of-Experts (MoE)
TECO-Offload on disaggregated memory
Billion-scale graph neural networks
AI training based on parallelism management
Runtime Concurrency Control and Operation Scheduling
Tree-structure-aware, high-performance inference engine
AI training using novel hardware
Energy-efficient training on GPU-FPGA accelerators
Processing-in-memory for energy-efficient DNNs
AWARD
Decentralized AI Computing Operating System for Accessible and Cost-Effective AI
National Science Foundation (NSF)