March 23, 2026 by Yotta Labs
How to Run NemoClaw on VMs with Local LLM Inference
Learn how to run NemoClaw with local LLM inference on a GPU-powered VM. This guide covers the architecture, setup, and performance considerations for running autonomous agents fully locally.

Running AI agents in production often comes down to tradeoffs between cost, latency, and control. While many teams rely on external APIs for inference, there is a growing shift toward running models locally, especially for workloads that require consistent performance and tighter control over infrastructure.
In this guide, we’ll walk through how to run NemoClaw on a GPU-powered VM with local LLM inference. This setup allows you to run an autonomous agent fully locally using your own hardware.
If you’re looking for the full command-by-command setup and exact configuration, you can follow the complete tutorial in our docs.
What This Setup Looks Like
At a high level, this setup consists of:
- A GPU-enabled VM to host the stack
- A local model server powered by llama.cpp
- A proxy layer (OpenShell) that routes inference requests
- NemoClaw running as the agent runtime
In this architecture, the agent sends requests to a local inference endpoint, which is handled by your model server. This allows you to run inference entirely within your own environment.
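Concretely, the agent's request is an ordinary OpenAI-style chat completion POSTed to a local URL. Below is a minimal sketch of that request; the host, port (8080 is llama.cpp's server default), and model name are placeholder assumptions, not values from this guide:

```shell
# The agent-side request, reduced to its essentials. Adjust ENDPOINT to
# wherever your proxy layer actually listens.
ENDPOINT="http://127.0.0.1:8080/v1/chat/completions"   # assumed host/port

# An OpenAI-compatible chat payload; "local" is a placeholder model name.
PAYLOAD='{
  "model": "local",
  "messages": [{"role": "user", "content": "ping"}]
}'

# Send it once the model server is up (commented out here):
# curl -s "$ENDPOINT" -H "Content-Type: application/json" -d "$PAYLOAD"
echo "would POST to $ENDPOINT"
```

Because the endpoint speaks the OpenAI wire format, the agent runtime does not need to know it is talking to a local server rather than a hosted API.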
Why Run Local Inference
Running inference locally is becoming increasingly common for teams deploying AI agents in production.
Some key benefits include:
- Lower latency — requests stay within your infrastructure
- Cost control — no per-token API costs
- Data privacy — sensitive data does not leave your environment
- Performance tuning — full control over models and hardware
For agent-based systems that run continuously, these advantages can make a significant difference.
Requirements
To run this setup, you’ll need:
- A GPU-enabled VM (for example, RTX 6000 Ada or similar)
- Ubuntu (22.04 or equivalent)
- CUDA (NVIDIA driver and toolkit) installed
- Docker with NVIDIA container support
- Node.js
- Sufficient disk space (~50 GB recommended)
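Before starting, it can help to confirm the tooling is on your PATH. A quick preflight sketch; the exact tool list depends on your setup, and git and cmake are assumptions for the llama.cpp build step later:

```shell
# Collect any missing prerequisites instead of failing on the first one.
MISSING=""
need() { command -v "$1" >/dev/null 2>&1 || MISSING="$MISSING $1"; }

need nvidia-smi   # confirms the NVIDIA driver is installed
need docker
need node
need git          # assumed: used to fetch llama.cpp
need cmake        # assumed: used to build llama.cpp

if [ -n "$MISSING" ]; then
  echo "missing:$MISSING"
else
  echo "all prerequisites found"
fi
```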
Step-by-Step Overview
At a high level, the process looks like this:
1. Install NemoClaw
Install the NemoClaw CLI and initialize your environment.
2. Build the Local Inference Engine
Install and compile llama.cpp with GPU support to serve your local model.
3. Download a Model
Download a compatible GGUF model based on your hardware and performance requirements.
4. Start the Model Server
Launch the local inference server and expose an endpoint for requests.
5. Register the Inference Provider
Configure NemoClaw to use your local inference endpoint instead of an external API.
6. Configure the Agent Runtime
Update the configuration so the agent uses your local model.
7. Test the Setup
Verify that inference requests are working and that the agent responds correctly.
For the full setup, commands, and configuration details, refer to the complete tutorial in our docs.
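Condensed, steps 2 through 4 look roughly like the following. The repository URL, CMake flag, and server flags follow current llama.cpp conventions; the model file is a placeholder, so treat this as a sketch rather than the exact commands from the docs:

```shell
# Step 2: build llama.cpp with CUDA support.
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

# Step 3: download a GGUF model sized for your VRAM. The path below is
# a placeholder; fetch a real model from your preferred source.
# wget <your-model-url> -O models/model.gguf

# Step 4: serve an OpenAI-compatible API. -ngl 99 offloads all layers
# to the GPU; -c sets the context window.
./build/bin/llama-server -m models/model.gguf \
  --host 0.0.0.0 --port 8080 -ngl 99 -c 8192
```

Steps 5 and 6 then point NemoClaw's provider configuration at `http://<host>:8080/v1` in place of an external API base URL.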
Performance
Performance depends on your GPU and model configuration.
On high-end GPUs like the RTX 6000 Ada:
- Q4_K_XL: ~120–140 tokens/sec
- Q6_K: ~120–140 tokens/sec with higher output quality
In many cases, performance is limited more by memory bandwidth than raw compute.
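A rough way to see the bandwidth limit: during decoding, each generated token streams essentially the full set of model weights through the GPU, so memory bandwidth divided by model size gives an upper bound on tokens per second. The figures below are illustrative assumptions (roughly 960 GB/s for an RTX 6000 Ada, a ~5 GB quantized model), not measurements from this guide:

```shell
# Back-of-envelope decode-speed ceiling: tokens/sec <= bandwidth / model size.
BANDWIDTH_GB_S=960   # assumed GPU memory bandwidth
MODEL_GB=5           # assumed size of the quantized weights

echo "upper bound: $((BANDWIDTH_GB_S / MODEL_GB)) tokens/sec"
```

Real throughput lands well below this ceiling due to KV-cache traffic and compute overhead, which is consistent with the observed 120–140 tokens/sec range.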
Common Issues
When running local inference setups, a few common issues can arise:
- Model output formatting errors: ensure your inference server returns OpenAI-compatible responses.
- Networking issues inside containers: localhost may not work depending on your environment; use the correct IP.
- Inference endpoint not resolving: make sure requests are routed correctly through your proxy layer.
- Timeouts during verification: confirm the inference server is reachable from the runtime.
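For the container-networking case specifically, here is a sketch of finding the right address. Inside a container, `localhost` resolves to the container itself, not the VM; the bridge-gateway fallback and port below are assumptions about a default Docker setup:

```shell
# The host is usually reachable at the container's default gateway on
# the Docker bridge network (commonly 172.17.0.1).
HOST_IP=$(ip route 2>/dev/null | awk '/default/ {print $3; exit}')
HOST_IP=${HOST_IP:-172.17.0.1}   # fallback if ip(8) is unavailable

echo "try http://$HOST_IP:8080 from inside the container"

# llama.cpp's server exposes a /health endpoint that is handy here:
# curl -s "http://$HOST_IP:8080/health"
```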
Why This Matters
Running NemoClaw with local inference gives you full control over how your AI agents operate.
However, even a single-node setup involves:
- GPU configuration
- Model optimization
- Container networking
- Inference routing
- Runtime configuration
As you scale beyond a single VM, managing this infrastructure becomes significantly more complex.
This is where orchestration, GPU scheduling, and infrastructure automation become critical.
If you want to understand how this fits into production environments, see our guide: How to Deploy OpenClaw in Production (Docker, Kubernetes, and GPU Infrastructure).
Final Thoughts
Local inference is quickly becoming a key part of running AI agents in production.
With NemoClaw and tools like llama.cpp, it’s now possible to run powerful models locally with strong performance and full control over your infrastructure.
As demand for AI infrastructure grows, understanding how to deploy and manage these systems will become increasingly important.
