March 23, 2026 by Yotta Labs
How to Run NemoClaw on VMs with Local LLM Inference
Learn how to run NemoClaw with local LLM inference on a GPU-powered VM. This guide covers the architecture, setup, and performance considerations for running autonomous agents fully locally.

Running AI agents in production often comes down to tradeoffs between cost, latency, and control. While many teams rely on external APIs for inference, there is a growing shift toward running models locally, especially for workloads that require consistent performance and tighter control over infrastructure.
In this guide, we’ll walk through how to run NemoClaw on a GPU-powered VM with local LLM inference. This setup allows you to run an autonomous agent fully locally using your own hardware.
If you’re looking for the full command-by-command setup and exact configuration, you can follow the complete tutorial in our docs.
What This Setup Looks Like
At a high level, this setup consists of:
- A GPU-enabled VM to host the stack
- A local model server powered by llama.cpp
- A proxy layer (OpenShell) that routes inference requests
- NemoClaw running as the agent runtime
In this architecture, the agent sends requests to a local inference endpoint, which is handled by your model server. This allows you to run inference entirely within your own environment.
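Concretely, the agent's request is an ordinary OpenAI-style chat completion POSTed to a local URL. Below is a minimal sketch of that request; the host, port (8080 is llama.cpp's server default), and model name are placeholder assumptions, not values from this guide:

```shell
# The agent-side request, reduced to its essentials. Adjust ENDPOINT to
# wherever your proxy layer actually listens.
ENDPOINT="http://127.0.0.1:8080/v1/chat/completions"   # assumed host/port

# An OpenAI-compatible chat payload; "local" is a placeholder model name.
PAYLOAD='{
  "model": "local",
  "messages": [{"role": "user", "content": "ping"}]
}'

# Send it once the model server is up (commented out here):
# curl -s "$ENDPOINT" -H "Content-Type: application/json" -d "$PAYLOAD"
echo "would POST to $ENDPOINT"
```

Because the endpoint speaks the OpenAI wire format, the agent runtime does not need to know it is talking to a local server rather than a hosted API.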
Why Run Local Inference
Running inference locally is becoming increasingly common for teams deploying AI agents in production.
Some key benefits include:
- Lower latency — requests stay within your infrastructure
- Cost control — no per-token API costs
- Data privacy — sensitive data does not leave your environment
- Performance tuning — full control over models and hardware
For agent-based systems that run continuously, these advantages can make a significant difference.
Requirements
To run this setup, you’ll need:
- A GPU-enabled VM (for example, RTX 6000 Ada or similar)
- Ubuntu (22.04 or equivalent)
- CUDA (NVIDIA driver and toolkit) installed
- Docker with NVIDIA container support
- Node.js
- Sufficient disk space (~50 GB recommended)
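Before starting, it can help to confirm the tooling is on your PATH. A quick preflight sketch; the exact tool list depends on your setup, and git and cmake are assumptions for the llama.cpp build step later:

```shell
# Collect any missing prerequisites instead of failing on the first one.
MISSING=""
need() { command -v "$1" >/dev/null 2>&1 || MISSING="$MISSING $1"; }

need nvidia-smi   # confirms the NVIDIA driver is installed
need docker
need node
need git          # assumed: used to fetch llama.cpp
need cmake        # assumed: used to build llama.cpp

if [ -n "$MISSING" ]; then
  echo "missing:$MISSING"
else
  echo "all prerequisites found"
fi
```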
Step-by-Step Overview
At a high level, the process looks like this:
1. Install NemoClaw
Install the NemoClaw CLI and initialize your environment.
2. Build the Local Inference Engine
Install and compile llama.cpp with GPU support to serve your local model.
3. Download a Model
Download a compatible GGUF model based on your hardware and performance requirements.
4. Start the Model Server
Launch the local inference server and expose an endpoint for requests.
5. Register the Inference Provider
Configure NemoClaw to use your local inference endpoint instead of an external API.
6. Configure the Agent Runtime
Update the configuration so the agent uses your local model.
7. Test the Setup
Verify that inference requests are working and that the agent responds correctly.
For the full setup, commands, and configuration details, refer to the complete tutorial in our docs.
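Condensed, steps 2 through 4 look roughly like the following. The repository URL, CMake flag, and server flags follow current llama.cpp conventions; the model file is a placeholder, so treat this as a sketch rather than the exact commands from the docs:

```shell
# Step 2: build llama.cpp with CUDA support.
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

# Step 3: download a GGUF model sized for your VRAM. The path below is
# a placeholder; fetch a real model from your preferred source.
# wget <your-model-url> -O models/model.gguf

# Step 4: serve an OpenAI-compatible API. -ngl 99 offloads all layers
# to the GPU; -c sets the context window.
./build/bin/llama-server -m models/model.gguf \
  --host 0.0.0.0 --port 8080 -ngl 99 -c 8192
```

Steps 5 and 6 then point NemoClaw's provider configuration at `http://<host>:8080/v1` in place of an external API base URL.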
Performance
Performance depends on your GPU and model configuration.
On high-end GPUs like the RTX 6000 Ada:
- Q4_K_XL: ~120–140 tokens/sec
- Q6_K: ~120–140 tokens/sec with higher output quality
In many cases, performance is limited more by memory bandwidth than raw compute.
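A rough way to see the bandwidth limit: during decoding, each generated token streams essentially the full set of model weights through the GPU, so memory bandwidth divided by model size gives an upper bound on tokens per second. The figures below are illustrative assumptions (roughly 960 GB/s for an RTX 6000 Ada, a ~5 GB quantized model), not measurements from this guide:

```shell
# Back-of-envelope decode-speed ceiling: tokens/sec <= bandwidth / model size.
BANDWIDTH_GB_S=960   # assumed GPU memory bandwidth
MODEL_GB=5           # assumed size of the quantized weights

echo "upper bound: $((BANDWIDTH_GB_S / MODEL_GB)) tokens/sec"
```

Real throughput lands well below this ceiling due to KV-cache traffic and compute overhead, which is consistent with the observed 120–140 tokens/sec range.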
Common Issues
When running local inference setups, a few common issues can arise:
- Model output formatting errors: ensure your inference server returns OpenAI-compatible responses.
- Networking issues inside containers: localhost may not work depending on your environment; use the correct IP.
- Inference endpoint not resolving: make sure requests are routed correctly through your proxy layer.
- Timeouts during verification: confirm the inference server is reachable from the runtime.
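For the container-networking case specifically, here is a sketch of finding the right address. Inside a container, `localhost` resolves to the container itself, not the VM; the bridge-gateway fallback and port below are assumptions about a default Docker setup:

```shell
# The host is usually reachable at the container's default gateway on
# the Docker bridge network (commonly 172.17.0.1).
HOST_IP=$(ip route 2>/dev/null | awk '/default/ {print $3; exit}')
HOST_IP=${HOST_IP:-172.17.0.1}   # fallback if ip(8) is unavailable

echo "try http://$HOST_IP:8080 from inside the container"

# llama.cpp's server exposes a /health endpoint that is handy here:
# curl -s "http://$HOST_IP:8080/health"
```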
Why This Matters
Running NemoClaw with local inference gives you full control over how your AI agents operate.
However, even a single-node setup involves:
- GPU configuration
- Model optimization
- Container networking
- Inference routing
- Runtime configuration
As you scale beyond a single VM, managing this infrastructure becomes significantly more complex.
This is where orchestration, GPU scheduling, and infrastructure automation become critical.
If you want to understand how this fits into production environments, see our guide: How to Deploy OpenClaw in Production (Docker, Kubernetes, and GPU Infrastructure).
Final Thoughts
Local inference is quickly becoming a key part of running AI agents in production.
With NemoClaw and tools like llama.cpp, it’s now possible to run powerful models locally with strong performance and full control over your infrastructure.
As demand for AI infrastructure grows, understanding how to deploy and manage these systems will become increasingly important.
