vLLM OpenAI-Compatible Server: A Drop-In Replacement for the OpenAI API

If your app already talks to the OpenAI API, you do not have to rewrite it to run your own model. vLLM ships a server that mirrors the OpenAI API surface, so you point your existing code at a vLLM endpoint by changing two things: the base URL and the API key. That is the whole pitch behind the vllm/vllm-openai image, and it is the fastest path from "we use the OpenAI API" to "we run our own model on our own GPU."

This post is about that compatibility layer specifically: what "OpenAI-compatible" actually covers, how to point existing code at it, what to double-check, and where to host it. If you want the deeper production story (concurrency, failover, autoscaling), that is in our vLLM in production with Docker guide.

TL;DR

vLLM's OpenAI-compatible server exposes the core OpenAI endpoints: /v1/chat/completions, /v1/completions, and /v1/models. Requests and responses use the standard OpenAI objects, so the official OpenAI SDK works once you override base_url and api_key and pass your served model name in the model field. The vllm/vllm-openai Docker image gives you that server in one command. The catch is that compatibility is broad but not total, so test the specific parameters your app depends on. On Yotta Labs you run the same image either as a GPU Pod you control or as a Serverless endpoint that scales workers for you.

What "OpenAI-compatible" actually means

It means vLLM serves the same HTTP routes your OpenAI client already calls. When the server boots, you will see them register:

INFO: Route: /v1/chat/completions, Methods: POST
INFO: Route: /v1/completions, Methods: POST
INFO: Route: /v1/models, Methods: GET
INFO: Application startup complete.

A request to /v1/chat/completions takes the usual model, messages, max_tokens, and temperature fields, and the response comes back as a standard chat completion object with choices, message, and usage. Any tool, framework, or SDK written against OpenAI treats it like the real thing because, on the wire, it looks like the real thing.
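To make the wire format concrete, here is a minimal sketch of the request body and of the response fields most apps read. The helper names build_chat_request and read_chat_response are made up for this post, and the sample response is hand-written to show the shape, not captured from a real server.

```python
import json


def build_chat_request(model, user_text,
                       system_text="You are a helpful assistant.",
                       max_tokens=256, temperature=0.7):
    """Build the JSON body for POST /v1/chat/completions."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_text},
            {"role": "user", "content": user_text},
        ],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }


def read_chat_response(body):
    """Pull the reply text, finish reason, and token count out of a chat completion."""
    choice = body["choices"][0]
    return {
        "text": choice["message"]["content"],
        "finish_reason": choice.get("finish_reason"),
        "total_tokens": body.get("usage", {}).get("total_tokens"),
    }


# Hand-written sample for illustration only, not real server output.
sample_response = {
    "id": "chatcmpl-example",
    "object": "chat.completion",
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "Hello!"},
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 12, "completion_tokens": 2, "total_tokens": 14},
}

request_body = build_chat_request("Qwen/Qwen2.5-7B-Instruct", "Say hello.")
print(json.dumps(request_body, indent=2))
print(read_chat_response(sample_response))
```

The SDK hides this behind typed objects, but the fields are the same ones, which is why code written against OpenAI keeps working.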

The practical payoff is that you keep your code. The work of switching to an open model becomes a hosting decision, not an application rewrite.

The vllm/vllm-openai image

The official image is vllm/vllm-openai. vLLM and its CUDA dependencies are baked in, so you do not install anything yourself. One command gives you a running server:

docker run --gpus all \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model Qwen/Qwen2.5-7B-Instruct \
  --gpu-memory-utilization 0.90

The model weights download on first start, so a 7B model takes a few minutes before the server is ready on port 8000. From there you have the OpenAI-compatible API live.
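Because the server is not reachable while the weights download, a script that starts the container and then calls it right away will fail. One way to handle that is to poll /v1/models until it answers. The sketch below uses only the standard library. The function names, the 10-minute ceiling, and the 5-second interval are assumptions for illustration, and it assumes no authentication sits in front of the server.

```python
import time
import urllib.request


def models_route_is_up(base_url, timeout_s=2.0):
    """Return True if GET {base_url}/models answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/models", timeout=timeout_s) as resp:
            return resp.status == 200
    except OSError:
        # Covers connection refused, timeouts, and HTTP error statuses
        # (urllib's URLError and HTTPError are OSError subclasses).
        return False


def wait_until_ready(base_url, max_wait_s=600.0, interval_s=5.0,
                     probe=models_route_is_up, sleep=time.sleep):
    """Poll the probe until it succeeds or max_wait_s has elapsed."""
    deadline = time.monotonic() + max_wait_s
    while True:
        if probe(base_url):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

Calling wait_until_ready("http://localhost:8000/v1") before the first request keeps startup scripts from racing the download. The probe and sleep arguments are injectable so the loop can be tested without a running server.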

Point your existing OpenAI code at it

This is the part that makes the compatibility worth it. Take whatever you already wrote against OpenAI and change the base URL and key.

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed-locally",  # set a real key once auth sits in front
)

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is continuous batching?"},
    ],
    max_tokens=256,
)

print(response.choices[0].message.content)

Or hit it with curl to confirm it works before wiring up the SDK:

curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain PagedAttention in two sentences."}
    ],
    "max_tokens": 256,
    "temperature": 0.7
  }'

Note that the model field has to match the model you actually served, not an OpenAI model name. That is the most common first mistake. You pass Qwen/Qwen2.5-7B-Instruct, not gpt-4.
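A cheap guard against that mistake is to compare the name you plan to send with what /v1/models reports. The sketch below works on the parsed JSON body of that route. The helper names are invented for this post, and sample_models is a hand-written stand-in for a real response.

```python
def served_model_ids(models_payload):
    """Extract the model ids from a parsed /v1/models response body."""
    return [entry["id"] for entry in models_payload.get("data", [])]


def check_model_name(requested, models_payload):
    """Return the requested name if it is served, otherwise raise ValueError."""
    ids = served_model_ids(models_payload)
    if requested not in ids:
        raise ValueError(f"model {requested!r} is not served; available: {ids}")
    return requested


# Hand-written sample for illustration only, not real server output.
sample_models = {
    "object": "list",
    "data": [{"id": "Qwen/Qwen2.5-7B-Instruct", "object": "model"}],
}

print(served_model_ids(sample_models))
print(check_model_name("Qwen/Qwen2.5-7B-Instruct", sample_models))
```

With the OpenAI SDK the same information should be available through client.models.list(), so the check can run once at startup instead of surfacing as a failed request later.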

What carries over and what to check

Compatibility is broad, but treat it as broad rather than total. The core chat and completion calls behave the way your OpenAI code expects. Streaming works by adding "stream": true. The standard sampling parameters carry over.
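On the client side, a streamed reply arrives as a sequence of chunks, each carrying a small piece of text in choices[0].delta.content. The sketch below shows one way to consume that sequence. collect_stream and fake_chunk are names made up for this post, and the fake chunks only imitate the attribute layout of the SDK's chunk objects so the example runs without a server.

```python
from types import SimpleNamespace


def collect_stream(chunks, on_text=None):
    """Join the delta.content pieces of a chat completion stream into one string."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:
            # Some chunks may carry no choices at all (for example usage-only chunks).
            continue
        piece = chunk.choices[0].delta.content
        if piece:
            # Skip chunks whose content is None or empty, such as a role-only chunk.
            parts.append(piece)
            if on_text is not None:
                on_text(piece)
    return "".join(parts)


def fake_chunk(text):
    """Imitate the attribute layout of a streamed chunk (illustration only)."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


fake_stream = [
    fake_chunk(None),
    fake_chunk("Continuous "),
    fake_chunk("batching"),
    SimpleNamespace(choices=[]),
]

print(collect_stream(fake_stream))
```

Against a live server you would pass the iterator returned by client.chat.completions.create(..., stream=True) in place of fake_stream, with on_text set to a function that prints each piece as it arrives.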

Where to slow down: features that sit at the edges of the OpenAI spec, like function and tool calling, structured output, logprobs, and certain newer parameters, can vary by vLLM version and by model. The safe move is to test the exact calls your application makes against the version of vLLM you plan to run, rather than assuming every OpenAI feature maps one to one. Pin the vLLM version once you have a setup that passes, so an image update does not quietly change behavior under you.

This is not a knock on vLLM. It is the normal reality of any compatibility layer. The 90 percent that carries over is what saves you the rewrite. The 10 percent is what you verify before you ship.
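That verification step can be a short script you rerun whenever the vLLM version or the model changes. The sketch below is one possible shape, not a prescribed test suite: make_checks and run_compat_checks are hypothetical helpers, the two checks are examples of calls an app might depend on, and a stub client stands in for the real one so the code runs offline.

```python
from types import SimpleNamespace


def make_checks(client, model):
    """Build one check per call the app depends on (illustrative selection)."""
    messages = [{"role": "user", "content": "Reply with the word ok."}]

    def basic_chat():
        resp = client.chat.completions.create(
            model=model, messages=messages, max_tokens=8
        )
        if not resp.choices[0].message.content:
            raise RuntimeError("empty reply")

    def logprobs():
        resp = client.chat.completions.create(
            model=model, messages=messages, max_tokens=8,
            logprobs=True, top_logprobs=2,
        )
        if resp.choices[0].logprobs is None:
            raise RuntimeError("no logprobs in response")

    return {"basic_chat": basic_chat, "logprobs": logprobs}


def run_compat_checks(checks):
    """Run every check and record 'ok' or the failure text per check name."""
    results = {}
    for name, check in checks.items():
        try:
            check()
            results[name] = "ok"
        except Exception as exc:  # any failure is a finding worth recording
            results[name] = f"FAILED: {type(exc).__name__}: {exc}"
    return results


# Stub client so the sketch runs offline: it answers chat but never returns logprobs.
def _stub_create(**kwargs):
    choice = SimpleNamespace(message=SimpleNamespace(content="ok"), logprobs=None)
    return SimpleNamespace(choices=[choice])


stub_client = SimpleNamespace(
    chat=SimpleNamespace(completions=SimpleNamespace(create=_stub_create))
)

results = run_compat_checks(make_checks(stub_client, "Qwen/Qwen2.5-7B-Instruct"))
for name, outcome in results.items():
    print(f"{name}: {outcome}")
```

Swap the stub for an OpenAI client pointed at the vLLM version you intend to pin, and add a check for each edge feature your app uses, such as tool calling or structured output.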

Running it on Yotta Labs

The same vllm/vllm-openai image you ran locally is the image you run on Yotta. There are two paths depending on how much control you want.

GPU Pods, for full control. A Pod is a GPU container you run your own image on. You deploy vllm/vllm-openai, expose port 8000, and get a vLLM endpoint with SSH and HTTP access plus live GPU, CPU, and memory metrics in the console. You operate it like a server, which is what you want when you need control over the exact runtime.

Serverless, for autoscaling. Yotta Serverless runs the same image as a managed endpoint with elastic worker scaling, multi-region scheduling, and built-in failover. You create an endpoint pointing at vllm/vllm-openai:latest with a vllm serve initialization command, pick a GPU type and region, and the platform provisions the worker and exposes the OpenAI-compatible API. Yotta's Serverless LLM tutorial walks the full sequence.

On the Serverless path the call pattern is slightly different from local. Once the endpoint is running you get a domain, and you call it using your Yotta API key as a bearer token:

from openai import OpenAI

client = OpenAI(
    base_url="https://YOUR_ENDPOINT_DOMAIN/v1",
    api_key="YOUR_API_KEY",  # Yotta Labs API key, used as the bearer token
)

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is PagedAttention?"},
    ],
    max_tokens=256,
    temperature=0.7,
)

print(response.choices[0].message.content)

Same OpenAI SDK, same two-line change, now pointed at a managed endpoint instead of localhost. For the production concerns that show up once real traffic hits (scaling, cold starts, multi-GPU models), the vLLM in production with Docker guide goes deeper.
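Since only the base URL and key differ between local and hosted, one option is to read both from the environment so the same code runs in either place. This is a minimal sketch: LLM_BASE_URL and LLM_API_KEY are variable names invented for this post, not something vLLM or Yotta defines, and the defaults mirror the local example above.

```python
import os


def client_settings(env=None):
    """Resolve base_url and api_key from environment variables.

    LLM_BASE_URL and LLM_API_KEY are hypothetical names chosen for this sketch.
    """
    env = os.environ if env is None else env
    return {
        "base_url": env.get("LLM_BASE_URL", "http://localhost:8000/v1"),
        "api_key": env.get("LLM_API_KEY", "not-needed-locally"),
    }


# Local default versus a hosted endpoint, using plain dicts in place of os.environ.
print(client_settings({}))
print(client_settings({
    "LLM_BASE_URL": "https://YOUR_ENDPOINT_DOMAIN/v1",
    "LLM_API_KEY": "YOUR_API_KEY",
}))
```

The returned dict can be passed straight to the SDK constructor as OpenAI(**client_settings()), which keeps keys out of source control and makes the switch between localhost and the managed endpoint a deployment setting.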

If you would rather not run an inference server at all, Yotta's AI Gateway gives you an OpenAI-compatible endpoint to hosted models without deploying anything. Self-hosting vLLM makes sense when you need a specific open model, your own fine-tuned weights, or control over the runtime. If you just need a model behind an API, the Gateway is less work.

FAQ

What is the difference between vLLM and the vllm/vllm-openai image? vLLM is the inference engine. The vllm/vllm-openai image is the engine packaged with an OpenAI-compatible API server, so you get the /v1/chat/completions style endpoints out of the box without building the server yourself.

Do I have to change my application code? Barely. Because the server is OpenAI-compatible, you change base_url to your endpoint, set api_key, and use your served model name in the model field instead of an OpenAI model name. The rest of your OpenAI SDK code stays as it is.

Does streaming work? Yes. Add "stream": true to the request, the same way you would with OpenAI. The server returns the reply incrementally as a stream of chunks (server-sent events) rather than as one final object.

Are tool calling and structured outputs supported? Often, but support varies by vLLM version and by model. Test the exact feature your app relies on against the version you plan to run, rather than assuming full parity.

Why does the server reject my request with a model error? The model field has to match the model you served, for example Qwen/Qwen2.5-7B-Instruct. Passing an OpenAI model name like gpt-4 will not resolve.

Should I use Pods or Serverless on Yotta? Use Pods when you want full control and a long-running server you manage. Use Serverless when you want elastic scaling, multi-region scheduling, and failover without operating the infrastructure yourself.

Is this the same as Yotta's AI Gateway? No. The vLLM OpenAI server is your own deployed endpoint running a model you chose. The AI Gateway is a managed aggregator that gives you one OpenAI-compatible API across hosted models without deploying anything.

Bottom line

The reason teams reach for the vLLM OpenAI-compatible server is that it turns "switch to an open model" into a two-line change instead of a rewrite. Point your base URL and key at the endpoint and your existing OpenAI code runs against a model you control. Verify the handful of features that live at the edge of the spec, pin your version, and you have a real OpenAI alternative running on your own GPU.

Deploy a vLLM endpoint on Yotta from the console, follow the Serverless LLM tutorial for the full sequence, or read the production Docker guide before you size your GPUs.
