The Question Nobody Asks Until It Breaks
When you use ChatGPT or Gemini, you type a question and watch words appear on the screen one at a time. Some chatbots feel fast. Others feel slow. Even when they use the same underlying model, the experience can be wildly different.
Why?
The answer has almost nothing to do with the model itself. It has everything to do with the inference engine, the software that actually runs the model and serves it to users.
This is one of the most overlooked parts of working with Large Language Models. Everyone talks about which model is the smartest. Almost nobody talks about how the model is served. But when you go from "I built a demo" to "I run a service that thousands of people use," the inference engine becomes the thing that determines whether your system works or falls over.
What Is "Inference," Really?
When you train an LLM, that is one phase. You feed it data, it learns patterns, and you end up with a model file.
When you use the model to answer a question, that is called inference. The model takes your prompt, processes it, and generates text one token at a time. A token is roughly a word or part of a word.
The speed of this output is measured in tokens per second. That is the number you care about. Higher is better. Users wait less, and you can serve more people with the same hardware.
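Tokens per second is nothing more exotic than tokens generated divided by wall-clock time. A minimal sketch, using a stand-in function in place of a real engine:

```python
import time

def tokens_per_second(generate, prompt):
    """Time a generation call and report throughput.

    `generate` is any callable that takes a prompt and returns a list
    of tokens; here it is a hypothetical stand-in for a real engine.
    """
    start = time.perf_counter()
    tokens = generate(prompt)
    elapsed = time.perf_counter() - start
    return len(tokens) / elapsed

# Toy "engine" that instantly emits 100 tokens.
fake_engine = lambda prompt: ["tok"] * 100
rate = tokens_per_second(fake_engine, "hello")
```

Real engines report this same metric directly; the point is only that it is a ratio of output size to elapsed time, so anything that reduces the time (better batching, better memory use) shows up here.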
Here is the critical insight: the same model, on the same hardware, can produce wildly different tokens per second depending on which inference engine you use.
The Inference Engine Landscape
There are many inference engines out there, each optimized for a different use case: llama.cpp targets local inference on CPU and RAM, TensorRT-LLM squeezes maximum performance out of NVIDIA hardware, and a long tail of others fill their own niches.
Out of all of these, vLLM stands out for one scenario in particular: serving many users at once.
The Multi-User Problem
Serving one person at a time with an LLM is easy. You load the model into GPU memory, process the request, return the output.
The problem starts when you have 10 users, 100 users, or 10,000 users all asking questions at the same time. Now your system has to process multiple requests simultaneously on the same GPU. This is where most naive setups collapse.
The main bottleneck is something called the KV cache.
What Is the KV Cache?
During inference, the model does not just process your prompt once. For every new token it generates, it needs to remember everything that came before. To avoid recomputing this from scratch on every step, it keeps intermediate calculations in a structure called the Key-Value cache, or KV cache for short.
This cache lives in GPU memory. And it can be huge. For long prompts and long outputs, the KV cache can easily consume several gigabytes per user.
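To see where those gigabytes come from, here is a back-of-envelope calculation. The dimensions below are assumptions for a Llama-2-7B-shaped model (32 layers, 32 KV heads, head dimension 128, 16-bit values); other models will differ, but the arithmetic is the same:

```python
# Per-token KV cache size: one key and one value vector per layer per head.
layers, kv_heads, head_dim = 32, 32, 128   # assumed 7B-class dimensions
bytes_per_value = 2                        # fp16

kv_per_token = 2 * layers * kv_heads * head_dim * bytes_per_value  # K and V

context_tokens = 4096
total = kv_per_token * context_tokens

print(kv_per_token)    # 524288 bytes: 512 KiB for every single token
print(total / 2**30)   # 2.0 GiB for one 4096-token request
```

Half a megabyte per token sounds small until you multiply it by a long context and then by the number of concurrent users.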
The Classic KV Cache Problem
Here is where traditional inference systems waste massive amounts of memory.
When a new request comes in, the system has no idea how long the output will be. Maybe the user asks for a 10-word answer. Maybe they ask for 2000 words. The system does not know in advance.
So what does it do? It pre-allocates memory for the worst case. If the maximum possible output is 2048 tokens, it reserves space for 2048 tokens every single time. Even if the actual output ends up being only 50 tokens.
This is like reserving a 20-seat table at a restaurant for a party of 2, just in case 18 more friends show up. Most of the table sits empty.
Studies found that traditional systems waste 60 to 80 percent of their GPU memory this way. That wasted memory is memory that could have been used to serve more concurrent users.
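Using the numbers from the example above, the waste of worst-case pre-allocation is easy to quantify:

```python
# How much of a worst-case reservation goes unused when the
# actual output turns out to be short.
reserved_tokens = 2048   # pre-allocated for the maximum possible output
used_tokens = 50         # what the request actually generated

waste = 1 - used_tokens / reserved_tokens
print(f"{waste:.1%}")    # 97.6% of the reservation sat idle
```

And because that reservation is held for the entire lifetime of the request, no other user can touch it even while it sits empty.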
PagedAttention: The vLLM Breakthrough
vLLM introduced a technique called PagedAttention that solves this problem beautifully. And the idea comes from a place you would not expect: how operating systems manage RAM.
Your laptop's operating system does not give a program one big continuous chunk of physical memory. That would be inefficient. Instead, it breaks memory into small fixed-size pages, usually 4 kilobytes each. When a program needs more memory, the OS hands out more pages. When the program is done, the pages go back into the pool.
This eliminates waste. Memory is allocated on demand, in small chunks, and freed when no longer needed.
vLLM applies this exact same idea to the KV cache. Instead of reserving one giant block of GPU memory per request, it breaks the KV cache into small fixed-size pages. Each request gets only the pages it currently needs. When the request grows, more pages are allocated. When the request finishes, those pages become available for other requests.
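The mechanism can be sketched as a toy allocator. This is purely illustrative, in the spirit of PagedAttention; vLLM's real block manager is far more involved, but the core idea of on-demand fixed-size pages drawn from a shared pool is the same:

```python
class PagePool:
    """Toy page-based KV cache allocator (illustrative only)."""

    def __init__(self, num_pages, tokens_per_page=16):
        self.free = list(range(num_pages))   # pool of free page indices
        self.tokens_per_page = tokens_per_page
        self.pages = {}                      # request id -> its page indices
        self.tokens = {}                     # request id -> tokens written

    def append_token(self, req_id):
        """Grow one request's KV cache by a single token."""
        used = self.tokens.get(req_id, 0)
        if used % self.tokens_per_page == 0:   # last page full, or none yet
            if not self.free:
                raise MemoryError("out of KV cache pages")
            self.pages.setdefault(req_id, []).append(self.free.pop())
        self.tokens[req_id] = used + 1

    def release(self, req_id):
        """Request finished: its pages go straight back into the pool."""
        self.free.extend(self.pages.pop(req_id, []))
        self.tokens.pop(req_id, None)

pool = PagePool(num_pages=8)
for _ in range(20):                  # a 20-token request...
    pool.append_token("req-a")
# ...occupies only ceil(20 / 16) = 2 pages, not a worst-case block.
```

A request that generates 20 tokens holds 2 small pages instead of a 2048-token reservation, and the moment it finishes, those pages are reusable by anyone else.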
The result: GPU memory utilization jumps from around 20 percent to around 95 percent. That means the same GPU can serve four to five times more concurrent users.
Why This Matters in Practice
If you are running a service, this has direct business implications:
Cost. GPUs are expensive. If you can serve 5 times more users on the same GPU, your cost per user drops by 5 times.
Speed under load. A system that handles concurrency well keeps feeling fast even when busy. A poorly designed system slows to a crawl the moment traffic picks up.
Throughput versus latency. vLLM prioritizes overall throughput (total tokens per second across all users). Individual requests might take slightly longer than on a single-user system, but the aggregate system produces far more output per second.
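The tradeoff in that last point is easiest to see with numbers. These are made-up illustrative figures, not benchmarks:

```python
# Serving one request at a time:
solo_user_rate = 60                      # the single user sees 60 tok/s
solo_throughput = solo_user_rate * 1     # system total: 60 tok/s

# Batching 8 requests on the same GPU: each stream slows a little...
batched_user_rate = 45                   # each user now sees 45 tok/s
batched_throughput = batched_user_rate * 8

print(solo_throughput)     # 60 tok/s total
print(batched_throughput)  # 360 tok/s total: 6x the work per second
```

Each user gives up a quarter of their individual speed, and in exchange the system as a whole does six times more work. That is the bargain vLLM is designed around.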
vLLM Is Not Always the Right Choice
vLLM excels at one thing: serving many users concurrently on a GPU. That is not everyone's problem.
Running a model locally on your laptop? Use llama.cpp. It is optimized for CPU and RAM, not GPU throughput.
Running on NVIDIA hardware with maximum vendor optimization? TensorRT-LLM will likely give you the absolute best performance, but it is tightly coupled to NVIDIA's stack.
Low traffic, single user? Almost any engine works. The choice matters less because the multi-user efficiency gap disappears when you only have one user.
OpenAI-Compatible API
One of the quiet but important features of vLLM: it exposes the exact same API format as OpenAI's. That means if your application already talks to the OpenAI API, you can switch to a self-hosted vLLM server by changing one line of configuration: the base URL.
from openai import OpenAI

# Before: talking to OpenAI's hosted API
client = OpenAI(
    base_url="https://api.openai.com/v1",
    api_key="sk-...",
)

# After: talking to your self-hosted vLLM server
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",  # vLLM ignores the key unless you configure one
)
No code changes. No rewriting. Just point the client at your own server.
This makes vLLM a natural migration path for teams that start with OpenAI for prototyping and then want to self-host to cut costs or keep data private.
Tuning Parameters That Actually Matter
When you run vLLM in production, two parameters determine most of the behavior:
max_model_len sets the maximum context window. If your users only send short prompts, lowering this number reduces the memory needed per request, which lets you fit more concurrent users on the same GPU.
max_num_seqs limits how many concurrent requests the system processes at once. Higher means more concurrency but more memory pressure. Lower means each user gets more individual attention but fewer people fit at once.
There is no universal "right" value for either. The right setting depends on your workload. Short prompts and many users push you toward lower max_model_len and higher max_num_seqs. Long prompts and fewer users push the opposite direction.
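That tradeoff can be sketched with rough capacity arithmetic. The figures below are illustrative assumptions, not measurements: roughly 512 KiB of KV cache per token (plausible for a 7B-class model in fp16) and 40 GiB of GPU memory left over for the cache after model weights:

```python
# Rough capacity planning for the two knobs above (assumed numbers).
kv_bytes_per_token = 512 * 1024   # ~512 KiB per token, 7B-class fp16 model
kv_budget = 40 * 2**30            # 40 GiB of GPU memory left for KV cache

def max_concurrent_seqs(max_model_len):
    """Worst case: every sequence fills its whole context window."""
    return kv_budget // (kv_bytes_per_token * max_model_len)

print(max_concurrent_seqs(4096))  # 20 full-length sequences fit
print(max_concurrent_seqs(1024))  # 80: a quarter of the context, 4x the users
```

The same arithmetic, read in the other direction, tells you how far you can raise max_num_seqs before the cache becomes the bottleneck.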
What to Monitor in Production
If you run an LLM service, the minimum set of metrics to watch:
Throughput. Total tokens per second across all requests. This is the aggregate number the whole system is optimized for.
Latency. Time to first token (how long users stare at a blank screen) and time per output token after that.
KV cache usage. How full the GPU memory pool is. When it fills up, new requests have to wait or existing ones get preempted.
Queue depth. How many requests are waiting to be scheduled. A steadily growing queue means you are over capacity.
In practice, teams usually pipe these into Prometheus and Grafana. For a single-server setup, even a basic dashboard is enough to catch the obvious problems.
The One Thing to Remember
For most of the history of LLMs, the conversation has been about which model is better. That is still important, but increasingly, the bottleneck is not the model. It is the serving infrastructure around it.
A mediocre model served well beats a great model served badly, especially when users are waiting for responses.
vLLM is not magic. It is just a carefully engineered system that took an old operating systems idea (paging) and applied it to the specific problem of KV cache management. That one insight turned a 20 percent utilization system into a 95 percent utilization system, and made it possible to economically serve LLMs to many users on modest hardware.
If you are building anything beyond a single-user demo, understanding this layer is not optional.