The Question Nobody Asks Until It Breaks
When you use ChatGPT or Gemini, you type a question and watch words appear on the screen one at a time. Some chatbots feel fast. Others feel slow. Even when they use the same underlying model, the experience can be wildly different.
Why?
The answer has almost nothing to do with the model itself. It has everything to do with the inference engine, the software that actually runs the model and serves it to users.
This is one of the most overlooked parts of working with Large Language Models. Everyone talks about which model is the smartest. Almost nobody talks about how the model is served. But when you go from "I built a demo" to "I run a service that thousands of people use," the inference engine becomes the thing that determines whether your system works or falls over.
What Is "Inference," Really?
When you train an LLM, that is one phase. You feed it data, it learns patterns, and you end up with a model file.
When you use the model to answer a question, that is called inference. The model takes your prompt, processes it, and generates text one token at a time. A token is roughly a word or part of a word.
The speed of this output is measured in tokens per second. That is the number you care about. Higher is better. Users wait less, and you can serve more people with the same hardware.
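Tokens per second is nothing more exotic than tokens generated divided by wall-clock time. A minimal sketch, using a stand-in function in place of a real engine:

```python
import time

def tokens_per_second(generate, prompt):
    """Time a generation call and report throughput.

    `generate` is any callable that takes a prompt and returns a list
    of tokens; here it is a hypothetical stand-in for a real engine.
    """
    start = time.perf_counter()
    tokens = generate(prompt)
    elapsed = time.perf_counter() - start
    return len(tokens) / elapsed

# Toy "engine" that instantly emits 100 tokens.
fake_engine = lambda prompt: ["tok"] * 100
rate = tokens_per_second(fake_engine, "hello")
```

Real engines report this same metric directly; the point is only that it is a ratio of output size to elapsed time, so anything that reduces the time (better batching, better memory use) shows up here.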
Here is the critical insight: the same model, on the same hardware, can produce wildly different tokens per second depending on which inference engine you use.
The Inference Engine Landscape
There are many inference engines out there, each optimized for a different use case: llama.cpp targets local inference on CPU and RAM, TensorRT-LLM squeezes maximum performance out of NVIDIA hardware, and a long tail of others fill their own niches.
Out of all of these, vLLM stands out for one scenario in particular: serving many users at once.
The Multi-User Problem
Serving one person at a time with an LLM is easy. You load the model into GPU memory, process the request, return the output.
The problem starts when you have 10 users, 100 users, or 10,000 users all asking questions at the same time. Now your system has to process multiple requests simultaneously on the same GPU. This is where most naive setups collapse.
The main bottleneck is something called the KV cache.
What Is the KV Cache?
During inference, the model does not just process your prompt once. For every new token it generates, it needs to remember everything that came before. To avoid recomputing this from scratch on every step, it keeps intermediate calculations in a structure called the Key-Value cache, or KV cache for short.
This cache lives in GPU memory. And it can be huge. For long prompts and long outputs, the KV cache can easily consume several gigabytes per user.
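To see where those gigabytes come from, here is a back-of-envelope calculation. The dimensions below are assumptions for a Llama-2-7B-shaped model (32 layers, 32 KV heads, head dimension 128, 16-bit values); other models will differ, but the arithmetic is the same:

```python
# Per-token KV cache size: one key and one value vector per layer per head.
layers, kv_heads, head_dim = 32, 32, 128   # assumed 7B-class dimensions
bytes_per_value = 2                        # fp16

kv_per_token = 2 * layers * kv_heads * head_dim * bytes_per_value  # K and V

context_tokens = 4096
total = kv_per_token * context_tokens

print(kv_per_token)    # 524288 bytes: 512 KiB for every single token
print(total / 2**30)   # 2.0 GiB for one 4096-token request
```

Half a megabyte per token sounds small until you multiply it by a long context and then by the number of concurrent users.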
The Classic KV Cache Problem
Here is where traditional inference systems waste massive amounts of memory.
When a new request comes in, the system has no idea how long the output will be. Maybe the user asks for a 10-word answer. Maybe they ask for 2000 words. The system does not know in advance.
So what does it do? It pre-allocates memory for the worst case. If the maximum possible output is 2048 tokens, it reserves space for 2048 tokens every single time. Even if the actual output ends up being only 50 tokens.
This is like reserving a 20-seat table at a restaurant for a party of 2, just in case 18 more friends show up. Most of the table sits empty.
Studies found that traditional systems waste 60 to 80 percent of their GPU memory this way. That wasted memory is memory that could have been used to serve more concurrent users.
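Using the numbers from the example above, the waste of worst-case pre-allocation is easy to quantify:

```python
# How much of a worst-case reservation goes unused when the
# actual output turns out to be short.
reserved_tokens = 2048   # pre-allocated for the maximum possible output
used_tokens = 50         # what the request actually generated

waste = 1 - used_tokens / reserved_tokens
print(f"{waste:.1%}")    # 97.6% of the reservation sat idle
```

And because that reservation is held for the entire lifetime of the request, no other user can touch it even while it sits empty.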
PagedAttention: The vLLM Breakthrough
vLLM introduced a technique called PagedAttention that solves this problem beautifully. And the idea comes from a place you would not expect: how operating systems manage RAM.
Your laptop's operating system does not give a program one big continuous chunk of physical memory. That would be inefficient. Instead, it breaks memory into small fixed-size pages, usually 4 kilobytes each. When a program needs more memory, the OS hands out more pages. When the program is done, the pages go back into the pool.
This eliminates waste. Memory is allocated on demand, in small chunks, and freed when no longer needed.
vLLM applies this exact same idea to the KV cache. Instead of reserving one giant block of GPU memory per request, it breaks the KV cache into small fixed-size pages. Each request gets only the pages it currently needs. When the request grows, more pages are allocated. When the request finishes, those pages become available for other requests.
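The mechanism can be sketched as a toy allocator. This is purely illustrative, in the spirit of PagedAttention; vLLM's real block manager is far more involved, but the core idea of on-demand fixed-size pages drawn from a shared pool is the same:

```python
class PagePool:
    """Toy page-based KV cache allocator (illustrative only)."""

    def __init__(self, num_pages, tokens_per_page=16):
        self.free = list(range(num_pages))   # pool of free page indices
        self.tokens_per_page = tokens_per_page
        self.pages = {}                      # request id -> its page indices
        self.tokens = {}                     # request id -> tokens written

    def append_token(self, req_id):
        """Grow one request's KV cache by a single token."""
        used = self.tokens.get(req_id, 0)
        if used % self.tokens_per_page == 0:   # last page full, or none yet
            if not self.free:
                raise MemoryError("out of KV cache pages")
            self.pages.setdefault(req_id, []).append(self.free.pop())
        self.tokens[req_id] = used + 1

    def release(self, req_id):
        """Request finished: its pages go straight back into the pool."""
        self.free.extend(self.pages.pop(req_id, []))
        self.tokens.pop(req_id, None)

pool = PagePool(num_pages=8)
for _ in range(20):                  # a 20-token request...
    pool.append_token("req-a")
# ...occupies only ceil(20 / 16) = 2 pages, not a worst-case block.
```

A request that generates 20 tokens holds 2 small pages instead of a 2048-token reservation, and the moment it finishes, those pages are reusable by anyone else.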
The result: GPU memory utilization jumps from around 20 percent to around 95 percent. That means the same GPU can serve four to five times more concurrent users.
Why This Matters in Practice
If you are running a service, this has direct business implications:
Cost. GPUs are expensive. If you can serve 5 times more users on the same GPU, your cost per user drops by 5 times.
Speed under load. A system that handles concurrency well keeps feeling fast even when busy. A poorly designed system slows to a crawl the moment traffic picks up.
Throughput versus latency. vLLM prioritizes overall throughput (total tokens per second across all users). Individual requests might take slightly longer than on a single-user system, but the aggregate system produces far more output per second.
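The tradeoff in that last point is easiest to see with numbers. These are made-up illustrative figures, not benchmarks:

```python
# Serving one request at a time:
solo_user_rate = 60                      # the single user sees 60 tok/s
solo_throughput = solo_user_rate * 1     # system total: 60 tok/s

# Batching 8 requests on the same GPU: each stream slows a little...
batched_user_rate = 45                   # each user now sees 45 tok/s
batched_throughput = batched_user_rate * 8

print(solo_throughput)     # 60 tok/s total
print(batched_throughput)  # 360 tok/s total: 6x the work per second
```

Each user gives up a quarter of their individual speed, and in exchange the system as a whole does six times more work. That is the bargain vLLM is designed around.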
vLLM Is Not Always the Right Choice
vLLM excels at one thing: serving many users concurrently on a GPU. That is not everyone's problem.
Running a model locally on your laptop? Use llama.cpp. It is optimized for CPU and RAM, not GPU throughput.
Running on NVIDIA hardware with maximum vendor optimization? TensorRT-LLM will likely give you the absolute best performance, but it is tightly coupled to NVIDIA's stack.
Low traffic, single user? Almost any engine works. The choice matters less because the multi-user efficiency gap disappears when you only have one user.
OpenAI-Compatible API
One of the quiet but important features of vLLM: it exposes the exact same API format as OpenAI's. That means if your application already talks to the OpenAI API, you can switch to a self-hosted vLLM server by changing one line of configuration: the base URL.
from openai import OpenAI

# Before: talking to OpenAI's hosted API
client = OpenAI(
    base_url="https://api.openai.com/v1",
    api_key="sk-...",
)

# After: talking to your self-hosted vLLM server
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",  # vLLM ignores the key unless you configure one
)
No code changes. No rewriting. Just point the client at your own server.
This makes vLLM a natural migration path for teams that start with OpenAI for prototyping and then want to self-host to cut costs or keep data private.
Tuning Parameters That Actually Matter
When you run vLLM in production, two parameters determine most of the behavior:
max_model_len sets the maximum context window. If your users only send short prompts, lowering this number reduces the memory needed per request, which lets you fit more concurrent users on the same GPU.
max_num_seqs limits how many concurrent requests the system processes at once. Higher means more concurrency but more memory pressure. Lower means each user gets more individual attention but fewer people fit at once.
There is no universal "right" value for either. The right setting depends on your workload. Short prompts and many users push you toward lower max_model_len and higher max_num_seqs. Long prompts and fewer users push the opposite direction.
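That tradeoff can be sketched with rough capacity arithmetic. The figures below are illustrative assumptions, not measurements: roughly 512 KiB of KV cache per token (plausible for a 7B-class model in fp16) and 40 GiB of GPU memory left over for the cache after model weights:

```python
# Rough capacity planning for the two knobs above (assumed numbers).
kv_bytes_per_token = 512 * 1024   # ~512 KiB per token, 7B-class fp16 model
kv_budget = 40 * 2**30            # 40 GiB of GPU memory left for KV cache

def max_concurrent_seqs(max_model_len):
    """Worst case: every sequence fills its whole context window."""
    return kv_budget // (kv_bytes_per_token * max_model_len)

print(max_concurrent_seqs(4096))  # 20 full-length sequences fit
print(max_concurrent_seqs(1024))  # 80: a quarter of the context, 4x the users
```

The same arithmetic, read in the other direction, tells you how far you can raise max_num_seqs before the cache becomes the bottleneck.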
What to Monitor in Production
If you run an LLM service, the minimum set of metrics to watch:
Throughput. Total tokens per second across all requests. This is the aggregate number the whole system is optimized for.
Latency. Time to first token (how long users stare at a blank screen) and time per output token after that.
KV cache usage. How full the GPU memory pool is. When it fills up, new requests have to wait or existing ones get preempted.
Queue depth. How many requests are waiting to be scheduled. A steadily growing queue means you are over capacity.
In practice, teams usually pipe these into Prometheus and Grafana. For a single-server setup, even a basic dashboard is enough to catch the obvious problems.
The One Thing to Remember
For most of the history of LLMs, the conversation has been about which model is better. That is still important, but increasingly, the bottleneck is not the model. It is the serving infrastructure around it.
A mediocre model served well beats a great model served badly, especially when users are waiting for responses.
vLLM is not magic. It is just a carefully engineered system that took an old operating systems idea (paging) and applied it to the specific problem of KV cache management. That one insight turned a 20 percent utilization system into a 95 percent utilization system, and made it possible to economically serve LLMs to many users on modest hardware.
If you are building anything beyond a single-user demo, understanding this layer is not optional.