Last Updated: 3/9/2026

Quickstart

This guide will help you quickly get started with vLLM to perform:

Offline batched inference
Online serving using an OpenAI-compatible server

Prerequisites

OS: Linux
Python: 3.10 to 3.13
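
To confirm that an environment matches the prerequisites above before installing, a small check like the following may help. It is only a sketch: the function name meets_prerequisites is hypothetical and not part of vLLM, and the version bounds are simply copied from the list above.

```python
import platform
import sys

# Supported range copied from the prerequisites above (inclusive).
MIN_PYTHON = (3, 10)
MAX_PYTHON = (3, 13)


def meets_prerequisites(system: str, version: tuple) -> bool:
    """Return True if the OS is Linux and the (major, minor) version is in range.

    Hypothetical helper for illustration; not part of vLLM.
    """
    return system == "Linux" and MIN_PYTHON <= tuple(version) <= MAX_PYTHON


if __name__ == "__main__":
    ok = meets_prerequisites(platform.system(), sys.version_info[:2])
    print("Prerequisites met" if ok else "Unsupported OS or Python version")
```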

Installation

If you are using NVIDIA GPUs, you can install vLLM using pip directly.

It’s recommended to use uv, a very fast Python environment manager, to create and manage Python environments. After installing uv, you can create a new Python environment and install vLLM:


uv venv --python 3.12 --seed
source .venv/bin/activate
uv pip install vllm --torch-backend=auto

With --torch-backend=auto, uv inspects the installed CUDA driver version at runtime and automatically selects the appropriate PyTorch index.

Alternatively, you can use uv run with the --with [dependency] option:


uv run --with vllm vllm --help

You can also use conda to create and manage Python environments:


conda create -n myenv python=3.12 -y
conda activate myenv
pip install --upgrade uv
uv pip install vllm --torch-backend=auto

AMD GPUs

For AMD GPUs, install vLLM using uv:


uv venv --python 3.12 --seed
source .venv/bin/activate
uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm/

Note: The ROCm wheels currently support Python 3.12, ROCm 7.0, and glibc >= 2.35.

TPU

To run vLLM on Google TPUs, install the vllm-tpu package:


uv pip install vllm-tpu

For more detailed instructions, refer to the vLLM on TPU documentation.

Offline Batched Inference

With vLLM installed, you can start generating text for a list of input prompts (offline batched inference).

First, import the necessary classes:


from vllm import LLM, SamplingParams

LLM is the main class for running offline inference with the vLLM engine
SamplingParams specifies the parameters for the sampling process

Define input prompts and sampling parameters:


prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

Important: By default, vLLM uses sampling parameters recommended by the model creator from generation_config.json if it exists. To use vLLM’s default parameters, set generation_config="vllm" when creating the LLM instance.

Initialize the vLLM engine:


llm = LLM(model="facebook/opt-125m")
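
To opt out of the model's generation_config.json as described in the note above, the engine can be created with generation_config="vllm". The sketch below keeps the argument handling in a small helper so it can be inspected without loading a model; the helper names llm_kwargs and build_llm are hypothetical and not part of vLLM.

```python
def llm_kwargs(model: str, use_model_defaults: bool = True) -> dict:
    """Build keyword arguments for vllm.LLM (hypothetical helper).

    When use_model_defaults is False, generation_config="vllm" is added so
    that vLLM's own default sampling parameters are used instead of the
    ones in the model's generation_config.json.
    """
    kwargs = {"model": model}
    if not use_model_defaults:
        kwargs["generation_config"] = "vllm"
    return kwargs


def build_llm(model: str, use_model_defaults: bool = True):
    # Imported lazily so llm_kwargs can be used without vLLM installed.
    from vllm import LLM

    return LLM(**llm_kwargs(model, use_model_defaults))


if __name__ == "__main__":
    print(llm_kwargs("facebook/opt-125m", use_model_defaults=False))
```

Calling build_llm("facebook/opt-125m", use_model_defaults=False) would then be equivalent to LLM(model="facebook/opt-125m", generation_config="vllm").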

Note: By default, vLLM downloads models from HuggingFace. To use ModelScope, set the environment variable:


export VLLM_USE_MODELSCOPE=True
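
If the script is launched from an environment where exporting variables is inconvenient, the same setting can be applied from Python. This is a sketch that assumes the variable only needs to be present in the process environment before vLLM is imported; load_model is a hypothetical helper name.

```python
import os

# Set the variable before vLLM is imported (assumption: this is early enough
# for vLLM to pick it up when it decides where to download the model from).
os.environ["VLLM_USE_MODELSCOPE"] = "True"


def load_model(model: str):
    # Lazy import: vLLM is only needed once a model is actually loaded.
    from vllm import LLM

    return LLM(model=model)
```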

Generate outputs:


outputs = llm.generate(prompts, sampling_params)
 
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Note: The llm.generate method does not automatically apply the model’s chat template. For Instruct/Chat models, either:

Apply the chat template manually using the tokenizer
Use the llm.chat method with a list of messages
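
The second option above can be sketched as follows. The message format mirrors the role/content dictionaries used by the OpenAI Chat API; build_messages and chat_once are hypothetical helper names, and the model name is the one used later in this guide.

```python
def build_messages(user_text: str, system_text: str = "You are a helpful assistant.") -> list:
    """Return a conversation as a list of role/content dictionaries."""
    return [
        {"role": "system", "content": system_text},
        {"role": "user", "content": user_text},
    ]


def chat_once(user_text: str, model: str = "Qwen/Qwen2.5-1.5B-Instruct") -> str:
    # Imported lazily so build_messages can be used without vLLM installed.
    from vllm import LLM, SamplingParams

    llm = LLM(model=model)
    params = SamplingParams(temperature=0.8, top_p=0.95)
    # Unlike llm.generate, llm.chat applies the model's chat template.
    outputs = llm.chat(build_messages(user_text), sampling_params=params)
    return outputs[0].outputs[0].text


if __name__ == "__main__":
    print(build_messages("Tell me a joke."))
```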

OpenAI-Compatible Server

vLLM can be deployed as a server that implements the OpenAI API protocol. This allows vLLM to be used as a drop-in replacement for applications that use the OpenAI API.

Start the vLLM server:


vllm serve Qwen/Qwen2.5-1.5B-Instruct

By default, the server starts at http://localhost:8000. You can specify the address with the --host and --port arguments.

Note: By default, the server uses a predefined chat template stored in the tokenizer.

Important: By default, the server applies generation_config.json if it exists. To disable this, pass --generation-config vllm when launching the server.

You can pass API keys using --api-key or the VLLM_API_KEY environment variable.
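
Once the server is running with an API key, a client has to send that key with each request. The sketch below uses only the standard library and assumes the server expects the key as a Bearer token in the Authorization header, which is how OpenAI-style clients send it. The helper names and the example key are hypothetical.

```python
import json
import urllib.request


def models_request(api_key: str, host: str = "localhost", port: int = 8000) -> urllib.request.Request:
    """Build a GET request for /v1/models that carries the API key."""
    url = f"http://{host}:{port}/v1/models"
    # Assumption: the key is checked as a standard Bearer token.
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {api_key}"})


def list_models(api_key: str, host: str = "localhost", port: int = 8000) -> list:
    """Return the model IDs reported by a running server."""
    with urllib.request.urlopen(models_request(api_key, host, port)) as resp:
        payload = json.load(resp)
    return [model["id"] for model in payload["data"]]


if __name__ == "__main__":
    req = models_request("token-abc123")  # hypothetical key
    print(req.full_url, req.get_header("Authorization"))
```

list_models doubles as a quick check that the server is reachable at the chosen host and port.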

OpenAI Completions API

Query the completions endpoint:


curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "prompt": "San Francisco is a",
    "max_tokens": 7,
    "temperature": 0
  }'

Or use the OpenAI Python client:


from openai import OpenAI
 
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"
 
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)
 
completion = client.completions.create(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    prompt="San Francisco is a",
)
print("Completion result:", completion)
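
Printing the whole response object is useful for inspection, but most applications only need the generated text. The following sketch sends the same request as the curl example using only the standard library and extracts the text of the first choice. The helper names are hypothetical; with the OpenAI client, the equivalent field is completion.choices[0].text.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # default address from this guide


def completion_payload(prompt: str, max_tokens: int = 7, temperature: float = 0.0) -> dict:
    """Build the JSON body for the /v1/completions endpoint."""
    return {
        "model": "Qwen/Qwen2.5-1.5B-Instruct",
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }


def first_text(response: dict) -> str:
    """Return the generated text of the first choice in a completions response."""
    return response["choices"][0]["text"]


def complete(prompt: str) -> str:
    """POST the prompt to a running server and return the generated text."""
    req = urllib.request.Request(
        f"{BASE_URL}/completions",
        data=json.dumps(completion_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return first_text(json.load(resp))
```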

OpenAI Chat Completions API

vLLM supports the OpenAI Chat Completions API for more dynamic, interactive conversations:


curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Who won the world series in 2020?"}
    ]
  }'

Or with the Python client:


from openai import OpenAI
 
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"
 
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)
 
chat_response = client.chat.completions.create(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me a joke."},
    ],
)
print("Chat response:", chat_response)
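
For a multi-turn conversation, the assistant's reply is appended to the message list before the next user turn is sent. The sketch below works on the JSON response body (as returned by the curl call); reply_text and extend_history are hypothetical helper names, and the sample response is made up and trimmed to the fields used. With the Python client, the same text is available as chat_response.choices[0].message.content.

```python
def reply_text(response: dict) -> str:
    """Return the assistant's text from a chat completions response body."""
    return response["choices"][0]["message"]["content"]


def extend_history(history: list, response: dict, next_user_text: str) -> list:
    """Return a new message list with the reply and the next user turn appended."""
    return history + [
        {"role": "assistant", "content": reply_text(response)},
        {"role": "user", "content": next_user_text},
    ]


if __name__ == "__main__":
    history = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me a joke."},
    ]
    # Made-up response body for illustration only.
    sample = {"choices": [{"message": {"role": "assistant", "content": "(a joke)"}}]}
    for message in extend_history(history, sample, "Tell me another one."):
        print(message["role"], "->", message["content"])
```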

Attention Backends

vLLM supports multiple backends for efficient attention computation. It automatically selects the most performant backend compatible with your system.

To manually set the backend, use the --attention-backend CLI argument:


# For online serving
vllm serve Qwen/Qwen2.5-1.5B-Instruct --attention-backend FLASH_ATTN
 
# For offline inference
python script.py --attention-backend FLASHINFER

Available backend options:

NVIDIA CUDA: FLASH_ATTN or FLASHINFER
AMD ROCm: TRITON_ATTN, ROCM_ATTN, ROCM_AITER_FA, ROCM_AITER_UNIFIED_ATTN, TRITON_MLA, ROCM_AITER_MLA, or ROCM_AITER_TRITON_MLA

Note: FlashInfer is not included in pre-built wheels. Install it separately following the FlashInfer official docs.
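
When the server is launched from a script, it can help to validate the backend name before building the command line. The sketch below copies the backend names from the list above, which may not be exhaustive for every vLLM version; the BACKENDS table and the serve_command helper are hypothetical and not part of vLLM.

```python
from typing import Optional

# Backend names copied from the list above, grouped by platform.
BACKENDS = {
    "cuda": {"FLASH_ATTN", "FLASHINFER"},
    "rocm": {
        "TRITON_ATTN", "ROCM_ATTN", "ROCM_AITER_FA", "ROCM_AITER_UNIFIED_ATTN",
        "TRITON_MLA", "ROCM_AITER_MLA", "ROCM_AITER_TRITON_MLA",
    },
}


def serve_command(model: str, platform: str, backend: Optional[str] = None) -> list:
    """Build a `vllm serve` argument list, validating the backend name first."""
    cmd = ["vllm", "serve", model]
    if backend is not None:
        if backend not in BACKENDS[platform]:
            raise ValueError(f"{backend!r} is not a listed backend for {platform}")
        cmd += ["--attention-backend", backend]
    return cmd


if __name__ == "__main__":
    print(" ".join(serve_command("Qwen/Qwen2.5-1.5B-Instruct", "cuda", "FLASH_ATTN")))
```

The returned list can be passed directly to subprocess.run. Omitting the backend leaves the choice to vLLM's automatic selection.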