
Last Updated: 3/9/2026


Quickstart

This guide will help you quickly get started with vLLM to perform:

  • Offline batched inference
  • Online serving using an OpenAI-compatible server

Prerequisites

  • OS: Linux
  • Python: 3.10 to 3.13

Installation

If you are using NVIDIA GPUs, you can install vLLM using pip directly.

It’s recommended to use uv, a very fast Python environment manager, to create and manage Python environments. After installing uv, create a new Python environment and install vLLM:

uv venv --python 3.12 --seed
source .venv/bin/activate
uv pip install vllm --torch-backend=auto

With --torch-backend=auto, uv automatically selects the appropriate PyTorch index at install time by inspecting the installed CUDA driver version.
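If automatic detection is not possible (for example, inside a container where the driver is not visible), you can pin a specific index instead. This is a sketch; the cu128 value below is illustrative and should match the CUDA version your driver supports:

```shell
# Pin the PyTorch index to a specific CUDA version (cu128 is an example)
uv pip install vllm --torch-backend=cu128
```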

Alternatively, you can use uv run with --with [dependency] option:

uv run --with vllm vllm --help

You can also use conda to create and manage Python environments:

conda create -n myenv python=3.12 -y
conda activate myenv
pip install --upgrade uv
uv pip install vllm --torch-backend=auto

AMD GPUs

For AMD GPUs, install vLLM using uv:

uv venv --python 3.12 --seed
source .venv/bin/activate
uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm/

Note: The ROCm wheels currently support Python 3.12, ROCm 7.0, and glibc >= 2.35.

TPU

To run vLLM on Google TPUs, install the vllm-tpu package:

uv pip install vllm-tpu

For more detailed instructions, refer to the vLLM on TPU documentation.

Offline Batched Inference

With vLLM installed, you can start generating text for a list of input prompts (offline batched inference).

First, import the necessary classes:

from vllm import LLM, SamplingParams
  • LLM is the main class for running offline inference with the vLLM engine
  • SamplingParams specifies the parameters for the sampling process

Define input prompts and sampling parameters:

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

Important: By default, vLLM uses sampling parameters recommended by the model creator from generation_config.json if it exists. To use vLLM’s default parameters, set generation_config="vllm" when creating the LLM instance.
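As a minimal sketch, opting out of the model's generation_config.json looks like this (the model name is just the one used in this guide):

```python
from vllm import LLM

# Ignore the model's generation_config.json and use vLLM's own default
# sampling parameters instead
llm = LLM(model="facebook/opt-125m", generation_config="vllm")
```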

Initialize the vLLM engine:

llm = LLM(model="facebook/opt-125m")

Note: By default, vLLM downloads models from HuggingFace. To use ModelScope, set the environment variable:

export VLLM_USE_MODELSCOPE=True

Generate outputs:

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Note: The llm.generate method does not automatically apply the model’s chat template. For Instruct/Chat models, either:

  1. Apply the chat template manually using the tokenizer
  2. Use the llm.chat method with a list of messages
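The second option can be sketched as follows (model name and messages are illustrative; llm.chat formats the messages with the model's chat template before generating):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct")
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Tell me a joke."},
]
# llm.chat applies the chat template, then generates a response
outputs = llm.chat(messages, SamplingParams(temperature=0.8))
print(outputs[0].outputs[0].text)
```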

OpenAI-Compatible Server

vLLM can be deployed as a server that implements the OpenAI API protocol. This allows vLLM to be used as a drop-in replacement for applications built on the OpenAI API.

Start the vLLM server:

vllm serve Qwen/Qwen2.5-1.5B-Instruct

By default, it starts the server at http://localhost:8000. You can specify the address with --host and --port arguments.
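For example, to listen on all interfaces on a different port (the values shown are illustrative):

```shell
vllm serve Qwen/Qwen2.5-1.5B-Instruct --host 0.0.0.0 --port 8080
```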

Note: By default, the server uses a predefined chat template stored in the tokenizer.

Important: By default, the server applies generation_config.json if it exists. To disable this, pass --generation-config vllm when launching the server.

You can pass API keys using --api-key or the VLLM_API_KEY environment variable.
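A sketch of key-protected serving (the token value is a placeholder):

```shell
# Require a bearer token on every request
vllm serve Qwen/Qwen2.5-1.5B-Instruct --api-key token-abc123

# Clients must then authenticate, e.g.:
curl http://localhost:8000/v1/models \
    -H "Authorization: Bearer token-abc123"
```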

OpenAI Completions API

Query the completions endpoint:

curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen2.5-1.5B-Instruct",
        "prompt": "San Francisco is a",
        "max_tokens": 7,
        "temperature": 0
    }'

Or use the OpenAI Python client:

from openai import OpenAI

openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)
completion = client.completions.create(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    prompt="San Francisco is a",
)
print("Completion result:", completion)

OpenAI Chat Completions API

vLLM supports the OpenAI Chat Completions API for more dynamic, interactive conversations:

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen2.5-1.5B-Instruct",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Who won the world series in 2020?"}
        ]
    }'

Or with the Python client:

from openai import OpenAI

openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)
chat_response = client.chat.completions.create(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me a joke."},
    ],
)
print("Chat response:", chat_response)

Attention Backends

vLLM supports multiple backends for efficient attention computation. It automatically selects the most performant backend compatible with your system.

To manually set the backend, use the --attention-backend CLI argument:

# For online serving
vllm serve Qwen/Qwen2.5-1.5B-Instruct --attention-backend FLASH_ATTN

# For offline inference
python script.py --attention-backend FLASHINFER

Available backend options:

  • NVIDIA CUDA: FLASH_ATTN or FLASHINFER
  • AMD ROCm: TRITON_ATTN, ROCM_ATTN, ROCM_AITER_FA, ROCM_AITER_UNIFIED_ATTN, TRITON_MLA, ROCM_AITER_MLA, or ROCM_AITER_TRITON_MLA

Note: FlashInfer is not included in pre-built wheels. Install it separately by following the official FlashInfer docs.
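A typical install looks like the following, assuming the flashinfer-python wheel matches your CUDA and PyTorch versions (verify the package name and version constraints against the official docs):

```shell
uv pip install flashinfer-python
```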