Welcome to vLLM
Easy, fast, and cheap LLM serving for everyone
vLLM is a fast and easy-to-use library for LLM inference and serving.
Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.
Where to Get Started
Where to get started with vLLM depends on what kind of user you are. If you are looking to:
- Run open-source models on vLLM, we recommend starting with the Quickstart Guide (a minimal example follows this list)
- Build applications with vLLM, we recommend starting with the User Guide
- Build vLLM, we recommend starting with the Developer Guide
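As a first taste of the offline inference path, here is a minimal sketch using vLLM's Python API. The model name is only an example; any HuggingFace-compatible checkpoint works, and all other settings are left at their defaults.

```python
from vllm import LLM, SamplingParams

# Load a model; facebook/opt-125m is a small example checkpoint.
llm = LLM(model="facebook/opt-125m")

# Sampling settings for generation.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# generate() takes a list of prompts and batches them internally.
outputs = llm.generate(["The capital of France is"], sampling_params)

for output in outputs:
    print(output.outputs[0].text)
```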
Why vLLM is Fast
vLLM is fast with the following (a configuration sketch follows this list):
- State-of-the-art serving throughput
- Efficient management of attention key and value memory with PagedAttention
- Continuous batching of incoming requests
- Fast model execution with CUDA/HIP graphs
- Quantization: GPTQ, AWQ, INT4, INT8, and FP8
- Optimized CUDA kernels, including integration with FlashAttention and FlashInfer
- Speculative decoding
- Chunked prefill
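Several of these optimizations are exposed as engine arguments. The sketch below is illustrative rather than a recommendation: the checkpoint name is an example, and exact flag names can vary between vLLM versions.

```python
from vllm import LLM

# Illustrative configuration: an AWQ-quantized checkpoint with chunked
# prefill enabled. The checkpoint name is an example; flag names follow
# recent vLLM releases and may differ across versions.
llm = LLM(
    model="TheBloke/Llama-2-7B-AWQ",  # example AWQ-quantized checkpoint
    quantization="awq",               # select the quantization backend
    enable_chunked_prefill=True,      # interleave prefill chunks with decode steps
    gpu_memory_utilization=0.90,      # fraction of GPU memory for weights + KV cache
)
```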
Why vLLM is Flexible
vLLM is flexible and easy to use with:
- Seamless integration with popular HuggingFace models
- High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more
- Distributed inference support: tensor, pipeline, data, and expert parallelism
- Streaming outputs
- OpenAI-compatible API server (see the client sketch after this list)
- Broad hardware support: NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs and GPUs, PowerPC CPUs, Arm CPUs, and TPU
- Prefix caching support
- Multi-LoRA support
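For example, the OpenAI-compatible server can be launched with `vllm serve <model>` and then queried with the standard openai Python client. The model name below is a placeholder; port 8000 is vLLM's default, and the API key can be any string when the server runs without authentication.

```python
# Start the server first, e.g.:  vllm serve meta-llama/Llama-3.1-8B-Instruct
from openai import OpenAI

# The server speaks the OpenAI API, so the official client works unchanged.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # must match the served model
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.choices[0].message.content)
```

Passing stream=True to the same call returns the response incrementally, which is how the streaming-outputs feature above surfaces through the API.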
Learn More
For more information, check out the following:
- vLLM announcement blog post (intro to PagedAttention)
- vLLM paper (SOSP 2023)
- How continuous batching enables 23x throughput in LLM inference by Cade Daniel et al.
Community