Welcome to vLLM
Easy, fast, and cheap LLM serving for everyone
vLLM is a fast and easy-to-use library for LLM inference and serving.
Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.
Where to Get Started
Where to get started with vLLM depends on what kind of user you are. If you are looking to:
- Run open-source models on vLLM, we recommend starting with the Quickstart Guide (a minimal example follows this list)
- Build applications with vLLM, we recommend starting with the User Guide
- Build vLLM, we recommend starting with the Developer Guide
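As a first taste of the offline inference path, here is a minimal sketch using vLLM's Python API. The model name is only an example; any HuggingFace-compatible checkpoint works, and all other settings are left at their defaults.

```python
from vllm import LLM, SamplingParams

# Load a model; facebook/opt-125m is a small example checkpoint.
llm = LLM(model="facebook/opt-125m")

# Sampling settings for generation.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# generate() takes a list of prompts and batches them internally.
outputs = llm.generate(["The capital of France is"], sampling_params)

for output in outputs:
    print(output.outputs[0].text)
```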
Why vLLM is Fast
vLLM is fast with the following (a configuration sketch follows this list):
- State-of-the-art serving throughput
- Efficient management of attention key and value memory with PagedAttention
- Continuous batching of incoming requests
- Fast model execution with CUDA/HIP graphs
- Quantization: GPTQ, AWQ, INT4, INT8, and FP8
- Optimized CUDA kernels, including integration with FlashAttention and FlashInfer
- Speculative decoding
- Chunked prefill
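Several of these optimizations are exposed as engine arguments. The sketch below is illustrative rather than a recommendation: the checkpoint name is an example, and exact flag names can vary between vLLM versions.

```python
from vllm import LLM

# Illustrative configuration: an AWQ-quantized checkpoint with chunked
# prefill enabled. The checkpoint name is an example; flag names follow
# recent vLLM releases and may differ across versions.
llm = LLM(
    model="TheBloke/Llama-2-7B-AWQ",  # example AWQ-quantized checkpoint
    quantization="awq",               # select the quantization backend
    enable_chunked_prefill=True,      # interleave prefill chunks with decode steps
    gpu_memory_utilization=0.90,      # fraction of GPU memory for weights + KV cache
)
```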
Why vLLM is Flexible
vLLM is flexible and easy to use with:
- Seamless integration with popular HuggingFace models
- High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more
- Distributed inference support: tensor, pipeline, data, and expert parallelism
- Streaming outputs
- OpenAI-compatible API server (see the client sketch after this list)
- Broad hardware support: NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs and GPUs, PowerPC CPUs, Arm CPUs, and TPU
- Prefix caching support
- Multi-LoRA support
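For example, the OpenAI-compatible server can be launched with `vllm serve <model>` and then queried with the standard openai Python client. The model name below is a placeholder; port 8000 is vLLM's default, and the API key can be any string when the server runs without authentication.

```python
# Start the server first, e.g.:  vllm serve meta-llama/Llama-3.1-8B-Instruct
from openai import OpenAI

# The server speaks the OpenAI API, so the official client works unchanged.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # must match the served model
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.choices[0].message.content)
```

Passing stream=True to the same call returns the response incrementally, which is how the streaming-outputs feature above surfaces through the API.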
Learn More
For more information, check out the following:
- vLLM announcement blog post (intro to PagedAttention)
- vLLM paper (SOSP 2023)
- How continuous batching enables 23x throughput in LLM inference by Cade Daniel et al.
Community