
Welcome to vLLM

Easy, fast, and cheap LLM serving for everyone

vLLM is a fast and easy-to-use library for LLM inference and serving.

Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.

Where to Get Started

The best place to start with vLLM depends on what you want to do. If you are looking to:

  • Run open-source models on vLLM, we recommend starting with the Quickstart Guide (a minimal example follows this list)
  • Build applications with vLLM, we recommend starting with the User Guide
  • Build vLLM itself, we recommend starting with the Developer Guide
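
If you just want to see vLLM in action, the sketch below uses vLLM's offline Python API (the LLM and SamplingParams classes). The model name facebook/opt-125m is only an example; any HuggingFace model supported by vLLM can be substituted.

    from vllm import LLM, SamplingParams

    # Example prompts and sampling settings.
    prompts = ["Hello, my name is", "The capital of France is"]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    # Load a small HuggingFace model and generate completions offline.
    llm = LLM(model="facebook/opt-125m")
    outputs = llm.generate(prompts, sampling_params)

    for output in outputs:
        print(output.prompt, output.outputs[0].text)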

Why vLLM is Fast

vLLM is fast with:

  • State-of-the-art serving throughput
  • Efficient management of attention key and value memory with PagedAttention
  • Continuous batching of incoming requests
  • Fast model execution with CUDA/HIP graphs
  • Quantization: GPTQ, AWQ, INT4, INT8, and FP8 (see the sketch after this list)
  • Optimized CUDA kernels, including integration with FlashAttention and FlashInfer
  • Speculative decoding
  • Chunked prefill
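
Several of these optimizations are toggled through engine arguments. The sketch below is illustrative rather than definitive: the AWQ checkpoint name is only an example, and exact argument names can vary between vLLM versions.

    from vllm import LLM

    # A minimal sketch: load AWQ-quantized weights and enable chunked prefill.
    # The model name is an example; substitute any AWQ checkpoint you have.
    llm = LLM(
        model="TheBloke/Llama-2-7B-Chat-AWQ",
        quantization="awq",            # load AWQ-quantized weights
        enable_chunked_prefill=True,   # split long prefills into chunks
    )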

Why vLLM is Flexible

vLLM is flexible and easy to use with:

  • Seamless integration with popular HuggingFace models
  • High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more
  • Distributed inference support: tensor, pipeline, data, and expert parallelism
  • Streaming outputs
  • OpenAI-compatible API server (see the example after this list)
  • Broad hardware support: NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs and GPUs, PowerPC CPUs, Arm CPUs, and TPUs
  • Prefix caching support
  • Multi-LoRA support
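
To illustrate the OpenAI-compatible server: start it with the vllm serve command, then talk to it using the official openai Python client. The model name below is only an example, and the default port 8000 is assumed.

    # In a terminal (model name is illustrative):
    #   vllm serve Qwen/Qwen2.5-1.5B-Instruct

    from openai import OpenAI

    # vLLM's server speaks the OpenAI API; point the client at it.
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

    response = client.chat.completions.create(
        model="Qwen/Qwen2.5-1.5B-Instruct",  # must match the served model
        messages=[{"role": "user", "content": "What is vLLM?"}],
    )
    print(response.choices[0].message.content)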

Learn More

For more information, check out the following:

  • Community