
v0.5.2

Released by @github-actions on 15 Jul 18:01 · commit 4cf256a

Major Changes

  • ❗ Planned breaking change ❗: we plan to remove beam search (see #6226) in the next few releases. This release adds a warning when beam search is enabled for a request (see the sketch after this list). Please voice your concerns in the RFC if you have a valid use case for beam search in vLLM.
  • The release has moved to a Python version-agnostic wheel (#6394). A single wheel can be installed across all Python versions that vLLM supports.
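
As a minimal sketch of the kind of request the new warning targets, assuming the v0.5.2 Python API (the model name and parameter values below are illustrative only, not part of the release):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # illustrative model

# Beam search is requested through SamplingParams; as of this release the
# engine emits a deprecation warning when such a request is seen (#6402).
params = SamplingParams(
    use_beam_search=True,
    best_of=4,        # number of beams to keep
    temperature=0.0,  # beam search does not sample
    max_tokens=32,
)

outputs = llm.generate(["The capital of France is"], params)
print(outputs[0].outputs[0].text)
```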

Highlights

Model Support

  • Added PaliGemma (#5189) and initial Fuyu-8B support (#3924)

Hardware

  • AMD: unify CUDA_VISIBLE_DEVICES usage (#6352)

Performance

  • ZeroMQ fallback for broadcasting large objects (#6183)
  • Simplify code to support pipeline parallel (#6406)
  • Turn off CUTLASS scaled_mm for Ada Lovelace (#6384)
  • Use CUTLASS kernels for the FP8 layers with Bias (#6270)

Features

  • Enabling bonus token in speculative decoding for KV cache based models (#5765)
  • Medusa Implementation with Top-1 proposer (#4978)
  • An experimental vLLM CLI for serving and querying an OpenAI-compatible server (#5090); see the usage sketch after this list
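
As a hedged usage sketch: start the server with the experimental CLI, then query its OpenAI-compatible endpoint with the official openai client. The model name, port, and the exact `vllm serve` invocation are assumptions for illustration, not guarantees of this release:

```python
# Assumed server start (experimental CLI from #5090), e.g.:
#   vllm serve facebook/opt-125m
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # default local vLLM endpoint
    api_key="EMPTY",                      # no key is required by default
)

completion = client.completions.create(
    model="facebook/opt-125m",  # must match the served model
    prompt="vLLM is",
    max_tokens=32,
)
print(completion.choices[0].text)
```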

Others

  • Add support for multi-node on CI (#5955)
  • Benchmark: add H100 suite (#6047)
  • [CI/Build] Add nightly benchmarking for tgi, tensorrt-llm and lmdeploy (#5362)
  • Build some nightly wheels (#6380)

What's Changed

  • Update wheel builds to strip debug by @simon-mo in #6161
  • Fix release wheel build env var by @simon-mo in #6162
  • Move release wheel env var to Dockerfile instead by @simon-mo in #6163
  • [Doc] Reorganize Supported Models by Type by @ywang96 in #6167
  • [Doc] Move guide for multimodal model and other improvements by @DarkLight1337 in #6168
  • [Model] Add PaliGemma by @ywang96 in #5189
  • add benchmark for fix length input and output by @haichuan1221 in #5857
  • [ Misc ] Support Fp8 via llm-compressor by @robertgshaw2-neuralmagic in #6110
  • [misc][frontend] log all available endpoints by @youkaichao in #6195
  • do not exclude object field in CompletionStreamResponse by @kczimm in #6196
  • [Bugfix] FIx benchmark args for randomly sampled dataset by @haichuan1221 in #5947
  • [Kernel] reloading fused_moe config on the last chunk by @avshalomman in #6210
  • [Kernel] Correctly invoke prefill & decode kernels for cross-attention (towards eventual encoder/decoder model support) by @afeldman-nm in #4888
  • [Bugfix] use diskcache in outlines _get_guide #5436 by @ericperfect in #6203
  • [Bugfix] Mamba cache Cuda Graph padding by @tomeras91 in #6214
  • Add FlashInfer to default Dockerfile by @simon-mo in #6172
  • [hardware][cuda] use device id under CUDA_VISIBLE_DEVICES for get_device_capability by @youkaichao in #6216
  • [core][distributed] fix ray worker rank assignment by @youkaichao in #6235
  • [Bugfix][TPU] Add missing None to model input by @WoosukKwon in #6245
  • [Bugfix][TPU] Fix outlines installation in TPU Dockerfile by @WoosukKwon in #6256
  • Add support for multi-node on CI by @khluu in #5955
  • [CORE] Adding support for insertion of soft-tuned prompts by @SwapnilDreams100 in #4645
  • [Docs] Docs update for Pipeline Parallel by @andoorve in #6222
  • [Bugfix]fix and needs_scalar_to_array logic check by @qibaoyuan in #6238
  • [Speculative Decoding] Medusa Implementation with Top-1 proposer by @abhigoyal1997 in #4978
  • [core][distributed] add zmq fallback for broadcasting large objects by @youkaichao in #6183
  • [Bugfix][TPU] Add prompt adapter methods to TPUExecutor by @WoosukKwon in #6279
  • [Doc] Guide for adding multi-modal plugins by @DarkLight1337 in #6205
  • [Bugfix] Support 2D input shape in MoE layer by @WoosukKwon in #6287
  • [Bugfix] MLPSpeculator: Use ParallelLMHead in tie_weights=False case. by @tdoublep in #6303
  • [CI/Build] Enable mypy typing for remaining folders by @bmuskalla in #6268
  • [Bugfix] OpenVINOExecutor abstractmethod error by @park12sj in #6296
  • [Speculative Decoding] Enabling bonus token in speculative decoding for KV cache based models by @sroy745 in #5765
  • [Bugfix][Neuron] Fix soft prompt method error in NeuronExecutor by @WoosukKwon in #6313
  • [Doc] Remove comments incorrectly copied from another project by @daquexian in #6286
  • [Doc] Update description of vLLM support for CPUs by @DamonFool in #6003
  • [BugFix]: set outlines pkg version by @xiangyang-95 in #6262
  • [Bugfix] Fix snapshot download in serving benchmark by @ywang96 in #6318
  • [Misc] refactor(config): clean up unused code by @aniaan in #6320
  • [BugFix]: fix engine timeout due to request abort by @pushan01 in #6255
  • [Bugfix] GPTBigCodeForCausalLM: Remove lm_head from supported_lora_modules. by @tdoublep in #6326
  • [BugFix] get_and_reset only when scheduler outputs are not empty by @mzusman in #6266
  • [ Misc ] Refactor Marlin Python Utilities by @robertgshaw2-neuralmagic in #6082
  • Benchmark: add H100 suite by @simon-mo in #6047
  • [bug fix] Fix llava next feature size calculation. by @xwjiang2010 in #6339
  • [doc] update pipeline parallel in readme by @youkaichao in #6347
  • [CI/Build] Add nightly benchmarking for tgi, tensorrt-llm and lmdeploy by @KuntaiDu in #5362
  • [ BugFix ] Prompt Logprobs Detokenization by @robertgshaw2-neuralmagic in #6223
  • [Misc] Remove flashinfer warning, add flashinfer tests to CI by @LiuXiaoxuanPKU in #6351
  • [distributed][misc] keep consistent with how pytorch finds libcudart.so by @youkaichao in #6346
  • [Bugfix] Fix usage stats logging exception warning with OpenVINO by @helena-intel in #6349
  • [Model][Phi3-Small] Remove scipy from blocksparse_attention by @mgoin in #6343
  • [CI/Build] (2/2) Switching AMD CI to store images in Docker Hub by @adityagoel14 in #6350
  • [ROCm][AMD][Bugfix] unify CUDA_VISIBLE_DEVICES usage in vllm to get device count and fixed navi3x by @hongxiayang in #6352
  • [ Misc ] Remove separate bias add by @robertgshaw2-neuralmagic in #6353
  • [Misc][Bugfix] Update transformers for tokenizer issue by @ywang96 in #6364
  • [ Misc ] Support Models With Bias in compressed-tensors integration by @robertgshaw2-neuralmagic in #6356
  • [Bugfix] Fix dtype mismatch in PaliGemma by @DarkLight1337 in #6367
  • [Build/CI] Checking/Waiting for the GPU's clean state by @Alexei-V-Ivanov-AMD in #6379
  • [Misc] add fixture to guided processor tests by @kevinbu233 in #6341
  • [ci] Add grouped tests & mark tests to run by default for fastcheck pipeline by @khluu in #6365
  • [ci] Add GHA workflows to enable full CI run by @khluu in #6381
  • [MISC] Upgrade dependency to PyTorch 2.3.1 by @comaniac in #5327
  • Build some nightly wheels by default by @simon-mo in #6380
  • Fix release-pipeline.yaml by @simon-mo in #6388
  • Fix interpolation in release pipeline by @simon-mo in #6389
  • Fix release pipeline's -e flag by @simon-mo in #6390
  • [Bugfix] Fix illegal memory access in FP8 MoE kernel by @comaniac in #6382
  • [Misc] Add generated git commit hash as vllm.__commit__ by @mgoin in #6386
  • Fix release pipeline's dir permission by @simon-mo in #6391
  • [Bugfix][TPU] Fix megacore setting for v5e-litepod by @WoosukKwon in #6397
  • [ci] Fix wording for GH bot by @khluu in #6398
  • [Doc] Fix Typo in Doc by @esaliya in #6392
  • [Bugfix] Fix hard-coded value of x in context_attention_fwd by @tdoublep in #6373
  • [Docs] Clean up latest news by @WoosukKwon in #6401
  • [ci] try to add multi-node tests by @youkaichao in #6280
  • Updating LM Format Enforcer version to v10.3 by @noamgat in #6411
  • [ Misc ] More Cleanup of Marlin by @robertgshaw2-neuralmagic in #6359
  • [Misc] Add deprecation warning for beam search by @WoosukKwon in #6402
  • [ Misc ] Apply MoE Refactor to Qwen2 + Deepseekv2 To Support Fp8 by @robertgshaw2-neuralmagic in #6417
  • [Model] Initialize Fuyu-8B support by @Isotr0py in #3924
  • Remove unnecessary trailing period in spec_decode.rst by @terrytangyuan in #6405
  • [Kernel] Turn off CUTLASS scaled_mm for Ada Lovelace by @tlrmchlsmth in #6384
  • [ci][build] fix commit id by @youkaichao in #6420
  • [ Misc ] Enable Quantizing All Layers of DeekSeekv2 by @robertgshaw2-neuralmagic in #6423
  • [Feature] vLLM CLI for serving and querying OpenAI compatible server by @EthanqX in #5090
  • [Doc] xpu backend requires running setvars.sh by @rscohn2 in #6393
  • [CI/Build] Cross python wheel by @robertgshaw2-neuralmagic in #6394
  • [Bugfix] Benchmark serving script used global parameter 'args' in function 'sample_random_requests' by @lxline in #6428
  • Report usage for beam search by @simon-mo in #6404
  • Add FUNDING.yml by @simon-mo in #6435
  • [BugFix] BatchResponseData body should be optional by @zifeitong in #6345
  • [Doc] add env docs for flashinfer backend by @DefTruth in #6437
  • [core][distributed] simplify code to support pipeline parallel by @youkaichao in #6406
  • [Bugfix] Convert image to RGB by default by @DarkLight1337 in #6430
  • [doc][misc] doc update by @youkaichao in #6439
  • [VLM] Minor space optimization for ClipVisionModel by @ywang96 in #6436
  • [doc][distributed] add suggestion for distributed inference by @youkaichao in #6418
  • [Kernel] Use CUTLASS kernels for the FP8 layers with Bias by @tlrmchlsmth in #6270
  • [Misc] Use 0.0.9 version for flashinfer by @Pernekhan in #6447
  • [Bugfix] Add custom Triton cache manager to resolve MoE MP issue by @tdoublep in #6140
  • [Bugfix] use float32 precision in samplers/test_logprobs.py for comparing with HF by @tdoublep in #6409
  • bump version to v0.5.2 by @simon-mo in #6433
  • [misc][distributed] fix pp missing layer condition by @youkaichao in #6446

New Contributors

Full Changelog: v0.5.1...v0.5.2