# v0.5.0

## Highlights

### Production Features
- FP8 support is ready for testing. Quantizing a portion of the model weights to 8-bit floating point gives roughly a 1.5x boost in inference speed (see the sketch below). Please try it out and let us know your thoughts! (#5352, #5388, #5159, #5238, #5294, #5183, #5144, #5231)
- Add OpenAI Vision API support. Currently only LLaVA and LLaVA-NeXT are supported; we are working on adding more models in the next release (client example below). (#5237, #5383, #4199, #5374, #4197)
- Speculative Decoding and Automatic Prefix Caching are also ready for testing; we plan to turn them on by default in upcoming releases (example below). (#5400, #5157, #5137, #5324)
- Default to multiprocessing backend for single-node distributed case (#5230)
- Support bitsandbytes quantization and QLoRA (#4776); see the sketch below
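
A minimal sketch of enabling FP8 quantization from the offline API, assuming an FP8-capable GPU (e.g. H100); the model name is illustrative:

```python
from vllm import LLM, SamplingParams

# quantization="fp8" quantizes linear-layer weights to 8-bit floating point
# at load time; requires hardware with FP8 support (e.g. NVIDIA H100).
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", quantization="fp8")

params = SamplingParams(temperature=0.8, max_tokens=64)
outputs = llm.generate(["The capital of France is"], params)
print(outputs[0].outputs[0].text)
```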
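The server accepts the standard OpenAI Vision message format. A client-side sketch using the official `openai` package, assuming a vLLM server is already running at `localhost:8000` with a LLaVA model; the served model name and image URL are placeholders:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="llava-hf/llava-1.5-7b-hf",  # whatever model the server was started with
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this image?"},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/image.jpg"}},  # placeholder URL
        ],
    }],
)
print(response.choices[0].message.content)
```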
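A sketch of turning on both experimental features via the offline API; argument names follow the spec decode docs added in #5400, but treat them as assumptions for your installed version:

```python
from vllm import LLM

# Speculative decoding: a small draft model proposes tokens that the main
# model verifies in parallel. Spec decode currently requires block manager v2.
llm_spec = LLM(
    model="facebook/opt-6.7b",
    speculative_model="facebook/opt-125m",
    num_speculative_tokens=5,
    use_v2_block_manager=True,
)

# Automatic prefix caching: reuses KV-cache blocks across requests that share
# a common prompt prefix (e.g. a long system prompt).
llm_apc = LLM(model="facebook/opt-6.7b", enable_prefix_caching=True)
```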
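A sketch of in-flight bitsandbytes quantization; both settings below are assumed to be needed so the weights are loaded in the bitsandbytes format, and the model name is illustrative:

```python
from vllm import LLM

# Quantize weights with bitsandbytes as they are loaded; no pre-quantized
# checkpoint is required.
llm = LLM(
    model="huggyllama/llama-7b",
    quantization="bitsandbytes",
    load_format="bitsandbytes",
)
```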
### Hardware Support
- Improvements to the Intel CPU CI (#4113, #5241)
- Use TORCH_LIBRARY instead of PYBIND11_MODULE for custom ops (#5047)
### Others
- Debugging tips documentation (#5409, #5430)
- Dynamic Per-Token Activation Quantization (#5037)
- Customizable RoPE theta (#5197); example after this list
- Enable passing multiple LoRA adapters at once to generate() (#5300); example after this list
- OpenAI `tools` support named functions (#5032); example after this list
- Support `stream_options` for OpenAI protocol (#5319, #5135); example after this list
- Update Outlines Integration from `FSM` to `Guide` (#4109)
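
A sketch of overriding the RoPE base frequency (theta); the value is illustrative, e.g. for models fine-tuned with a larger theta to extend their context window:

```python
from vllm import LLM

# rope_theta overrides the base frequency used by rotary position embeddings.
llm = LLM(model="meta-llama/Llama-2-7b-hf", rope_theta=1_000_000.0)
```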
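`generate()` now accepts a list of `LoRARequest`s, one per prompt. A sketch; the adapter names, IDs, and paths are placeholders:

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True)

prompts = ["Translate to French: Hello", "Translate to German: Hello"]
# One LoRARequest per prompt: (name, unique int id, local adapter path).
loras = [
    LoRARequest("fr_adapter", 1, "/path/to/fr-lora"),
    LoRARequest("de_adapter", 2, "/path/to/de-lora"),
]
outputs = llm.generate(prompts, SamplingParams(max_tokens=32), lora_request=loras)
for out in outputs:
    print(out.outputs[0].text)
```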
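For the named-function `tools` support, a client sketch against the OpenAI-compatible server; the tool schema and served model name are placeholders:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="my-served-model",  # placeholder
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
    # Naming a specific function forces the model to produce arguments for it.
    tool_choice={"type": "function", "function": {"name": "get_weather"}},
)
print(resp.choices[0].message.tool_calls)
```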
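`stream_options` follows the OpenAI protocol: with `include_usage`, the server appends a final chunk reporting token usage for the whole stream. A sketch (served model name is a placeholder):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="my-served-model",  # placeholder
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
    stream_options={"include_usage": True},
)
for chunk in stream:
    if chunk.choices:  # the final usage-only chunk has no choices
        print(chunk.choices[0].delta.content or "", end="")
    elif chunk.usage:
        print(f"\n[usage] {chunk.usage.total_tokens} total tokens")
```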
## What's Changed
- [CI/Build] CMakeLists: build all extensions' cmake targets at the same time by @dtrifiro in #5034
- [Kernel] Refactor CUTLASS kernels to always take scales that reside on the GPU by @tlrmchlsmth in #5137
- [Kernel] Update Cutlass fp8 configs by @varun-sundar-rabindranath in #5144
- [Minor] Fix the path typo in loader.py: save_sharded_states.py -> save_sharded_state.py by @dashanji in #5151
- [Bugfix] Fix call to init_logger in openai server by @NadavShmayo in #4765
- [Feature][Kernel] Support bitsandbytes quantization and QLoRA by @chenqianfzh in #4776
- [Bugfix] Remove deprecated @abstractproperty by @zhuohan123 in #5174
- [Bugfix]: Fix issues related to prefix caching example (#5177) by @Delviet in #5180
- [BugFix] Prevent `LLM.encode` for non-generation Models by @robertgshaw2-neuralmagic in #5184
- Update test_ignore_eos by @simon-mo in #4898
- [Frontend][OpenAI] Support for returning max_model_len on /v1/models response by @Avinash-Raj in #4643
- [Kernel][ROCm][AMD] enable fused topk_softmax kernel for moe layer by @divakar-amd in #4927
- [Misc] Simplify code and fix type annotations in `conftest.py` by @DarkLight1337 in #5118
- [Core] Support image processor by @DarkLight1337 in #4197
- [Core] Remove unnecessary copies in flash attn backend by @Yard1 in #5138
- [Kernel] Pass a device pointer into the quantize kernel for the scales by @tlrmchlsmth in #5159
- [CI/BUILD] enable intel queue for longer CPU tests by @zhouyuan in #4113
- [Misc]: Implement CPU/GPU swapping in BlockManagerV2 by @Kaiyang-Chen in #3834
- New CI template on AWS stack by @khluu in #5110
- [FRONTEND] OpenAI `tools` support named functions by @br3no in #5032
- [Bugfix] Support `prompt_logprobs==0` by @toslunar in #5217
- [Bugfix] Add warmup for prefix caching example by @zhuohan123 in #5235
- [Kernel] Enhance MoE benchmarking & tuning script by @WoosukKwon in #4921
- [Bugfix]: During testing, use pytest monkeypatch for safely overriding the env var that indicates the vLLM backend by @afeldman-nm in #5210
- [Bugfix] Fix torch.compile() error when using MultiprocessingGPUExecutor by @zifeitong in #5229
- [CI/Build] Add inputs tests by @DarkLight1337 in #5215
- [Bugfix] Fix a bug caused by pip install setuptools>=49.4.0 for CPU backend by @DamonFool in #5249
- [Kernel] Add back batch size 1536 and 3072 to MoE tuning by @WoosukKwon in #5242
- [CI/Build] Simplify model loading for `HfRunner` by @DarkLight1337 in #5251
- [CI/Build] Reducing CPU CI execution time by @bigPYJ1151 in #5241
- [CI] mark AMD test as softfail to prevent blockage by @simon-mo in #5256
- [Misc] Add transformers version to collect_env.py by @mgoin in #5259
- [Misc] update collect env by @youkaichao in #5261
- [Bugfix] Fix prompt_logprobs when SamplingParams.detokenize is set to True by @zifeitong in #5226
- [Misc] Add CustomOp interface for device portability by @WoosukKwon in #5255
- [Misc] Fix docstring of get_attn_backend by @WoosukKwon in #5271
- [Frontend] OpenAI API server: Add `add_special_tokens` to ChatCompletionRequest (default False) by @tomeras91 in #5278
- [CI] Add nightly benchmarks by @simon-mo in #5260
- [misc] benchmark_serving.py -- add ITL results and tweak TPOT results by @tlrmchlsmth in #5263
- [Kernel] Add GPU architecture guards to the CUTLASS w8a8 kernels to reduce binary size by @tlrmchlsmth in #5157
- [Model] Correct Mixtral FP8 checkpoint loading by @comaniac in #5231
- [BugFix] Apply get_cached_tokenizer to the tokenizer setter of LLM by @DriverSong in #5207
- [Kernel] Re-tune Mixtral MoE configurations for FP8 on H100 by @pcmoritz in #5238
- [Docs] Add Sequoia as sponsors by @simon-mo in #5287
- [Speculative Decoding] Add `ProposerWorkerBase` abstract class by @njhill in #5252
- [BugFix] Fix log message about default max model length by @njhill in #5284
- [Bugfix] Make EngineArgs use named arguments for config construction by @mgoin in #5285
- [Bugfix][Frontend/Core] Don't log exception when AsyncLLMEngine gracefully shuts down. by @wuisawesome in #5290
- [Misc] Skip for logits_scale == 1.0 by @WoosukKwon in #5291
- [Docs] Add Ray Summit CFP by @simon-mo in #5295
- [CI] Disable flash_attn backend for spec decode by @simon-mo in #5286
- [Frontend][Core] Update Outlines Integration from `FSM` to `Guide` by @br3no in #4109
- [CI/Build] Update vision tests by @DarkLight1337 in #5307
- Bugfix: fix broken download of models from modelscope by @liuyhwangyh in #5233
- [Kernel] Retune Mixtral 8x22b configs for FP8 on H100 by @pcmoritz in #5294
- [Frontend] enable passing multiple LoRA adapters at once to generate() by @mgoldey in #5300
- [Core] Avoid copying prompt/output tokens if no penalties are used by @Yard1 in #5289
- [Core] Change LoRA embedding sharding to support loading methods by @Yard1 in #5038
- [Misc] Missing error message for custom ops import by @DamonFool in #5282
- [Feature][Frontend]: Add support for `stream_options` in `ChatCompletionRequest` by @Etelis in #5135
- [Misc][Utils] allow get_open_port to be called for multiple times by @youkaichao in #5333
- [Kernel] Switch fp8 layers to use the CUTLASS kernels by @tlrmchlsmth in #5183
- Remove Ray health check by @Yard1 in #4693
- Add missing ignored_seq_groups in _schedule_chunked_prefill by @JamesLim-sy in #5296
- [Kernel] Dynamic Per-Token Activation Quantization by @dsikka in #5037
- [Frontend] Add OpenAI Vision API Support by @ywang96 in #5237
- [Misc] Remove unused cuda_utils.h in CPU backend by @DamonFool in #5345
- fix DbrxFusedNormAttention missing cache_config by @Calvinnncy97 in #5340
- [Bug Fix] Fix the support check for FP8 CUTLASS by @cli99 in #5352
- [Misc] Add args for selecting distributed executor to benchmarks by @BKitor in #5335
- [ROCm][AMD] Use pytorch sdpa math backend to do naive attention by @hongxiayang in #4965
- [CI/Test] improve robustness of test by replacing del with context manager (hf_runner) by @youkaichao in #5347
- [CI/Test] improve robustness of test by replacing del with context manager (vllm_runner) by @youkaichao in #5357
- [Misc][Breaking] Change FP8 checkpoint format from act_scale -> input_scale by @mgoin in #5353
- [Core][CUDA Graph] add output buffer for cudagraph to reduce memory footprint by @youkaichao in #5074
- [mis][ci/test] fix flaky test in tests/test_sharded_state_loader.py by @youkaichao in #5361
- [Kernel][Misc] Use TORCH_LIBRARY instead of PYBIND11_MODULE for custom ops by @bnellnm in #5047
- [Bugfix] Fix KeyError: 1 When Using LoRA adapters by @BlackBird-Coding in #5164
- [Misc] Update to comply with the new `compressed-tensors` config by @dsikka in #5350
- [Frontend][Misc] Enforce Pixel Values as Input Type for VLMs in API Server by @ywang96 in #5374
- [misc][typo] fix typo by @youkaichao in #5372
- [Misc] Improve error message when LoRA parsing fails by @DarkLight1337 in #5194
- [Model] Initial support for LLaVA-NeXT by @DarkLight1337 in #4199
- [Feature][Frontend]: Continued `stream_options` implementation also in `CompletionRequest` by @Etelis in #5319
- [Bugfix] Fix LLaVA-NeXT by @DarkLight1337 in #5380
- [ci] Use small_cpu_queue for doc build by @khluu in #5331
- [ci] Mount buildkite agent on Docker container to upload benchmark results by @khluu in #5330
- [Docs] Add Docs on Limitations of VLM Support by @ywang96 in #5383
- [Docs] Alphabetically sort sponsors by @WoosukKwon in #5386
- Bump version to v0.5.0 by @simon-mo in #5384
- [Doc] Add documentation for FP8 W8A8 by @mgoin in #5388
- [ci] Fix Buildkite agent path by @khluu in #5392
- [Misc] Various simplifications and typing fixes by @njhill in #5368
- [Bugfix] OpenAI entrypoint limits logprobs while ignoring server defined --max-logprobs by @maor-ps in #5312
- [Bugfix][Frontend] Cleanup "fix chat logprobs" by @DarkLight1337 in #5026
- [Doc] add debugging tips by @youkaichao in #5409
- [Doc][Typo] Fixing Missing Comma by @ywang96 in #5403
- [Misc] Remove VLLM_BUILD_WITH_NEURON env variable by @WoosukKwon in #5389
- [CI] docfix by @rkooo567 in #5410
- [Speculative decoding] Initial spec decode docs by @cadedaniel in #5400
- [Doc] Add an automatic prefix caching section in vllm documentation by @KuntaiDu in #5324
- [Docs] [Spec decode] Fix docs error in code example by @cadedaniel in #5427
- [Bugfix] Fix `MultiprocessingGPUExecutor.check_health` when world_size == 1 by @jsato8094 in #5254
- [Bugfix] fix lora_dtype value type in arg_utils.py by @c3-ali in #5398
- [Frontend] Customizable RoPE theta by @sasha0552 in #5197
- [Core][Distributed] add same-node detection by @youkaichao in #5369
- [Core][Doc] Default to multiprocessing for single-node distributed case by @njhill in #5230
- [Doc] add common case for long waiting time by @youkaichao in #5430
## New Contributors
- @dtrifiro made their first contribution in #5034
- @varun-sundar-rabindranath made their first contribution in #5144
- @dashanji made their first contribution in #5151
- @chenqianfzh made their first contribution in #4776
- @Delviet made their first contribution in #5180
- @Avinash-Raj made their first contribution in #4643
- @zhouyuan made their first contribution in #4113
- @Kaiyang-Chen made their first contribution in #3834
- @khluu made their first contribution in #5110
- @toslunar made their first contribution in #5217
- @DamonFool made their first contribution in #5249
- @tomeras91 made their first contribution in #5278
- @DriverSong made their first contribution in #5207
- @mgoldey made their first contribution in #5300
- @JamesLim-sy made their first contribution in #5296
- @Calvinnncy97 made their first contribution in #5340
- @cli99 made their first contribution in #5352
- @BKitor made their first contribution in #5335
- @BlackBird-Coding made their first contribution in #5164
- @maor-ps made their first contribution in #5312
- @c3-ali made their first contribution in #5398
Full Changelog: v0.4.3...v0.5.0