
v0.4.2

Released by @github-actions on 05 May 2024 · commit c7f2cf2

Highlights

Features

  • Chunked prefill: prompt logprob support (#4309) and new documentation (#4580)
  • Speculative decoding: ngram prompt lookup decoding (#4237) and target-model logprobs (#4378); a usage sketch follows the highlights
  • Prefix caching with block manager v2 (#4142)
  • Full tensor parallelism for LoRA layers (#3524)
  • FlashInfer backend for decoding (#4353)

Models and Enhancements

  • Phi-3 support (#4298)
  • OLMo refactored for the new Hugging Face format in transformers 4.40.0 (#4324)
  • FP8 checkpoint support (dynamic and static), including MoE FP8 checkpoints for Mixtral (#4332, #4527)
  • Marlin expansion: AutoGPTQ models (#3922) and 8-bit GPTQ models (#4533) now run in Marlin

Dependency Upgrade

  • Upgrade to torch==2.3.0 (#4454); a pin check is sketched after this list
  • Upgrade to tensorizer==2.9.0 (#4467)
  • Expansion of AMD test suite (#4267)
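
The pins above can be verified from a Python environment after upgrading. This is an illustrative sketch, not part of vLLM; it relies only on the standard library.

```python
# Illustrative sketch: check the dependency pins from this release
# (torch==2.3.0, tensorizer==2.9.0) against the active environment.
from importlib.metadata import PackageNotFoundError, version

PINS = {"torch": "2.3.0", "tensorizer": "2.9.0"}

for name, expected in PINS.items():
    try:
        installed = version(name)
    except PackageNotFoundError:
        print(f"{name}: not installed (expected {expected})")
        continue
    status = "OK" if installed == expected else f"mismatch, expected {expected}"
    print(f"{name}: {installed} ({status})")
```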

Progress and Dev Experience

  • mypy typing extended across the codebase (#4337, #4427, #4450, #4555)
  • More Prometheus metrics (#2764) and regression tests for async-engine metrics (#4524)
  • Centralized, documented environment variables (#4548, #4572)
  • Smaller wheels by dropping debug symbols (#4602)
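
The speculative decoding highlight above can be exercised end to end. The sketch below assumes the engine arguments introduced around this release (speculative_model="[ngram]", num_speculative_tokens, ngram_prompt_lookup_max, use_v2_block_manager) and uses a placeholder model; check the documentation of your installed version before relying on the exact names.

```python
# Hedged sketch of ngram prompt-lookup speculative decoding (#4237).
# Argument names and the model are assumptions for illustration only.
from vllm import LLM, SamplingParams

llm = LLM(
    model="facebook/opt-6.7b",        # placeholder; any supported model
    speculative_model="[ngram]",      # draft tokens via n-gram lookup in the prompt
    num_speculative_tokens=5,         # proposal length per step
    ngram_prompt_lookup_max=4,        # largest n-gram window to match
    use_v2_block_manager=True,        # assumed requirement for spec decode here
)

outputs = llm.generate(
    ["The future of AI is"],
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```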

What's Changed

  • [Core][Distributed] use existing torch.cuda.device context manager by @youkaichao in #4318
  • [Misc] Update ShareGPT Dataset Sampling in Serving Benchmark by @ywang96 in #4279
  • [Bugfix] Fix marlin kernel crash on H100 by @alexm-nm in #4218
  • [Doc] Add note for docker user by @youkaichao in #4340
  • [Misc] Use public API in benchmark_throughput by @zifeitong in #4300
  • [Model] Adds Phi-3 support by @caiom in #4298
  • [Core] Move ray_utils.py from engine to executor package by @njhill in #4347
  • [Bugfix][Model] Refactor OLMo model to support new HF format in transformers 4.40.0 by @Isotr0py in #4324
  • [CI/Build] Adding functionality to reset the node's GPUs before processing. by @Alexei-V-Ivanov-AMD in #4213
  • [Doc] README Phi-3 name fix. by @caiom in #4372
  • [Core] Refactor aqlm quant ops by @jikunshang in #4351
  • [Mypy] Typing lora folder by @rkooo567 in #4337
  • [Misc] Optimize flash attention backend log by @esmeetu in #4368
  • [Core] Add shutdown() method to ExecutorBase by @njhill in #4349
  • [Core] Move function tracing setup to util function by @njhill in #4352
  • [ROCm][Hardware][AMD][Doc] Documentation update for ROCm by @hongxiayang in #4376
  • [Bugfix] Fix parameter name in get_tokenizer by @DarkLight1337 in #4107
  • [Frontend] Add --log-level option to api server by @normster in #4377
  • [CI] Disable non-lazy string operation on logging by @rkooo567 in #4326
  • [Core] Refactoring sampler and support prompt logprob for chunked prefill by @rkooo567 in #4309
  • [Misc][Refactor] Generalize linear_method to be quant_method by @comaniac in #4373
  • [Misc] add RFC issue template by @youkaichao in #4401
  • [Core] Introduce DistributedGPUExecutor abstract class by @njhill in #4348
  • [Kernel] Optimize FP8 support for MoE kernel / Mixtral via static scales by @pcmoritz in #4343
  • [Frontend][Bugfix] Disallow extra fields in OpenAI API by @DarkLight1337 in #4355
  • [Misc] Fix logger format typo by @esmeetu in #4396
  • [ROCm][Hardware][AMD] Enable group query attention for triton FA by @hongxiayang in #4406
  • [Kernel] Full Tensor Parallelism for LoRA Layers by @FurtherAI in #3524
  • [Model] Phi-3 4k sliding window temp. fix by @caiom in #4380
  • [Bugfix][Core] Fix get decoding config from ray by @esmeetu in #4335
  • [Bugfix] Abort requests when the connection to /v1/completions is interrupted by @chestnut-Q in #4363
  • [BugFix] Fix min_tokens when eos_token_id is None by @njhill in #4389
  • ✨ support local cache for models by @prashantgupta24 in #4374
  • [BugFix] Fix return type of executor execute_model methods by @njhill in #4402
  • [BugFix] Resolved Issues For LinearMethod --> QuantConfig by @robertgshaw2-neuralmagic in #4418
  • [Misc] fix typo in llm_engine init logging by @DefTruth in #4428
  • Add more Prometheus metrics by @ronensc in #2764 (a scrape sketch follows this list)
  • [CI] clean docker cache for neuron by @simon-mo in #4441
  • [mypy][5/N] Support all typing on model executor by @rkooo567 in #4427
  • [Kernel] Marlin Expansion: Support AutoGPTQ Models with Marlin by @robertgshaw2-neuralmagic in #3922
  • [CI] hotfix: soft fail neuron test by @simon-mo in #4458
  • [Core][Distributed] use cpu group to broadcast metadata in cpu by @youkaichao in #4444
  • [Misc] Upgrade to torch==2.3.0 by @mgoin in #4454
  • [Bugfix][Kernel] Fix compute_type for MoE kernel by @WoosukKwon in #4463
  • [Core] Refactor gptq_marlin ops by @jikunshang in #4466
  • [BugFix] fix num_lookahead_slots missing in async executor by @leiwen83 in #4165
  • [Doc] add visualization for multi-stage dockerfile by @prashantgupta24 in #4456
  • [Kernel] Support Fp8 Checkpoints (Dynamic + Static) by @robertgshaw2-neuralmagic in #4332
  • [Frontend] Support complex message content for chat completions endpoint by @fgreinacher in #3467
  • [Frontend] [Core] Tensorizer: support dynamic num_readers, update version by @alpayariyak in #4467
  • [Bugfix][Minor] Make ignore_eos effective by @bigPYJ1151 in #4468
  • Fix tokenizer snapshot download bug by @kingljl in #4493
  • Fix "Unable to find Punica extension" issue during source code installation by @kingljl in #4494
  • [Core] Centralize GPU Worker construction by @njhill in #4419
  • [Misc][Typo] type annotation fix by @HarryWu99 in #4495
  • [Misc] fix typo in block manager by @Juelianqvq in #4453
  • Allow user to define whitespace pattern for outlines by @robcaulk in #4305
  • [Misc] Add customized information for models by @jeejeelee in #4132
  • [Test] Add ignore_eos test by @rkooo567 in #4519
  • [Bugfix] Fix the fp8 kv_cache check error that occurs when failing to obtain the CUDA version. by @AnyISalIn in #4173
  • [Bugfix] Fix 307 Redirect for /metrics by @robertgshaw2-neuralmagic in #4523
  • [Doc] Update example model for OpenAI-compatible serving by @fpaupier in #4503
  • [Bugfix] Use random seed if seed is -1 by @sasha0552 in #4531
  • [CI/Build][Bugfix] VLLM_USE_PRECOMPILED should skip compilation by @tjohnson31415 in #4534
  • [Speculative decoding] Add ngram prompt lookup decoding by @leiwen83 in #4237
  • [Core] Enable prefix caching with block manager v2 enabled by @leiwen83 in #4142
  • [Core] Add multiproc_worker_utils for multiprocessing-based workers by @njhill in #4357
  • [Kernel] Update fused_moe tuning script for FP8 by @pcmoritz in #4457
  • [Bugfix] Add validation for seed by @sasha0552 in #4529
  • [Bugfix][Core] Fix and refactor logging stats by @esmeetu in #4336
  • [Core][Distributed] fix pynccl del error by @youkaichao in #4508
  • [Misc] Remove Mixtral device="cuda" declarations by @pcmoritz in #4543
  • [Misc] Fix expert_ids shape in MoE by @WoosukKwon in #4517
  • [MISC] Rework logger to enable pythonic custom logging configuration to be provided by @tdg5 in #4273
  • [Bug fix][Core] assert num_new_tokens == 1 fails when SamplingParams.n is not 1 and max_tokens is large & Add tests for preemption by @rkooo567 in #4451
  • [CI] Add regression tests to ensure the async engine generates metrics by @ronensc in #4524
  • [mypy][6/N] Fix all the core subdirectory typing by @rkooo567 in #4450
  • [Core][Distributed] enable multiple tp group by @youkaichao in #4512
  • [Kernel] Support running GPTQ 8-bit models in Marlin by @alexm-nm in #4533
  • [mypy][7/N] Cover all directories by @rkooo567 in #4555
  • [Misc] Exclude the tests directory from being packaged by @itechbear in #4552
  • [BugFix] Include target-device specific requirements.txt in sdist by @markmc in #4559
  • [Misc] centralize all usage of environment variables by @youkaichao in #4548
  • [kernel] fix sliding window in prefix prefill Triton kernel by @mmoskal in #4405
  • [CI/Build] AMD CI pipeline with extended set of tests. by @Alexei-V-Ivanov-AMD in #4267
  • [Core] Ignore infeasible swap requests. by @rkooo567 in #4557
  • [Core][Distributed] enable allreduce for multiple tp groups by @youkaichao in #4566
  • [BugFix] Prevent the task of _force_log from being garbage collected by @Atry in #4567
  • [Misc] remove chunk detected debug logs by @DefTruth in #4571
  • [Doc] add env vars to the doc by @youkaichao in #4572
  • [Core][Model runner refactoring 1/N] Refactor attn metadata term by @rkooo567 in #4518
  • [Bugfix] Allow "None" or "" to be passed to CLI for string args that default to None by @mgoin in #4586
  • Fix async chat serving by @schoennenbeck in #2727
  • [Kernel] Use flashinfer for decoding by @LiuXiaoxuanPKU in #4353
  • [Speculative decoding] Support target-model logprobs by @cadedaniel in #4378
  • [Misc] add installation time env vars by @youkaichao in #4574
  • [Misc][Refactor] Introduce ExecuteModelData by @comaniac in #4540
  • [Doc] Chunked Prefill Documentation by @rkooo567 in #4580
  • [Kernel] Support MoE Fp8 Checkpoints for Mixtral (Static Weights with Dynamic/Static Activations) by @mgoin in #4527
  • [CI] check size of the wheels by @simon-mo in #4319
  • [Bugfix] Fix inappropriate content of model_name tag in Prometheus metrics by @DearPlanet in #3937
  • bump version to v0.4.2 by @simon-mo in #4600
  • [CI] Reduce wheel size by not shipping debug symbols by @simon-mo in #4602
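
Several entries above touch the metrics surface: #2764 adds Prometheus metrics, #4523 fixes the 307 redirect on /metrics, and #4524 adds regression tests for them. As a hedged example, a running OpenAI-compatible server can be scraped as below; the host and port are assumptions (the usual default is localhost:8000).

```python
# Illustrative sketch: fetch Prometheus metrics from a running vLLM
# OpenAI-compatible server. Assumes the server listens on localhost:8000.
from urllib.request import urlopen

with urlopen("http://localhost:8000/metrics", timeout=5) as resp:
    body = resp.read().decode("utf-8")

# Keep the output short: print only the vLLM-specific metric lines.
for line in body.splitlines():
    if line.startswith("vllm"):
        print(line)
```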

New Contributors

Full Changelog: v0.4.1...v0.4.2