[MFM-2025-02-03] Merge Main to llama fp8; With Faster ROCm Paged Attention #399

Merged on Feb 3, 2025 (752 commits)

Commits
b25cfab
[V1] Avoid sending text prompt to core engine (#11963)
ywang96 Jan 12, 2025
43f3d9e
[CI/Build] Add markdown linter (#11857)
rafvasq Jan 12, 2025
f967e51
[Model] Initialize support for Deepseek-VL2 models (#11578)
Isotr0py Jan 12, 2025
8bddb73
[Hardware][CPU] Multi-LoRA implementation for the CPU backend (#11100)
Akshat-Tripathi Jan 12, 2025
263a870
[Hardware][TPU] workaround fix for MoE on TPU (#11764)
avshalomman Jan 12, 2025
9597a09
[V1][Core][1/n] Logging and Metrics (#11962)
robertgshaw2-redhat Jan 12, 2025
d14e98d
[Model] Support GGUF models newly added in `transformers` 4.46.0 (#9685)
Isotr0py Jan 13, 2025
619ae26
[V1] [2/n] Logging and Metrics - `OutputProcessor` Abstraction (#11973)
robertgshaw2-redhat Jan 13, 2025
f7b3ba8
[MISC] fix typo in kv transfer send recv test (#11983)
yyccli Jan 13, 2025
9dd02d8
[Bug] Fix usage of `.transpose()` and `.view()` consecutively. (#11979)
liaoyanqing666 Jan 13, 2025
80ea3af
[CI][Spec Decode] fix: broken test for EAGLE model (#11972)
llsj14 Jan 13, 2025
cf6bbcb
[Misc] Fix Deepseek V2 fp8 kv-scale remapping (#11947)
Concurrensee Jan 13, 2025
c3f05b0
[Misc]Minor Changes about Worker (#11555)
noemotiovon Jan 13, 2025
89ce62a
[platform] add ray_device_key (#11948)
youkaichao Jan 13, 2025
5340a30
Fix Max Token ID for Qwen-VL-Chat (#11980)
alex-jw-brooks Jan 13, 2025
0f8cafe
[Kernel] unified_attention for Attention.forward (#11967)
heheda12345 Jan 13, 2025
cd82499
[Doc][V1] Update model implementation guide for V1 support (#11998)
ywang96 Jan 13, 2025
e8c23ff
[Doc] Organise installation documentation into categories and tabs (#…
hmellor Jan 13, 2025
458e63a
[platform] add device_control env var (#12009)
youkaichao Jan 13, 2025
a7d5968
[Platform] Move get_punica_wrapper() function to Platform (#11516)
shen-shanshan Jan 13, 2025
c6db213
bugfix: Fix signature mismatch in benchmark's `get_tokenizer` functio…
e1ijah1 Jan 13, 2025
289b519
[Doc] Fix build from source and installation link in README.md (#12013)
Yikun Jan 13, 2025
ce53f46
Merge remote-tracking branch 'upstream/main'
gshtras Jan 13, 2025
5a51290
Using list
gshtras Jan 13, 2025
f35ec46
[Bugfix] Fix deepseekv3 gate bias error (#12002)
SunflowerAries Jan 13, 2025
079750e
Revert "[misc] improve memory profiling (#11809)"
gshtras Jan 13, 2025
113274a
Multi-lingual P3L (#356)
Alexei-V-Ivanov-AMD Jan 13, 2025
043c93d
Trying to make scales work with compileable attention
gshtras Jan 13, 2025
1a40125
[Docs] Add Sky Computing Lab to project intro (#12019)
WoosukKwon Jan 14, 2025
078da31
[HPU][Bugfix] set_forward_context and CI test execution (#12014)
kzawora-intel Jan 14, 2025
8a1f938
[Doc] Update Quantization Hardware Support Documentation (#12025)
tjtanaa Jan 14, 2025
ff39141
[HPU][misc] add comments for explanation (#12034)
youkaichao Jan 14, 2025
bb354e6
[Bugfix] Fix various bugs in multi-modal processor (#12031)
DarkLight1337 Jan 14, 2025
1f18adb
[Kernel] Revert the API change of Attention.forward (#12038)
heheda12345 Jan 14, 2025
2e0e017
[Platform] Add output for Attention Backend (#11981)
wangxiyuan Jan 14, 2025
a2d2acb
[Bugfix][Kernel] Give unique name to BlockSparseFlashAttention (#12040)
heheda12345 Jan 14, 2025
c9d6ff5
Explain where the engine args go when using Docker (#12041)
hmellor Jan 14, 2025
16f8680
Docs lint
gshtras Jan 14, 2025
eb4abfd
Merge remote-tracking branch 'origin/main' into upstream_merge_25_01_13
gshtras Jan 14, 2025
87054a5
[Doc]: Update the Json Example of the `Engine Arguments` document (#1…
maang-h Jan 14, 2025
5976f48
Merge pull request #358 from ROCm/upstream_merge_25_01_13
gshtras Jan 14, 2025
a3a3ee4
[Misc] Merge bitsandbytes_stacked_params_mapping and packed_modules_…
jeejeelee Jan 14, 2025
42f5e7c
[Kernel] Support MulAndSilu (#11624)
jeejeelee Jan 15, 2025
1a51b9f
[HPU][Bugfix] Don't use /dev/accel/accel0 for HPU autodetection in se…
kzawora-intel Jan 15, 2025
9ddac56
[Platform] move current_memory_usage() into platform (#11369)
shen-shanshan Jan 15, 2025
b7ee940
[V1][BugFix] Fix edge case in VLM scheduling (#12065)
WoosukKwon Jan 15, 2025
0794e74
[Misc] Add multipstep chunked-prefill support for FlashInfer (#10467)
elfiegg Jan 15, 2025
f218f9c
[core] Turn off GPU communication overlap for Ray executor (#12051)
ruisearch42 Jan 15, 2025
ad34c0d
[core] platform agnostic executor via collective_rpc (#11256)
youkaichao Jan 15, 2025
3f9b7ab
[Doc] Update examples to remove SparseAutoModelForCausalLM (#12062)
kylesayrs Jan 15, 2025
994fc65
[V1][Prefix Cache] Move the logic of num_computed_tokens into KVCache…
heheda12345 Jan 15, 2025
cbe9439
Fix: cases with empty sparsity config (#12057)
rahul-tuli Jan 15, 2025
ad388d2
Type-fix: make execute_model output type optional (#12020)
youngkent Jan 15, 2025
3adf0ff
[Platform] Do not raise error if _Backend is not found (#12023)
wangxiyuan Jan 15, 2025
97eb97b
[Model]: Support internlm3 (#12037)
RunningLeon Jan 15, 2025
5ecf3e0
Misc: allow to use proxy in `HTTPConnection` (#12042)
zhouyuan Jan 15, 2025
de0526f
[Misc][Quark] Upstream Quark format to VLLM (#10765)
kewang-xlnx Jan 15, 2025
57e729e
[Doc]: Update `OpenAI-Compatible Server` documents (#12082)
maang-h Jan 15, 2025
edce722
[Bugfix] use right truncation for non-generative tasks (#12050)
joerunde Jan 15, 2025
70755e8
[V1][Core] Autotune encoder cache budget (#11895)
ywang96 Jan 15, 2025
ebd8c66
[Bugfix] Fix _get_lora_device for HQQ marlin (#12090)
varun-sundar-rabindranath Jan 15, 2025
cd9d06f
Allow hip sources to be directly included when compiling for rocm. (#…
tvirolai-amd Jan 15, 2025
fa0050d
[Core] Default to using per_token quantization for fp8 when cutlass i…
elfiegg Jan 16, 2025
f8ef146
[Doc] Add documentation for specifying model architecture (#12105)
DarkLight1337 Jan 16, 2025
9aa1519
Various cosmetic/comment fixes (#12089)
mgoin Jan 16, 2025
dd7c9ad
[Bugfix] Remove hardcoded `head_size=256` for Deepseek v2 and v3 (#12…
Isotr0py Jan 16, 2025
bf53e0c
Support torchrun and SPMD-style offline inference (#12071)
youkaichao Jan 16, 2025
92e793d
[core] LLM.collective_rpc interface and RLHF example (#12084)
youkaichao Jan 16, 2025
874f7c2
[Bugfix] Fix max image feature size for Llava-one-vision (#12104)
ywang96 Jan 16, 2025
8bd76fb
Enable user marker for vllm profiling (#357)
Lzy17 Jan 16, 2025
5fd24ec
[misc] Add LoRA kernel micro benchmarks (#11579)
varun-sundar-rabindranath Jan 16, 2025
62b06ba
[Model] Add support for deepseek-vl2-tiny model (#12068)
Isotr0py Jan 16, 2025
c5a9406
Deepseek V3 support (#364)
gshtras Jan 16, 2025
d06e824
[Bugfix] Set enforce_eager automatically for mllama (#12127)
heheda12345 Jan 16, 2025
ebc73f2
[Bugfix] Fix a path bug in disaggregated prefill example script. (#12…
KuntaiDu Jan 17, 2025
fead53b
[CI]add genai-perf benchmark in nightly benchmark (#10704)
jikunshang Jan 17, 2025
1475847
[Doc] Add instructions on using Podman when SELinux is active (#12136)
terrytangyuan Jan 17, 2025
b8bfa46
[Bugfix] Fix issues in CPU build Dockerfile (#12135)
terrytangyuan Jan 17, 2025
d1adb9b
[BugFix] add more `is not None` check in VllmConfig.__post_init__ (#1…
heheda12345 Jan 17, 2025
d75ab55
[Misc] Add deepseek_vl2 chat template (#12143)
Isotr0py Jan 17, 2025
8027a72
[ROCm][MoE] moe tuning support for rocm (#12049)
divakar-amd Jan 17, 2025
69d765f
[V1] Move more control of kv cache initialization from model_executor…
heheda12345 Jan 17, 2025
07934cc
[Misc][LoRA] Improve the readability of LoRA error messages (#12102)
jeejeelee Jan 17, 2025
d4e6194
[CI/Build][CPU][Bugfix] Fix CPU CI (#12150)
bigPYJ1151 Jan 17, 2025
87a0c07
[core] allow callable in collective_rpc (#12151)
youkaichao Jan 17, 2025
58fd57f
[Bugfix] Fix score api for missing max_model_len validation (#12119)
wallashss Jan 17, 2025
54cacf0
[Bugfix] Mistral tokenizer encode accept list of str (#12149)
jikunshang Jan 17, 2025
b5b57e3
[AMD][FP8] Using MI300 FP8 format on ROCm for block_quant (#12134)
gshtras Jan 17, 2025
7b98a65
[torch.compile] disable logging when cache is disabled (#12043)
youkaichao Jan 17, 2025
2b83503
[misc] fix cross-node TP (#12166)
youkaichao Jan 18, 2025
c09503d
[AMD][CI/Build][Bugfix] use pytorch stale wheel (#12172)
hongxiayang Jan 18, 2025
da02cb4
[core] further polish memory profiling (#12126)
youkaichao Jan 18, 2025
813f249
[Docs] Fix broken link in SECURITY.md (#12175)
russellb Jan 18, 2025
02798ec
[Model] Port deepseek-vl2 processor, remove dependency (#12169)
Isotr0py Jan 18, 2025
6d0e3d3
[core] clean up executor class hierarchy between v1 and v0 (#12171)
youkaichao Jan 18, 2025
32eb0da
[Misc] Support register quantization method out-of-tree (#11969)
ice-tong Jan 19, 2025
7a8a48d
[V1] Collect env var for usage stats (#12115)
simon-mo Jan 19, 2025
4e94951
[BUGFIX] Move scores to float32 in case of running xgrammar on cpu (#…
madamczykhabana Jan 19, 2025
630eb5b
[Bugfix] Fix multi-modal processors for transformers 4.48 (#12187)
DarkLight1337 Jan 19, 2025
e66faf4
[torch.compile] store inductor compiled Python file (#12182)
youkaichao Jan 19, 2025
936db11
benchmark_serving support --served-model-name param (#12109)
gujingit Jan 19, 2025
edaae19
[Misc] Add BNB support to GLM4-V model (#12184)
Isotr0py Jan 19, 2025
81763c5
[V1] Add V1 support of Qwen2-VL (#12128)
ywang96 Jan 19, 2025
bbe5f9d
[Model] Support for fairseq2 Llama (#11442)
MartinGleize Jan 19, 2025
df450aa
[Bugfix] Fix num_heads value for simple connector when tp enabled (#1…
ShangmingCai Jan 20, 2025
51ef828
[torch.compile] fix sym_tensor_indices (#12191)
youkaichao Jan 20, 2025
3ea7b94
Move linting to `pre-commit` (#11975)
hmellor Jan 20, 2025
c5c0620
[DOC] Fix typo in docstring and assert message (#12194)
terrytangyuan Jan 20, 2025
d264312
[DOC] Add missing docstring in LLMEngine.add_request() (#12195)
terrytangyuan Jan 20, 2025
0974c9b
[Bugfix] Fix incorrect types in LayerwiseProfileResults (#12196)
terrytangyuan Jan 20, 2025
8360979
[Model] Add Qwen2 PRM model support (#12202)
Isotr0py Jan 20, 2025
59a0192
[Core] Interface for accessing model from `VllmRunner` (#10353)
DarkLight1337 Jan 20, 2025
5c89a29
[misc] add placeholder format.sh (#12206)
youkaichao Jan 20, 2025
4001ea1
[CI/Build] Remove dummy CI steps (#12208)
DarkLight1337 Jan 20, 2025
3127e97
[CI/Build] Make pre-commit faster (#12212)
DarkLight1337 Jan 20, 2025
b37d827
[Model] Upgrade Aria to transformers 4.48 (#12203)
DarkLight1337 Jan 20, 2025
170eb35
[misc] print a message to suggest how to bypass commit hooks (#12217)
youkaichao Jan 20, 2025
c222f47
[core][bugfix] configure env var during import vllm (#12209)
youkaichao Jan 20, 2025
5f0ec39
[V1] Remove `_get_cache_block_size` (#12214)
heheda12345 Jan 20, 2025
86bfb6d
[Misc] Pass `attention` to impl backend (#12218)
wangxiyuan Jan 20, 2025
18572e3
[Bugfix] Fix `HfExampleModels.find_hf_info` (#12223)
DarkLight1337 Jan 20, 2025
9666369
[CI] Pass local python version explicitly to pre-commit mypy.sh (#12224)
heheda12345 Jan 20, 2025
031e6eb
Merge remote-tracking branch 'upstream/main'
gshtras Jan 20, 2025
3e1cadb
Merge pull request #368 from ROCm/upstream_merge_25_01_20
gshtras Jan 20, 2025
faa1815
Using ROCm6.3.1 base docker and building hipblas-common (#366)
gshtras Jan 20, 2025
7bd3630
[Misc] Update CODEOWNERS (#12229)
ywang96 Jan 20, 2025
af69a6a
fix: update platform detection for M-series arm based MacBook process…
isikhi Jan 20, 2025
da75122
[misc] add cuda runtime version to usage data (#12190)
youkaichao Jan 21, 2025
06a760d
[bugfix] catch xgrammar unsupported array constraints (#12210)
Jason-CKY Jan 21, 2025
750f4ca
[Kernel] optimize moe_align_block_size for cuda graph and large num_e…
jinzhen-lin Jan 21, 2025
ecf6781
Add quantization and guided decoding CODEOWNERS (#12228)
mgoin Jan 21, 2025
d4b62d4
[AMD][Build] Porting dockerfiles from the ROCm/vllm fork (#11777)
gshtras Jan 21, 2025
5fe6bf2
[BugFix] Fix GGUF tp>1 when vocab_size is not divisible by 64 (#12230)
NickLucche Jan 21, 2025
2fc6944
[ci/build] disable failed and flaky tests (#12240)
youkaichao Jan 21, 2025
9691255
[Misc] Rename `MultiModalInputsV2 -> MultiModalInputs` (#12244)
DarkLight1337 Jan 21, 2025
1f1542a
[Misc]Add BNB quantization for PaliGemmaForConditionalGeneration (#1…
jeejeelee Jan 21, 2025
f2e9f2a
[Misc] Remove redundant TypeVar from base model (#12248)
DarkLight1337 Jan 21, 2025
a94eee4
[Bugfix] Fix mm_limits access for merged multi-modal processor (#12252)
DarkLight1337 Jan 21, 2025
c81081f
[torch.compile] transparent compilation with more logging (#12246)
youkaichao Jan 21, 2025
b197a5c
[V1][Bugfix] Fix data item ordering in mixed-modality inference (#12259)
ywang96 Jan 21, 2025
9a7c3a0
Remove pytorch comments for outlines + compressed-tensors (#12260)
tdoublep Jan 21, 2025
c646128
[Platform] improve platforms getattr (#12264)
MengqingCao Jan 21, 2025
3aec49e
[ci/build] update nightly torch for gh200 test (#12270)
youkaichao Jan 21, 2025
9705b90
[Bugfix] fix race condition that leads to wrong order of token return…
joennlae Jan 21, 2025
1e60f87
[Kernel] fix moe_align_block_size error condition (#12239)
jinzhen-lin Jan 21, 2025
132a132
[v1][stats][1/n] Add RequestStatsUpdate and RequestStats types (#10907)
rickyyx Jan 21, 2025
18fd4a8
[Bugfix] Multi-sequence broken (#11898)
andylolu2 Jan 21, 2025
347eeeb
[Misc] Remove experimental dep from tracing.py (#12007)
codefromthecrypt Jan 21, 2025
fa9ee08
[Misc] Set default backend to SDPA for get_vit_attn_backend (#12235)
wangxiyuan Jan 21, 2025
9c485d9
[Core] Free CPU pinned memory on environment cleanup (#10477)
janimo Jan 21, 2025
78d7d30
Update pre-commit.yml (#374)
gshtras Jan 21, 2025
2acba47
[bugfix] moe tuning. rm is_navi() (#12273)
divakar-amd Jan 21, 2025
69196a9
[BUGFIX] When skip_tokenize_init and multistep are set, execution cra…
maleksan85 Jan 21, 2025
09ccc9c
[Documentation][AMD] Add information about prebuilt ROCm vLLM docker …
hongxiayang Jan 21, 2025
df76e5a
[VLM] Simplify post-processing of replacement info (#12269)
DarkLight1337 Jan 22, 2025
64ea24d
[ci/lint] Add back default arg for pre-commit (#12279)
khluu Jan 22, 2025
016e367
[CI] add docker volume prune to neuron CI (#12291)
liangfu Jan 22, 2025
cbdc4ad
[Ci/Build] Fix mypy errors on main (#12296)
DarkLight1337 Jan 22, 2025
222a9dc
[Benchmark] More accurate TPOT calc in `benchmark_serving.py` (#12288)
njhill Jan 22, 2025
66818e5
[core] separate builder init and builder prepare for each batch (#12253)
youkaichao Jan 22, 2025
4004f14
[Build] update requirements of no-device (#12299)
MengqingCao Jan 22, 2025
68ad4e3
[Core] Support fully transparent sleep mode (#11743)
youkaichao Jan 22, 2025
cd7b6f0
[VLM] Avoid unnecessary tokenization (#12310)
DarkLight1337 Jan 22, 2025
528dbca
[Model][Bugfix]: correct Aria model output (#12309)
xffxff Jan 22, 2025
16366ee
[Bugfix][VLM] Fix mixed-modality inference backward compatibility for…
ywang96 Jan 22, 2025
6609cdf
[Doc] Add docs for prompt replacement (#12318)
DarkLight1337 Jan 22, 2025
fc66dee
[Misc] Fix the error in the tip for the --lora-modules parameter (#12…
WangErXiao Jan 22, 2025
84bee4b
[Misc] Improve the readability of BNB error messages (#12320)
jeejeelee Jan 22, 2025
b5839a1
Skip tokenize/detokenize when it is disabled by arg --skip-tokenizer-…
maleksan85 Jan 22, 2025
96f6a75
[Bugfix] Fix HPU multiprocessing executor (#12167)
kzawora-intel Jan 22, 2025
7206ce4
[Core] Support `reset_prefix_cache` (#12284)
comaniac Jan 22, 2025
aea9436
[Frontend][V1] Online serving performance improvements (#12287)
njhill Jan 22, 2025
68c4421
[AMD][Quantization] Add TritonScaledMMLinearKernel since int8 is brok…
rasmith Jan 23, 2025
a600e9f
FP8 FA fixes (#381)
ilia-cher Jan 23, 2025
5f9b40b
Returning the use of the proper stream in allreduce (#382)
gshtras Jan 23, 2025
8d7aa9d
[Bugfix] Fixing AMD LoRA CI test. (#12329)
Alexei-V-Ivanov-AMD Jan 23, 2025
01a5594
[Docs] Update FP8 KV Cache documentation (#12238)
mgoin Jan 23, 2025
7551a34
[Docs] Document vulnerability disclosure process (#12326)
russellb Jan 23, 2025
f0ef372
[V1] Add `uncache_blocks` (#12333)
comaniac Jan 23, 2025
5116274
[doc] explain common errors around torch.compile (#12340)
youkaichao Jan 23, 2025
8ae5ff2
[Hardware][Gaudi][BugFix] Fix dataclass error due to triton package u…
zhenwei-intel Jan 23, 2025
c5b4b11
[Bugfix] Fix k_proj's bias for whisper self attention (#12342)
Isotr0py Jan 23, 2025
978b45f
[Kernel] Flash Attention 3 Support (#12093)
LucasWilkinson Jan 23, 2025
d07efb3
[Doc] Troubleshooting errors during model inspection (#12351)
DarkLight1337 Jan 23, 2025
99d01a5
[V1] Simplify M-RoPE (#12352)
ywang96 Jan 23, 2025
8c01b80
[Bugfix] Fix broken internvl2 inference with v1 (#12360)
Isotr0py Jan 23, 2025
3f50c14
[core] add wake_up doc and some sanity check (#12361)
youkaichao Jan 23, 2025
6e650f5
[torch.compile] decouple compile sizes and cudagraph sizes (#12243)
youkaichao Jan 23, 2025
e97f802
[FP8][Kernel] Dynamic kv cache scaling factors computation (#11906)
gshtras Jan 23, 2025
2c85529
[TPU] Update TPU CI to use torchxla nightly on 20250122 (#12334)
lsy323 Jan 23, 2025
2cbeeda
[Docs] Document Phi-4 support (#12362)
Isotr0py Jan 23, 2025
eb5cb5e
[BugFix] Fix parameter names and `process_after_weight_loading` for W…
dsikka Jan 23, 2025
9726ad6
[Misc] Fix OpenAI API Compatibility Issues in Benchmark Script (#12357)
jsato8094 Jan 23, 2025
682b55b
[Docs] Add meetup slides (#12345)
WoosukKwon Jan 23, 2025
84f5d47
Using pytorch commit past the point when rowwise PR (https://github.c…
gshtras Jan 23, 2025
c5cffcd
[Docs] Update spec decode + structured output in compat matrix (#12373)
russellb Jan 24, 2025
24b0205
[V1][Frontend] Coalesce bunched `RequestOutput`s (#12298)
njhill Jan 24, 2025
d3d6bb1
Set weights_only=True when using torch.load() (#12366)
russellb Jan 24, 2025
5e5630a
[Bugfix] Path join when building local path for S3 clone (#12353)
omer-dayan Jan 24, 2025
55ef66e
Update compressed-tensors version (#12367)
dsikka Jan 24, 2025
0e74d79
[V1] Increase default batch size for H100/H200 (#12369)
WoosukKwon Jan 24, 2025
6dd94db
[perf] fix perf regression from #12253 (#12380)
youkaichao Jan 24, 2025
3c818bd
[Misc] Use VisionArena Dataset for VLM Benchmarking (#12389)
ywang96 Jan 24, 2025
c7c9851
[ci/build] fix wheel size check (#12396)
youkaichao Jan 24, 2025
9a0f3bd
[Hardware][Gaudi][Doc] Add missing step in setup instructions (#12382)
MohitIntel Jan 24, 2025
e784c6b
[ci/build] sync default value for wheel size (#12398)
youkaichao Jan 24, 2025
3bb8e2c
[Misc] Enable proxy support in benchmark script (#12356)
jsato8094 Jan 24, 2025
ab5bbf5
[Bugfix][Kernel] Fix CUDA 11.8 being broken by FA3 build (#12375)
LucasWilkinson Jan 24, 2025
8e87b08
Applying scales rename to fp8 config (#387)
gshtras Jan 24, 2025
df5dafa
[Misc] Remove deprecated code (#12383)
DarkLight1337 Jan 24, 2025
3132a93
[Bugfix][Kernel] FA3 Fix - RuntimeError: This flash attention build o…
LucasWilkinson Jan 24, 2025
28b1ad9
Dev-docker Documentation Updates (#378)
JArnoldAMD Jan 25, 2025
221d388
[Bugfix][Kernel] Fix moe align block issue for mixtral (#12413)
ElizaWszola Jan 25, 2025
fb30ee9
[Bugfix] Fix BLIP-2 processing (#12412)
DarkLight1337 Jan 25, 2025
bf21481
[ROCm][MoE] MI300 tuned configs Mixtral-8x(7B,22B) | fp16, fp8 (#12408)
divakar-amd Jan 25, 2025
f1fc051
[Misc] Add FA2 support to ViT MHA layer (#12355)
Isotr0py Jan 25, 2025
324960a
[TPU][CI] Update torchxla version in requirement-tpu.txt (#12422)
lsy323 Jan 25, 2025
2a0309a
[Misc][Bugfix] FA3 support to ViT MHA layer (#12435)
ywang96 Jan 26, 2025
fa63e71
[V1][Perf] Reduce scheduling overhead in model runner after cuda sync…
youngkent Jan 26, 2025
0ee349b
[V1][Bugfix] Fix assertion when mm hashing is turned off (#12439)
ywang96 Jan 26, 2025
a525527
[Misc] Revert FA on ViT #12355 and #12435 (#12445)
ywang96 Jan 26, 2025
9ddc352
[Frontend] generation_config.json for maximum tokens(#12242)
mhendrey Jan 26, 2025
aa2cd2c
[Bugfix] Disable w16a16 2of4 sparse CompressedTensors24 (#12417)
tlrmchlsmth Jan 26, 2025
72f4880
[Bugfix/CI] Fix broken kernels/test_mha.py (#12450)
tlrmchlsmth Jan 26, 2025
68f1114
[Bugfix][Kernel] Fix perf regression caused by PR #12405 (#12434)
LucasWilkinson Jan 26, 2025
72bac73
[Build/CI] Fix libcuda.so linkage (#12424)
tlrmchlsmth Jan 26, 2025
0034b09
[Frontend] Rerank API (Jina- and Cohere-compatible API) (#12376)
K-Mistele Jan 27, 2025
582cf78
[DOC] Add link to vLLM blog (#12460)
terrytangyuan Jan 27, 2025
28e0750
[V1] Avoid list creation in input preparation (#12457)
WoosukKwon Jan 27, 2025
0cc6b38
[Frontend] Support scores endpoint in run_batch (#12430)
pooyadavoodi Jan 27, 2025
5204ff5
[Bugfix] Fix Granite 3.0 MoE model loading (#12446)
DarkLight1337 Jan 27, 2025
372bf08
[Bugfix] Fix missing seq_start_loc in xformers prefill metadata (#12464)
Isotr0py Jan 27, 2025
624a1e4
[V1][Minor] Minor optimizations for update_from_output (#12454)
WoosukKwon Jan 27, 2025
ce69f7f
[Bugfix] Fix gpt2 GGUF inference (#12467)
Isotr0py Jan 27, 2025
103bd17
[Build] Only build 9.0a for scaled_mm and sparse kernels (#12339)
LucasWilkinson Jan 27, 2025
01ba927
[V1][Metrics] Add initial Prometheus logger (#12416)
markmc Jan 27, 2025
3f1fc74
[V1][CI/Test] Do basic test for top-p & top-k sampling (#12469)
WoosukKwon Jan 27, 2025
2bc3fbb
[FlashInfer] Upgrade to 0.2.0 (#11194)
abmfy Jan 27, 2025
8e6d987
Merge remote-tracking branch 'upstream/main'
gshtras Jan 27, 2025
6b2147f
Support FP8 FA from Quark format (#388)
BowenBao Jan 28, 2025
a892ecc
Merge remote-tracking branch 'origin/main' into upstream_merge_25_01_27
gshtras Jan 28, 2025
c8b8654
Direct call on ROCm
gshtras Jan 28, 2025
b2c3b22
Merge pull request #391 from ROCm/upstream_merge_25_01_27
gshtras Jan 28, 2025
7a292f9
20250127 docs update (#392)
arakowsk-amd Jan 29, 2025
273c949
Faster Custom Paged Attention kernels (#372)
sanyalington Jan 30, 2025
22141e7
Using a more precise profiling on ROCm to properly account for weight…
gshtras Jan 30, 2025
6852819
Update Dockerfile.rocm
gshtras Jan 30, 2025
6cfbe01
Merge remote-tracking branch 'origin/main' into main-to-llama-fp8
vllmellm Feb 3, 2025
b64246a
[Bugfix]: inclucde the env variables required for running FastSyncLLM
vllmellm Feb 3, 2025
64e4aa9
fix pre-commit lint
vllmellm Feb 3, 2025
7 changes: 5 additions & 2 deletions .buildkite/check-wheel-size.py
@@ -2,8 +2,11 @@
import sys
import zipfile

# Read the VLLM_MAX_SIZE_MB environment variable, defaulting to 250 MB
VLLM_MAX_SIZE_MB = int(os.environ.get('VLLM_MAX_SIZE_MB', 250))
# Read the VLLM_MAX_SIZE_MB environment variable, defaulting to 300 MiB
# Note that we have 400 MiB quota, please use it wisely.
# See https://github.com/pypi/support/issues/3792 .
# Please also sync the value with the one in Dockerfile.
VLLM_MAX_SIZE_MB = int(os.environ.get('VLLM_MAX_SIZE_MB', 300))


def print_top_10_largest_files(zip_file):
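The diff above only shows the new default and the `print_top_10_largest_files` signature; presumably the script compares the built wheel's on-disk size against `VLLM_MAX_SIZE_MB` and lists the largest archive members when the limit is exceeded. A minimal sketch of that kind of check, assuming that behavior (`check_wheel_size` and the command-line handling are illustrative additions, not part of this diff):

```python
import os
import sys
import zipfile

# Default limit matches the value set in this PR; override via the env var.
VLLM_MAX_SIZE_MB = int(os.environ.get('VLLM_MAX_SIZE_MB', 300))


def print_top_10_largest_files(zip_file):
    """Print the 10 largest members of the wheel by uncompressed size."""
    with zipfile.ZipFile(zip_file, 'r') as z:
        entries = [(info.filename, info.file_size) for info in z.infolist()]
    entries.sort(key=lambda e: e[1], reverse=True)
    for name, size in entries[:10]:
        print(f"{name}: {size / (1024 * 1024):.2f} MB")


def check_wheel_size(wheel_path):
    """Return 1 (failure) if the wheel exceeds VLLM_MAX_SIZE_MB, else 0."""
    wheel_size_mb = os.path.getsize(wheel_path) / (1024 * 1024)
    if wheel_size_mb > VLLM_MAX_SIZE_MB:
        print(f"{wheel_path} is {wheel_size_mb:.2f} MB, "
              f"over the {VLLM_MAX_SIZE_MB} MB limit.")
        print_top_10_largest_files(wheel_path)
        return 1
    print(f"{wheel_path} is {wheel_size_mb:.2f} MB, within the limit.")
    return 0


if __name__ == "__main__":
    sys.exit(check_wheel_size(sys.argv[1]))
```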
2 changes: 1 addition & 1 deletion .buildkite/nightly-benchmarks/scripts/nightly-annotate.sh
@@ -43,7 +43,7 @@ main() {



# The figures should be genereated by a separate process outside the CI/CD pipeline
# The figures should be generated by a separate process outside the CI/CD pipeline

# # generate figures
# python3 -m pip install tabulate pandas matplotlib
107 changes: 107 additions & 0 deletions .buildkite/nightly-benchmarks/scripts/run-nightly-benchmarks.sh
@@ -301,6 +301,104 @@ run_serving_tests() {
kill_gpu_processes
}

run_genai_perf_tests() {
# run genai-perf tests

# $1: a json file specifying genai-perf test cases
local genai_perf_test_file
genai_perf_test_file=$1

# Iterate over genai-perf tests
jq -c '.[]' "$genai_perf_test_file" | while read -r params; do
# get the test name, and append the GPU type back to it.
test_name=$(echo "$params" | jq -r '.test_name')

# if TEST_SELECTOR is set, only run the test cases that match the selector
if [[ -n "$TEST_SELECTOR" ]] && [[ ! "$test_name" =~ $TEST_SELECTOR ]]; then
echo "Skip test case $test_name."
continue
fi

# prepend the current serving engine to the test name
test_name=${CURRENT_LLM_SERVING_ENGINE}_${test_name}

# get common parameters
common_params=$(echo "$params" | jq -r '.common_parameters')
model=$(echo "$common_params" | jq -r '.model')
tp=$(echo "$common_params" | jq -r '.tp')
dataset_name=$(echo "$common_params" | jq -r '.dataset_name')
dataset_path=$(echo "$common_params" | jq -r '.dataset_path')
port=$(echo "$common_params" | jq -r '.port')
num_prompts=$(echo "$common_params" | jq -r '.num_prompts')
reuse_server=$(echo "$common_params" | jq -r '.reuse_server')

# get client and server arguments
server_params=$(echo "$params" | jq -r ".${CURRENT_LLM_SERVING_ENGINE}_server_parameters")
qps_list=$(echo "$params" | jq -r '.qps_list')
qps_list=$(echo "$qps_list" | jq -r '.[] | @sh')
echo "Running over qps list $qps_list"

# check if there is enough GPU to run the test
if [[ $gpu_count -lt $tp ]]; then
echo "Required num-shard $tp but only $gpu_count GPU found. Skip testcase $test_name."
continue
fi

if [[ $reuse_server == "true" ]]; then
echo "Reuse previous server for test case $test_name"
else
kill_gpu_processes
bash "$VLLM_SOURCE_CODE_LOC/.buildkite/nightly-benchmarks/scripts/launch-server.sh" \
"$server_params" "$common_params"
fi

if wait_for_server; then
echo ""
echo "$CURRENT_LLM_SERVING_ENGINE server is up and running."
else
echo ""
echo "$CURRENT_LLM_SERVING_ENGINE failed to start within the timeout period."
break
fi

# iterate over different QPS
for qps in $qps_list; do
# remove the surrounding single quote from qps
if [[ "$qps" == *"inf"* ]]; then
echo "qps was $qps"
qps=$num_prompts
echo "now qps is $qps"
fi

new_test_name=$test_name"_qps_"$qps
backend=$CURRENT_LLM_SERVING_ENGINE

if [[ "$backend" == *"vllm"* ]]; then
backend="vllm"
fi
#TODO: add output dir.
client_command="genai-perf profile \
-m $model \
--service-kind openai \
--backend vllm \
--endpoint-type chat \
--streaming \
--url localhost:$port \
--request-rate $qps \
--num-prompts $num_prompts \
"

echo "Client command: $client_command"

eval "$client_command"

#TODO: process/record outputs
done
done

kill_gpu_processes

}

prepare_dataset() {

@@ -328,12 +426,17 @@ main() {

pip install -U transformers

pip install -r requirements-dev.txt
which genai-perf

# check storage
df -h

ensure_installed wget
ensure_installed curl
ensure_installed jq
# genai-perf dependency
ensure_installed libb64-0d

prepare_dataset

@@ -345,6 +448,10 @@
# run the test
run_serving_tests "$BENCHMARK_ROOT/tests/nightly-tests.json"

# run genai-perf tests
run_genai_perf_tests "$BENCHMARK_ROOT/tests/genai-perf-tests.json"
mv artifacts/ $RESULTS_FOLDER/

# upload benchmark results to buildkite
python3 -m pip install tabulate pandas
python3 "$BENCHMARK_ROOT/scripts/summary-nightly-results.py"
23 changes: 23 additions & 0 deletions .buildkite/nightly-benchmarks/tests/genai-perf-tests.json
@@ -0,0 +1,23 @@
[
{
"test_name": "llama8B_tp1_genai_perf",
"qps_list": [4,8,16,32],
"common_parameters": {
"model": "meta-llama/Meta-Llama-3-8B-Instruct",
"tp": 1,
"port": 8000,
"num_prompts": 500,
"reuse_server": false
},
"vllm_server_parameters": {
"disable_log_stats": "",
"disable_log_requests": "",
"gpu_memory_utilization": 0.9,
"num_scheduler_steps": 10,
"max_num_seqs": 512,
"dtype": "bfloat16"
},
"genai_perf_input_parameters": {
}
}
]
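The `run_genai_perf_tests` function added above consumes entries like this one: it reads `common_parameters` and `qps_list` with `jq`, optionally reuses a running server, and issues one `genai-perf profile` call per QPS value. As a purely illustrative sketch (the CI itself uses the bash function shown earlier), the same JSON-to-command mapping could be expressed as:

```python
import json
import shlex

# Path of the test file added in this PR, relative to the repo root.
TESTS = ".buildkite/nightly-benchmarks/tests/genai-perf-tests.json"

with open(TESTS) as f:
    tests = json.load(f)

for test in tests:
    common = test["common_parameters"]
    for qps in test["qps_list"]:
        # Mirrors the client command assembled by run_genai_perf_tests.
        cmd = [
            "genai-perf", "profile",
            "-m", common["model"],
            "--service-kind", "openai",
            "--backend", "vllm",
            "--endpoint-type", "chat",
            "--streaming",
            "--url", f"localhost:{common['port']}",
            "--request-rate", str(qps),
            "--num-prompts", str(common["num_prompts"]),
        ]
        print(f"{test['test_name']} @ {qps} qps: {shlex.join(cmd)}")
```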
4 changes: 2 additions & 2 deletions .buildkite/run-cpu-test.sh
@@ -83,6 +83,6 @@ function cpu_tests() {
tests/lora/test_qwen2vl.py"
}

# All of CPU tests are expected to be finished less than 25 mins.
# All of CPU tests are expected to be finished less than 40 mins.
export -f cpu_tests
timeout 30m bash -c "cpu_tests $CORE_RANGE $NUMA_NODE"
timeout 40m bash -c "cpu_tests $CORE_RANGE $NUMA_NODE"
12 changes: 10 additions & 2 deletions .buildkite/run-hpu-test.sh
@@ -8,9 +8,17 @@ set -ex
docker build -t hpu-test-env -f Dockerfile.hpu .

# Setup cleanup
# certain versions of HPU software stack have a bug that can
# override the exit code of the script, so we need to use
# separate remove_docker_container and remove_docker_container_and_exit
# functions, while other platforms only need one remove_docker_container
# function.
EXITCODE=1
remove_docker_container() { docker rm -f hpu-test || true; }
trap remove_docker_container EXIT
remove_docker_container_and_exit() { remove_docker_container; exit $EXITCODE; }
trap remove_docker_container_and_exit EXIT
remove_docker_container

# Run the image and launch offline inference
docker run --runtime=habana --name=hpu-test --network=host -e HABANA_VISIBLE_DEVICES=all -e VLLM_SKIP_WARMUP=true --entrypoint="" hpu-test-env python3 examples/offline_inference/basic.py
docker run --runtime=habana --name=hpu-test --network=host -e HABANA_VISIBLE_DEVICES=all -e VLLM_SKIP_WARMUP=true --entrypoint="" hpu-test-env python3 examples/offline_inference/basic.py
EXITCODE=$?
5 changes: 4 additions & 1 deletion .buildkite/run-neuron-test.sh
@@ -25,8 +25,11 @@ if [ -f /tmp/neuron-docker-build-timestamp ]; then
last_build=$(cat /tmp/neuron-docker-build-timestamp)
current_time=$(date +%s)
if [ $((current_time - last_build)) -gt 86400 ]; then
# Remove dangling images (those that are not tagged and not used by any container)
docker image prune -f
docker system prune -f
# Remove unused volumes / force the system prune for old images as well.
docker volume prune -f && docker system prune -f
# Remove huggingface model artifacts and compiler cache
rm -rf "${HF_MOUNT:?}/*"
rm -rf "${NEURON_COMPILE_CACHE_MOUNT:?}/*"
echo "$current_time" > /tmp/neuron-docker-build-timestamp
31 changes: 26 additions & 5 deletions .buildkite/test-pipeline.yaml
@@ -52,7 +52,6 @@ steps:
- tests/worker
- tests/standalone_tests/lazy_torch_compile.py
commands:
- pip install git+https://github.com/Isotr0py/DeepSeek-VL2.git # Used by multimoda processing test
- python3 standalone_tests/lazy_torch_compile.py
- pytest -v -s mq_llm_engine # MQLLMEngine
- pytest -v -s async_engine # AsyncLLMEngine
@@ -77,7 +76,9 @@ steps:
- tests/basic_correctness/test_basic_correctness
- tests/basic_correctness/test_cpu_offload
- tests/basic_correctness/test_preemption
- tests/basic_correctness/test_cumem.py
commands:
- pytest -v -s basic_correctness/test_cumem.py
- pytest -v -s basic_correctness/test_basic_correctness.py
- pytest -v -s basic_correctness/test_cpu_offload.py
- VLLM_TEST_ENABLE_ARTIFICIAL_PREEMPT=1 pytest -v -s basic_correctness/test_preemption.py
@@ -107,7 +108,7 @@ steps:
source_file_dependencies:
- vllm/
commands:
- pytest -v -s entrypoints/llm --ignore=entrypoints/llm/test_lazy_outlines.py --ignore=entrypoints/llm/test_generate.py --ignore=entrypoints/llm/test_generate_multiple_loras.py --ignore=entrypoints/llm/test_guided_generate.py
- pytest -v -s entrypoints/llm --ignore=entrypoints/llm/test_lazy_outlines.py --ignore=entrypoints/llm/test_generate.py --ignore=entrypoints/llm/test_generate_multiple_loras.py --ignore=entrypoints/llm/test_guided_generate.py --ignore=entrypoints/llm/test_collective_rpc.py
- pytest -v -s entrypoints/llm/test_lazy_outlines.py # it needs a clean process
- pytest -v -s entrypoints/llm/test_generate.py # it needs a clean process
- pytest -v -s entrypoints/llm/test_generate_multiple_loras.py # it needs a clean process
@@ -126,11 +127,15 @@ steps:
- tests/distributed
- tests/spec_decode/e2e/test_integration_dist_tp4
- tests/compile
- examples/offline_inference/rlhf.py
commands:
- pytest -v -s distributed/test_utils.py
- pytest -v -s compile/test_basic_correctness.py
- pytest -v -s distributed/test_pynccl.py
- pytest -v -s spec_decode/e2e/test_integration_dist_tp4.py
# TODO: create a dedicated test section for multi-GPU example tests
# when we have multiple distributed example tests
- python3 ../examples/offline_inference/rlhf.py

- label: Metrics, Tracing Test # 10min
num_gpus: 2
@@ -178,7 +183,16 @@ steps:
- vllm/
- tests/v1
commands:
- VLLM_USE_V1=1 pytest -v -s v1
# split the test to avoid interference
- VLLM_USE_V1=1 pytest -v -s v1/core
- VLLM_USE_V1=1 pytest -v -s v1/engine
- VLLM_USE_V1=1 pytest -v -s v1/sample
- VLLM_USE_V1=1 pytest -v -s v1/worker
- VLLM_USE_V1=1 pytest -v -s v1/test_stats.py
- VLLM_USE_V1=1 pytest -v -s v1/test_utils.py
# TODO: accuracy does not match, whether setting
# VLLM_USE_FLASHINFER_SAMPLER or not on H100.
- VLLM_USE_V1=1 pytest -v -s v1/e2e

- label: Examples Test # 25min
working_dir: "/vllm-workspace/examples"
@@ -462,7 +476,10 @@ steps:
- vllm/worker/worker_base.py
- vllm/worker/worker.py
- vllm/worker/model_runner.py
- entrypoints/llm/test_collective_rpc.py
commands:
- pytest -v -s entrypoints/llm/test_collective_rpc.py
- torchrun --nproc-per-node=2 distributed/test_torchrun_example.py
- pytest -v -s ./compile/test_basic_correctness.py
- pytest -v -s ./compile/test_wrapper.py
- VLLM_TEST_SAME_HOST=1 torchrun --nproc-per-node=4 distributed/test_same_node.py | grep 'Same node test passed'
@@ -471,7 +488,9 @@ steps:
- pytest models/encoder_decoder/language/test_bart.py -v -s -m 'distributed(num_gpus=2)'
- pytest models/encoder_decoder/vision_language/test_broadcast.py -v -s -m 'distributed(num_gpus=2)'
- pytest models/decoder_only/vision_language/test_models.py -v -s -m 'distributed(num_gpus=2)'
- pytest -v -s spec_decode/e2e/test_integration_dist_tp2.py
# this test fails consistently.
# TODO: investigate and fix
# - pytest -v -s spec_decode/e2e/test_integration_dist_tp2.py
- CUDA_VISIBLE_DEVICES=0,1 pytest -v -s test_sharded_state_loader.py
- CUDA_VISIBLE_DEVICES=0,1 pytest -v -s kv_transfer/disagg_test.py

@@ -509,7 +528,9 @@ steps:
- vllm/engine
- tests/multi_step
commands:
- pytest -v -s multi_step/test_correctness_async_llm.py
# this test is quite flaky
# TODO: investigate and fix.
# - pytest -v -s multi_step/test_correctness_async_llm.py
- pytest -v -s multi_step/test_correctness_llm.py

- label: Pipeline Parallelism Test # 45min
2 changes: 1 addition & 1 deletion .buildkite/test-template.j2
@@ -27,7 +27,7 @@ steps:
depends_on:
- "amd-build"
agents:
queue: amd_rocm_gpu
queue: amd_gpu
commands:
- bash .buildkite/run-amd-test.sh "cd {{ (step.working_dir or default_working_dir) | safe }} ; {{ step.command or (step.commands | join(" && ")) | safe }}"
env:
27 changes: 15 additions & 12 deletions .github/CODEOWNERS
@@ -2,32 +2,35 @@
# for more info about CODEOWNERS file

# This lists cover the "core" components of vLLM that require careful review
/vllm/attention/backends/abstract.py @WoosukKwon @zhuohan123 @youkaichao @alexm-neuralmagic @comaniac @njhill
/vllm/core @zhuohan123 @youkaichao @alexm-neuralmagic @comaniac @njhill
/vllm/engine/llm_engine.py @zhuohan123 @youkaichao @alexm-neuralmagic @comaniac @njhill
/vllm/executor/executor_base.py @zhuohan123 @youkaichao @alexm-neuralmagic @comaniac @njhill
/vllm/worker/worker_base.py @zhuohan123 @youkaichao @alexm-neuralmagic @comaniac @njhill
/vllm/worker/worker.py @zhuohan123 @youkaichao @alexm-neuralmagic @comaniac @njhill
/vllm/model_executor/layers/sampler.py @zhuohan123 @youkaichao @alexm-neuralmagic @comaniac @njhill
/vllm/attention/backends/abstract.py @WoosukKwon @zhuohan123 @youkaichao @alexm-redhat @comaniac @njhill
/vllm/core @zhuohan123 @youkaichao @alexm-redhat @comaniac @njhill
/vllm/engine/llm_engine.py @zhuohan123 @youkaichao @alexm-redhat @comaniac @njhill
/vllm/executor/executor_base.py @zhuohan123 @youkaichao @alexm-redhat @comaniac @njhill
/vllm/worker/worker_base.py @zhuohan123 @youkaichao @alexm-redhat @comaniac @njhill
/vllm/worker/worker.py @zhuohan123 @youkaichao @alexm-redhat @comaniac @njhill
/vllm/model_executor/layers/sampler.py @zhuohan123 @youkaichao @alexm-redhat @comaniac @njhill
/vllm/model_executor/layers/quantization @mgoin @robertgshaw2-redhat @tlrmchlsmth
/vllm/model_executor/guided_decoding @mgoin
/vllm/multimodal @DarkLight1337 @ywang96
CMakeLists.txt @tlrmchlsmth

# vLLM V1
/vllm/v1 @WoosukKwon @robertgshaw2-neuralmagic @njhill @ywang96 @comaniac @alexm-neuralmagic
/vllm/v1 @WoosukKwon @robertgshaw2-redhat @njhill @ywang96 @comaniac @alexm-redhat

# Test ownership
/tests/async_engine @njhill @robertgshaw2-neuralmagic @simon-mo
/tests/async_engine @njhill @robertgshaw2-redhat @simon-mo
/tests/test_inputs.py @DarkLight1337 @ywang96
/tests/entrypoints @DarkLight1337 @robertgshaw2-neuralmagic @simon-mo
/tests/entrypoints @DarkLight1337 @robertgshaw2-redhat @simon-mo
/tests/models @DarkLight1337 @ywang96
/tests/multimodal @DarkLight1337 @ywang96
/tests/prefix_caching @comaniac @KuntaiDu
/tests/spec_decode @njhill @LiuXiaoxuanPKU
/tests/kernels @tlrmchlsmth @WoosukKwon
/tests/quantization @mgoin @robertgshaw2-neuralmagic
/tests/quantization @mgoin @robertgshaw2-redhat
/.buildkite/lm-eval-harness @mgoin @simon-mo
/tests/distributed/test_multi_node_assignment.py @youkaichao
/tests/distributed/test_pipeline_parallel.py @youkaichao
/tests/distributed/test_same_node.py @youkaichao
/tests/multi_step @alexm-neuralmagic @comaniac
/tests/multi_step @alexm-redhat @comaniac
/tests/weight_loading @mgoin @youkaichao
/tests/basic_correctness/test_chunked_prefill @rkooo567 @comaniac