merge from main #1

Merged
merged 174 commits into from
Nov 26, 2024

Changes from all commits (174 commits)

15bb833
[Bugfix] Fix tensor parallel for qwen2 classification model (#10297)
Isotr0py Nov 14, 2024
504ac53
[misc] error early for old-style class (#10304)
youkaichao Nov 14, 2024
e0853b6
[Misc] format.sh: Simplify tool_version_check (#10305)
russellb Nov 14, 2024
f67ce05
[Frontend] Pythonic tool parser (#9859)
mdepinet Nov 14, 2024
52b48c1
[BugFix]: properly deserialize `tool_calls` iterator before processin…
gcalmettes Nov 14, 2024
294bf46
[Model] Add BNB quantization support for Idefics3 (#10310)
B-201 Nov 14, 2024
29f3ef2
[ci][distributed] disable hanging tests (#10317)
youkaichao Nov 14, 2024
03025c0
[CI/Build] Fix CPU CI online inference timeout (#10314)
Isotr0py Nov 14, 2024
675d603
[CI/Build] Make shellcheck happy (#10285)
DarkLight1337 Nov 14, 2024
1dbae03
[Docs] Publish meetup slides (#10331)
WoosukKwon Nov 14, 2024
4a18fd1
Support Roberta embedding models (#9387)
maxdebayser Nov 14, 2024
b2e0ad3
[Perf] Reduce peak memory usage of llama (#10339)
andoorve Nov 15, 2024
554af92
[Bugfix] use AF_INET6 for OpenAI Compatible Server with ipv6 (#9583)
jxpxxzj Nov 15, 2024
11cd1ae
[Tool parsing] Improve / correct mistral tool parsing (#10333)
patrickvonplaten Nov 15, 2024
972112d
[Bugfix] Fix unable to load some models (#10312)
DarkLight1337 Nov 15, 2024
bf2ddc6
[bugfix] Fix static asymmetric quantization case (#10334)
ProExpertProg Nov 15, 2024
2885ba0
[Misc] Change RedundantReshapesPass and FusionPass logging from info …
tlrmchlsmth Nov 15, 2024
b40cf64
[Model] Support Qwen2 embeddings and use tags to select model tests (…
DarkLight1337 Nov 15, 2024
2ec8827
[Bugfix] Qwen-vl output is inconsistent in speculative decoding (#10…
skylee-01 Nov 15, 2024
2ac6d0e
[Misc] Consolidate pooler config overrides (#10351)
DarkLight1337 Nov 15, 2024
02dbf30
[Build] skip renaming files for release wheels pipeline (#9671)
simon-mo Nov 15, 2024
3d158cd
Add default value to avoid Falcon crash (#5363) (#10347)
wchen61 Nov 15, 2024
b311efd
[Misc] Fix import error in tensorizer tests and cleanup some code (#1…
DarkLight1337 Nov 15, 2024
2690855
[Doc] Remove float32 choice from --lora-dtype (#10348)
xyang16 Nov 15, 2024
1d65ec7
[Bugfix] Fix fully sharded LoRA bug (#10352)
jeejeelee Nov 15, 2024
f2056f7
[Misc] Fix some help info of arg_utils to improve readability (#10362)
ShangmingCai Nov 15, 2024
3a763ba
[core][misc] keep compatibility for old-style classes (#10356)
youkaichao Nov 15, 2024
691a3ec
[Bugfix] Ensure special tokens are properly filtered out for guided s…
gcalmettes Nov 15, 2024
79ee45b
[Misc] Bump up test_fused_moe tolerance (#10364)
ElizaWszola Nov 15, 2024
a6221a1
[Misc] bump mistral common version (#10367)
simon-mo Nov 15, 2024
c76ac49
[Docs] Add Nebius as sponsors (#10371)
simon-mo Nov 15, 2024
a067f85
[Frontend] Add --version flag to CLI (#10369)
russellb Nov 15, 2024
3e8d14d
[Doc] Move PR template content to docs (#10159)
russellb Nov 15, 2024
4f168f6
[Docs] Misc updates to TPU installation instructions (#10165)
mikegre-google Nov 15, 2024
32e46e0
[Frontend] Automatic detection of chat content format from AST (#9919)
DarkLight1337 Nov 16, 2024
755b853
[doc] add doc for the plugin system (#10372)
youkaichao Nov 16, 2024
2f427c2
[misc][plugin] improve log messages (#10386)
youkaichao Nov 16, 2024
1d75472
[BugFix] [Kernel] Fix GPU SEGV occuring in fused_moe kernel (#10385)
rasmith Nov 16, 2024
8b6725b
[Misc] Update benchmark to support image_url file or http (#10287)
kakao-steve-ai Nov 16, 2024
b98d89e
[Misc] Medusa supports custom bias (#10361)
skylee-01 Nov 16, 2024
361c29e
[Bugfix] Fix M-RoPE position calculation when chunked prefill is enab…
imkero Nov 16, 2024
661a34f
[V1] Add code owners for V1 (#10397)
WoosukKwon Nov 16, 2024
4fd9375
[2/N][torch.compile] make compilation cfg part of vllm cfg (#10383)
youkaichao Nov 17, 2024
643ecf7
[V1] Refactor model executable interface for all text-only language m…
ywang96 Nov 17, 2024
905d0f0
[CI/Build] Fix IDC hpu [Device not found] issue (#10384)
xuechendi Nov 17, 2024
cf349c4
[Bugfix][CPU] Fix CPU embedding runner with tensor parallel (#10394)
Isotr0py Nov 17, 2024
8d74b5a
[platforms] refactor cpu code (#10402)
youkaichao Nov 17, 2024
76aab90
[Hardware] [HPU]add `mark_step` for hpu (#10239)
jikunshang Nov 17, 2024
80d85c5
[Bugfix] Fix mrope_position_delta in non-last prefill chunk (#10403)
imkero Nov 17, 2024
d1557e6
[Misc] Enhance offline_inference to support user-configurable paramet…
wchen61 Nov 17, 2024
c4e4643
[Misc] Add uninitialized params tracking for `AutoWeightsLoader` (#10…
Isotr0py Nov 18, 2024
47826ca
[Bugfix] Ignore ray reinit error when current platform is ROCm or XPU…
HollowMan6 Nov 18, 2024
51bb12d
[4/N][torch.compile] clean up set_torch_compile_backend (#10401)
youkaichao Nov 18, 2024
c7dec92
[VLM] Report multi_modal_placeholders in output (#10407)
lk-chen Nov 18, 2024
01aae1c
[Model] Remove redundant softmax when using PoolingType.STEP (#10415)
Maybewuss Nov 18, 2024
5be4e52
[Model][LoRA]LoRA support added for glm-4v (#10418)
B-201 Nov 18, 2024
e7ebb66
[Model] Remove transformers attention porting in VITs (#10414)
Isotr0py Nov 18, 2024
4186be8
[Doc] Update doc for LoRA support in GLM-4V (#10425)
B-201 Nov 18, 2024
7851b45
[5/N][torch.compile] torch.jit.script --> torch.compile (#10406)
youkaichao Nov 18, 2024
31894a2
[Doc] Add documentation for Structured Outputs (#9943)
ismael-dm Nov 18, 2024
4f686d1
Fix open_collective value in FUNDING.yml (#10426)
andrew Nov 18, 2024
281cc4b
[Model][Bugfix] Support TP for PixtralHF ViT (#10405)
mgoin Nov 18, 2024
6b2d25e
[Hardware][XPU] AWQ/GPTQ support for xpu backend (#10107)
yma11 Nov 18, 2024
c2170a5
[Kernel] Explicitly specify other value in tl.load calls (#9014)
angusYuhao Nov 18, 2024
96d999f
[Kernel] Initial Machete W4A8 support + Refactors (#9855)
LucasWilkinson Nov 18, 2024
a03ea40
[3/N][torch.compile] consolidate custom op logging (#10399)
youkaichao Nov 18, 2024
2298e69
[ci][bugfix] fix kernel tests (#10431)
youkaichao Nov 18, 2024
90a6c75
[misc] partial prefix & random input generation benchmark (#9929)
rickyyx Nov 18, 2024
284203f
[ci/build] Have dependabot ignore all patch update (#10436)
khluu Nov 19, 2024
7eb719d
[Bugfix]Fix Phi-3 BNB online quantization (#10417)
jeejeelee Nov 19, 2024
8c1fb50
[Platform][Refactor] Extract func `get_default_attn_backend` to `Plat…
MengqingCao Nov 19, 2024
74f8c2c
Add openai.beta.chat.completions.parse example to structured_outputs.…
mgoin Nov 19, 2024
272e31c
[Bugfix] Guard for negative counter metrics to prevent crash (#10430)
tjohnson31415 Nov 19, 2024
382b6a4
[Misc] Avoid misleading warning messages (#10438)
jeejeelee Nov 19, 2024
5390d66
[Doc] Add the start of an arch overview page (#10368)
russellb Nov 19, 2024
25f9c78
[misc][plugin] improve plugin loading (#10443)
youkaichao Nov 19, 2024
b461465
[CI][CPU] adding numa node number as container name suffix (#10441)
zhouyuan Nov 19, 2024
f028dff
[BugFix] Fix hermes tool parser output error stream arguments in some…
xiyuan-lee Nov 19, 2024
11fd7ea
[Pixtral-Large] Pixtral actually has no bias in vision-lang adapter (…
patrickvonplaten Nov 19, 2024
1ea291a
Fix: Build error seen on Power Architecture (#10421)
mikejuliet13 Nov 19, 2024
fd9f124
[Doc] fix link for page that was renamed (#10455)
russellb Nov 19, 2024
803f37e
[6/N] torch.compile rollout to users (#10437)
youkaichao Nov 19, 2024
efa9084
[Core] Avoid metrics log noise when idle (#8868)
russellb Nov 19, 2024
b00b33d
[Model][Quantization] HQQ support through Marlin kernel expansion (#9…
ElizaWszola Nov 19, 2024
a324d3a
Change granite chat template to keep json list formatting for tool ca…
maxdebayser Nov 20, 2024
d5b68ab
[CI/Build] Update Dockerfile.rocm (#10434)
Alexei-V-Ivanov-AMD Nov 20, 2024
d200972
[Bugfix] Marlin 2:4 temp fix for large M dim (>256) (#10464)
LucasWilkinson Nov 20, 2024
9e05252
[Misc] Add __setitem__ for LazyDict (#10469)
liuyanyi Nov 20, 2024
ad44437
[Bugfix] Fix Mamba model initialization and MLP Speculator weights lo…
Isotr0py Nov 20, 2024
b4be5a8
[Bugfix] Enforce no chunked prefill for embedding models (#10470)
DarkLight1337 Nov 20, 2024
709c9f1
[CI/Build] Add sphinx/rst linter for docs (#10366)
rafvasq Nov 20, 2024
7629a9c
[CI/Build] Support compilation with local cutlass path (#10423) (#10424)
wchen61 Nov 20, 2024
ed701ca
[ci/build] Combine nightly and optional (#10465)
khluu Nov 20, 2024
343041c
[model] Reduce medusa weight (#10454)
skylee-01 Nov 20, 2024
09dbf9f
[Bugfix] Handle conflicts between modern and legacy fields (#10471)
DarkLight1337 Nov 20, 2024
d5b2844
[Platforms] Refactor xpu code (#10468)
MengqingCao Nov 20, 2024
63f1fde
[Hardware][CPU] Support chunked-prefill and prefix-caching on CPU (#1…
bigPYJ1151 Nov 20, 2024
772a667
[platforms] restore xpu check for parallel config (#10479)
youkaichao Nov 20, 2024
5f1d6af
[perf bench] H200 development (#9768)
simon-mo Nov 20, 2024
0cd3d97
[7/N] torch.compile, reduce compilation time (#10460)
youkaichao Nov 20, 2024
c68f7ed
[Bugfix]: allow extra fields in requests to openai compatible server …
gcalmettes Nov 20, 2024
2f77b6c
[TPU] Implement prefix caching for TPUs (#10307)
WoosukKwon Nov 20, 2024
388ee3d
[torch.compile] limit inductor threads and lazy import quant (#10482)
youkaichao Nov 21, 2024
6c1208d
[Core] Add Sliding Window Support with Flashinfer (#10462)
pavanimajety Nov 21, 2024
9d82717
[Platforms] Add `device_type` in `Platform` (#10508)
MengqingCao Nov 21, 2024
8b0fe06
[torch.compile] Inductor code caching fix (#10273)
ProExpertProg Nov 21, 2024
3430857
[Misc] Increase default video fetch timeout (#10495)
DarkLight1337 Nov 21, 2024
aaddce5
[platforms] improve error message for unspecified platforms (#10520)
youkaichao Nov 21, 2024
f0e0238
[Doc] fix a small typo in docstring of llama_tool_parser (#10513)
FerdinandZhong Nov 21, 2024
1cfde82
[Model] Add Support for Multimodal Granite Models (#10291)
alex-jw-brooks Nov 21, 2024
8a93a59
fix the issue that len(tokenizer(prompt)["input_ids"]) > prompt_len (…
sywangyi Nov 21, 2024
d5ec121
[Model] Expose `dynamic_image_size` as mm_processor_kwargs for Intern…
Isotr0py Nov 21, 2024
4d676f0
[Bugfix] Embedding model pooling_type equals ALL and multi input's bu…
BBuf Nov 21, 2024
da7e702
[Bug]: When apply continue_final_message for OpenAI server, the "echo…
chaunceyjiang Nov 21, 2024
2385b60
[Kernel] Register punica ops directly (#10522)
jeejeelee Nov 21, 2024
c51e397
[Misc] Suppress duplicated logging regarding multimodal input pipelin…
ywang96 Nov 21, 2024
e7a8341
[Bugfix] Allow token ID-only inputs in Qwen2-Audio (#10536)
DarkLight1337 Nov 21, 2024
7560ae5
[8/N] enable cli flag without a space (#10529)
youkaichao Nov 21, 2024
f9310cb
[V1] Fix Compilation config & Enable CUDA graph by default (#10528)
WoosukKwon Nov 21, 2024
edec338
[CI][Installation] Avoid uploading CUDA 11.8 wheel (#10535)
cermeng Nov 21, 2024
cf656f5
[misc] improve error message (#10553)
youkaichao Nov 21, 2024
46fe9b4
[Minor] Revert change in offline inference example (#10545)
WoosukKwon Nov 21, 2024
9afa014
Add small example to metrics.rst (#10550)
mgoin Nov 21, 2024
aed0748
[Benchmark] Add new H100 machine (#10547)
simon-mo Nov 22, 2024
33e0a25
[9/N] torch.compile LLM usage (#10552)
youkaichao Nov 22, 2024
446c780
[Minor] Fix line-too-long (#10563)
WoosukKwon Nov 22, 2024
a111d01
[platforms] absorb worker cls difference into platforms folder (#10555)
youkaichao Nov 22, 2024
b6374e0
[Bugfix] Fix Phi-3 BNB quantization with tensor parallel (#9948)
Isotr0py Nov 22, 2024
11fcf0e
Remove token-adding chat embedding params (#10551)
noamgat Nov 22, 2024
db100c5
[bugfix] fix full graph tests (#10581)
youkaichao Nov 22, 2024
eebad39
[torch.compile] support all attention backends (#10558)
youkaichao Nov 22, 2024
97814fb
[v1] Refactor KVCacheManager for more hash input than token ids (#10507)
rickyyx Nov 22, 2024
948c859
support bitsandbytes quantization with qwen model (#10549)
zixuanzhang226 Nov 23, 2024
28598f3
[Core] remove temporary local variables in LLMEngine.__init__ (#10577)
russellb Nov 23, 2024
d345f40
[V1] EngineCore supports profiling (#10564)
Abatom Nov 23, 2024
d559979
[bugfix] fix cpu tests (#10585)
youkaichao Nov 23, 2024
9195dbd
[Bugfix][Frontend] Update Llama Chat Templates to also support Non-To…
tjohnson31415 Nov 23, 2024
ebda519
[Core] Fix broken log configuration (#10458)
russellb Nov 23, 2024
978b397
[Misc] Add pynccl wrappers for all_gather and reduce_scatter (#9432)
tlrmchlsmth Nov 23, 2024
4aba6e3
[core] gemma2 full context length support (#10584)
youkaichao Nov 23, 2024
7d8ffb3
[Bugfix] Internal Server Error when tool_choice is incorrect. (#10567)
shenoyvvarun Nov 23, 2024
cfea9c0
[Model] Fix Baichuan BNB online quantization (#10572)
CNTRYROA Nov 23, 2024
02a43f8
Update default max_num_batch_tokens for chunked prefill to 2048 (#10544)
mgoin Nov 23, 2024
7c25fe4
[AMD] Add support for GGUF quantization on ROCm (#10254)
kliuae Nov 23, 2024
4634a89
Prefix Cache Aware Scheduling [1/n] (#10128)
rickyyx Nov 23, 2024
c8acd80
[2/N] handling placeholders in merged multi-modal processor (#10485)
DarkLight1337 Nov 23, 2024
4cfe5d2
[Bugfix] `multi_modal_kwargs` broadcast for CPU tensor parallel (#10541)
Isotr0py Nov 23, 2024
86a44fb
[Platforms] Refactor openvino code (#10573)
ji-huazhong Nov 23, 2024
651f6c3
For ppc64le, disabled tests for now and addressed space issues (#10538)
npanpaliya Nov 23, 2024
04668eb
[Bugfix] Avoid import AttentionMetadata explicitly in Mllama (#10593)
Isotr0py Nov 23, 2024
17d8fc1
[bugfix] Fix example/tensorize_vllm_model tests (#10595)
jeejeelee Nov 24, 2024
1700c54
[Bugfix] Fix LoRA weight sharding (#10450)
jeejeelee Nov 24, 2024
1c445dc
[CI/Build] Print running script to enhance CI log readability (#10594)
jeejeelee Nov 24, 2024
eda2b35
Revert "Print running script to enhance CI log readability" (#10601)
youkaichao Nov 24, 2024
c055747
[model][utils] add extract_layer_index utility function (#10599)
youkaichao Nov 24, 2024
e4fbb14
[doc] update the code to add models (#10603)
youkaichao Nov 24, 2024
49628fe
[Doc] Update README.md with Ray Summit talk links (#10610)
zhuohan123 Nov 25, 2024
214efc2
Support Cross encoder models (#10400)
maxdebayser Nov 25, 2024
7ea3cd7
[Refactor][MISC] del redundant code in ParallelConfig.postinit (#10614)
MengqingCao Nov 25, 2024
571841b
[torch.compile] support encoder based models (#10613)
youkaichao Nov 25, 2024
a30a605
[Doc] Add encoder-based models to Supported Models page (#10616)
DarkLight1337 Nov 25, 2024
7c2134b
[torch.compile] force inductor threads (#10620)
jeejeelee Nov 25, 2024
6581378
[torch.compile] add warning for unsupported models (#10622)
youkaichao Nov 25, 2024
25d806e
[misc] add torch.compile compatibility check (#10618)
youkaichao Nov 25, 2024
05d1f8c
[misc] move functions to config.py (#10624)
youkaichao Nov 25, 2024
ed46f14
[Model] Support `is_causal` HF config field for Qwen2 model (#10621)
DarkLight1337 Nov 25, 2024
2b0879b
Super tiny little typo fix (#10633)
fzyzcjy Nov 25, 2024
d04b13a
[Bug]: Authorization ignored when root_path is set (#10606)
chaunceyjiang Nov 25, 2024
c27df94
[Bugfix] Fix chunked prefill with model dtype float32 on Turing Devic…
wallashss Nov 25, 2024
452a4e8
[Docs] Add Snowflake Slides (#10641)
simon-mo Nov 25, 2024
b1d9205
[Model]: Add support for Aria model (#10514)
xffxff Nov 25, 2024
cf73f0c
[Model] Enable optional prefix when loading embedding models (#10639)
DarkLight1337 Nov 25, 2024
1b583cf
[Doc] Fix typos in docs (#10636)
DarkLight1337 Nov 25, 2024
9db713a
[Model] Add OLMo November 2024 model (#10503)
2015aroras Nov 25, 2024
58 changes: 42 additions & 16 deletions .buildkite/nightly-benchmarks/benchmark-pipeline.yaml
@@ -9,8 +9,11 @@ steps:
- image: badouralix/curl-jq
command:
- sh .buildkite/nightly-benchmarks/scripts/wait-for-image.sh

- wait

- label: "A100"
# skip: "use this flag to conditionally skip the benchmark step, useful for PR testing"
agents:
queue: A100
plugins:
@@ -41,20 +44,43 @@ steps:
- name: devshm
emptyDir:
medium: Memory
# - label: "H100"
# agents:
# queue: H100
# plugins:
# - docker#v5.11.0:
# image: public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:$BUILDKITE_COMMIT
# command:
# - bash
# - .buildkite/nightly-benchmarks/run-benchmarks-suite.sh
# mount-buildkite-agent: true
# propagate-environment: true
# ipc: host
# gpus: all
# environment:
# - VLLM_USAGE_SOURCE
# - HF_TOKEN

- label: "H200"
# skip: "use this flag to conditionally skip the benchmark step, useful for PR testing"
agents:
queue: H200
plugins:
- docker#v5.12.0:
image: public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:$BUILDKITE_COMMIT
command:
- bash
- .buildkite/nightly-benchmarks/scripts/run-performance-benchmarks.sh
mount-buildkite-agent: true
propagate-environment: true
ipc: host
gpus: 4,5,6,7
volumes:
- /data/benchmark-hf-cache:/root/.cache/huggingface
environment:
- VLLM_USAGE_SOURCE
- HF_TOKEN

- label: "H100"
# skip: "use this flag to conditionally skip the benchmark step, useful for PR testing"
agents:
queue: H100
plugins:
- docker#v5.12.0:
image: public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:$BUILDKITE_COMMIT
command:
- bash
- .buildkite/nightly-benchmarks/scripts/run-performance-benchmarks.sh
mount-buildkite-agent: true
propagate-environment: true
ipc: host
gpus: all # see CUDA_VISIBLE_DEVICES for actual GPUs used
volumes:
- /data/benchmark-hf-cache:/root/.cache/huggingface
environment:
- VLLM_USAGE_SOURCE
- HF_TOKEN
@@ -157,6 +157,18 @@ def results_to_json(latency, throughput, serving):
throughput_results,
serving_results)

for df in [latency_results, serving_results, throughput_results]:
if df.empty:
continue

# Sort all dataframes by their respective "Test name" columns
df.sort_values(by="Test name", inplace=True)

# The GPUs sometimes come in format of "GPUTYPE\nGPUTYPE\n...",
# we want to turn it into "8xGPUTYPE"
df["GPU"] = df["GPU"].apply(
lambda x: f"{len(x.split('\n'))}x{x.split('\n')[0]}")

# get markdown tables
latency_md_table = tabulate(latency_results,
headers='keys',
@@ -6,6 +6,7 @@

# Do not set -e, as the mixtral 8x22B model tends to crash occasionally
# and we still want to see other benchmarking results even when mixtral crashes.
set -x
set -o pipefail

check_gpus() {
@@ -85,11 +86,7 @@ kill_gpu_processes() {

ps -aux
lsof -t -i:8000 | xargs -r kill -9
pkill -f pt_main_thread
# this line doesn't work now
# ps aux | grep python | grep openai | awk '{print $2}' | xargs -r kill -9
pkill -f python3
pkill -f /usr/bin/python3
pgrep python3 | xargs -r kill -9


# wait until GPU memory usage smaller than 1GB
@@ -289,7 +286,7 @@ run_serving_tests() {
# run the server
echo "Running test case $test_name"
echo "Server command: $server_command"
eval "$server_command" &
bash -c "$server_command" &
server_pid=$!

# wait until the server is alive
@@ -322,7 +319,7 @@ run_serving_tests() {
echo "Running test case $test_name with qps $qps"
echo "Client command: $client_command"

eval "$client_command"
bash -c "$client_command"

# record the benchmarking commands
jq_output=$(jq -n \
21 changes: 8 additions & 13 deletions .buildkite/release-pipeline.yaml
@@ -6,28 +6,23 @@ steps:
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=12.1.0 --tag vllm-ci:build-image --target build --progress plain ."
- "mkdir artifacts"
- "docker run --rm -v $(pwd)/artifacts:/artifacts_host vllm-ci:build-image bash -c 'cp -r dist /artifacts_host && chmod -R a+rw /artifacts_host'"
# rename the files to change linux -> manylinux1
- "for f in artifacts/dist/*.whl; do mv -- \"$$f\" \"$${f/linux/manylinux1}\"; done"
- "mv artifacts/dist/$(ls artifacts/dist) artifacts/dist/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl"
- "aws s3 cp artifacts/dist/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl s3://vllm-wheels/$BUILDKITE_COMMIT/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl"
- "aws s3 cp artifacts/dist/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl s3://vllm-wheels/nightly/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl"
- "bash .buildkite/upload-wheels.sh"
env:
DOCKER_BUILDKIT: "1"

- block: "Build CUDA 11.8 wheel"
key: block-build-cu118-wheel

# Note(simon): We can always build CUDA 11.8 wheel to ensure the build is working.
# However, this block can be uncommented to save some compute hours.
# - block: "Build CUDA 11.8 wheel"
# key: block-build-cu118-wheel

- label: "Build wheel - CUDA 11.8"
depends_on: block-build-cu118-wheel
# depends_on: block-build-cu118-wheel
agents:
queue: cpu_queue
commands:
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=11.8.0 --tag vllm-ci:build-image --target build --progress plain ."
- "mkdir artifacts"
- "docker run --rm -v $(pwd)/artifacts:/artifacts_host vllm-ci:build-image bash -c 'cp -r dist /artifacts_host && chmod -R a+rw /artifacts_host'"
# rename the files to change linux -> manylinux1
- "for f in artifacts/dist/*.whl; do mv -- \"$$f\" \"$${f/linux/manylinux1}\"; done"
- "aws s3 cp --recursive artifacts/dist s3://vllm-wheels/$BUILDKITE_COMMIT/"
- "aws s3 cp --recursive artifacts/dist s3://vllm-wheels/nightly/"
- "bash .buildkite/upload-wheels.sh"
env:
DOCKER_BUILDKIT: "1"
1 change: 0 additions & 1 deletion .buildkite/run-amd-test.sh
@@ -85,7 +85,6 @@ if [[ $commands == *" kernels "* ]]; then
--ignore=kernels/test_encoder_decoder_attn.py \
--ignore=kernels/test_flash_attn.py \
--ignore=kernels/test_flashinfer.py \
--ignore=kernels/test_gguf.py \
--ignore=kernels/test_int8_quant.py \
--ignore=kernels/test_machete_gemm.py \
--ignore=kernels/test_mamba_ssm.py \
44 changes: 3 additions & 41 deletions .buildkite/run-cpu-test-ppc64le.sh
@@ -4,49 +4,11 @@
# It serves a sanity check for compilation and basic model usage.
set -ex

# Try building the docker image
docker build -t cpu-test -f Dockerfile.ppc64le .

# Setup cleanup
remove_docker_container() { docker rm -f cpu-test || true; }
remove_docker_container() { docker rm -f cpu-test || true; docker system prune -f; }
trap remove_docker_container EXIT
remove_docker_container

# Run the image, setting --shm-size=4g for tensor parallel.
source /etc/environment
#docker run -itd --entrypoint /bin/bash -v ~/.cache/huggingface:/root/.cache/huggingface --privileged=true --network host -e HF_TOKEN --env VLLM_CPU_KVCACHE_SPACE=4 --shm-size=4g --name cpu-test cpu-test
docker run -itd --entrypoint /bin/bash -v ~/.cache/huggingface:/root/.cache/huggingface --privileged=true --network host -e HF_TOKEN="$HF_TOKEN" --name cpu-test cpu-test

function cpu_tests() {
set -e

# Run basic model test
docker exec cpu-test bash -c "
set -e
pip install pytest pytest-asyncio \
decord einops librosa peft Pillow sentence-transformers soundfile \
transformers_stream_generator matplotlib datamodel_code_generator
pip install torchvision --index-url https://download.pytorch.org/whl/cpu
pytest -v -s tests/models/embedding/language
pytest -v -s tests/models/encoder_decoder/language
pytest -v -s tests/models/decoder_only/language/test_models.py
pytest -v -s tests/models/decoder_only/audio_language -m cpu_model
pytest -v -s tests/models/decoder_only/vision_language -m cpu_model"

# online inference
docker exec cpu-test bash -c "
set -e
python3 -m vllm.entrypoints.openai.api_server --model facebook/opt-125m &
timeout 600 bash -c 'until curl localhost:8000/v1/models; do sleep 1; done' || exit 1
python3 benchmarks/benchmark_serving.py \
--backend vllm \
--dataset-name random \
--model facebook/opt-125m \
--num-prompts 20 \
--endpoint /v1/completions \
--tokenizer facebook/opt-125m"
}
# Try building the docker image
docker build -t cpu-test -f Dockerfile.ppc64le .

# All of CPU tests are expected to be finished less than 25 mins.
export -f cpu_tests
timeout 25m bash -c "cpu_tests"
41 changes: 24 additions & 17 deletions .buildkite/run-cpu-test.sh
@@ -9,59 +9,66 @@ CORE_RANGE=${CORE_RANGE:-48-95}
NUMA_NODE=${NUMA_NODE:-1}

# Try building the docker image
numactl -C $CORE_RANGE -N $NUMA_NODE docker build -t cpu-test -f Dockerfile.cpu .
numactl -C $CORE_RANGE -N $NUMA_NODE docker build --build-arg VLLM_CPU_DISABLE_AVX512="true" -t cpu-test-avx2 -f Dockerfile.cpu .
numactl -C "$CORE_RANGE" -N "$NUMA_NODE" docker build -t cpu-test -f Dockerfile.cpu .
numactl -C "$CORE_RANGE" -N "$NUMA_NODE" docker build --build-arg VLLM_CPU_DISABLE_AVX512="true" -t cpu-test-avx2 -f Dockerfile.cpu .

# Setup cleanup
remove_docker_container() { docker rm -f cpu-test cpu-test-avx2 || true; }
remove_docker_container() { docker rm -f cpu-test-"$NUMA_NODE" cpu-test-avx2-"$NUMA_NODE" || true; }
trap remove_docker_container EXIT
remove_docker_container

# Run the image, setting --shm-size=4g for tensor parallel.
docker run -itd --entrypoint /bin/bash -v ~/.cache/huggingface:/root/.cache/huggingface --cpuset-cpus=$CORE_RANGE \
--cpuset-mems=$NUMA_NODE --privileged=true --network host -e HF_TOKEN --env VLLM_CPU_KVCACHE_SPACE=4 --shm-size=4g --name cpu-test cpu-test
docker run -itd --entrypoint /bin/bash -v ~/.cache/huggingface:/root/.cache/huggingface --cpuset-cpus=$CORE_RANGE \
--cpuset-mems=$NUMA_NODE --privileged=true --network host -e HF_TOKEN --env VLLM_CPU_KVCACHE_SPACE=4 --shm-size=4g --name cpu-test-avx2 cpu-test-avx2
docker run -itd --entrypoint /bin/bash -v ~/.cache/huggingface:/root/.cache/huggingface --cpuset-cpus="$CORE_RANGE" \
--cpuset-mems="$NUMA_NODE" --privileged=true --network host -e HF_TOKEN --env VLLM_CPU_KVCACHE_SPACE=4 --shm-size=4g --name cpu-test-"$NUMA_NODE" cpu-test
docker run -itd --entrypoint /bin/bash -v ~/.cache/huggingface:/root/.cache/huggingface --cpuset-cpus="$CORE_RANGE" \
--cpuset-mems="$NUMA_NODE" --privileged=true --network host -e HF_TOKEN --env VLLM_CPU_KVCACHE_SPACE=4 --shm-size=4g --name cpu-test-avx2-"$NUMA_NODE" cpu-test-avx2

function cpu_tests() {
set -e
export NUMA_NODE=$2

# offline inference
docker exec cpu-test-avx2 bash -c "
docker exec cpu-test-avx2-"$NUMA_NODE" bash -c "
set -e
python3 examples/offline_inference.py"

# Run basic model test
docker exec cpu-test bash -c "
docker exec cpu-test-"$NUMA_NODE" bash -c "
set -e
pip install pytest pytest-asyncio \
decord einops librosa peft Pillow sentence-transformers soundfile \
transformers_stream_generator matplotlib datamodel_code_generator
pip install torchvision --index-url https://download.pytorch.org/whl/cpu
pytest -v -s tests/models/embedding/language
pytest -v -s tests/models/encoder_decoder/language
pytest -v -s tests/models/decoder_only/language/test_models.py
pytest -v -s tests/models/decoder_only/language -m cpu_model
pytest -v -s tests/models/embedding/language -m cpu_model
pytest -v -s tests/models/encoder_decoder/language -m cpu_model
pytest -v -s tests/models/decoder_only/audio_language -m cpu_model
pytest -v -s tests/models/decoder_only/vision_language -m cpu_model"

# Run compressed-tensor test
docker exec cpu-test bash -c "
docker exec cpu-test-"$NUMA_NODE" bash -c "
set -e
pytest -s -v \
tests/quantization/test_compressed_tensors.py::test_compressed_tensors_w8a8_static_setup \
tests/quantization/test_compressed_tensors.py::test_compressed_tensors_w8a8_dynamic_per_token"

# Run AWQ test
docker exec cpu-test bash -c "
docker exec cpu-test-"$NUMA_NODE" bash -c "
set -e
pytest -s -v \
tests/quantization/test_ipex_quant.py"

# Run chunked-prefill and prefix-cache test
docker exec cpu-test-"$NUMA_NODE" bash -c "
set -e
pytest -s -v -k cpu_model \
tests/basic_correctness/test_chunked_prefill.py"

# online inference
docker exec cpu-test bash -c "
docker exec cpu-test-"$NUMA_NODE" bash -c "
set -e
export VLLM_CPU_KVCACHE_SPACE=10
export VLLM_CPU_OMP_THREADS_BIND=$CORE_RANGE
export VLLM_CPU_OMP_THREADS_BIND=$1
python3 -m vllm.entrypoints.openai.api_server --model facebook/opt-125m --dtype half &
timeout 600 bash -c 'until curl localhost:8000/v1/models; do sleep 1; done' || exit 1
python3 benchmarks/benchmark_serving.py \
@@ -75,4 +82,4 @@ function cpu_tests() {

# All of CPU tests are expected to be finished less than 25 mins.
export -f cpu_tests
timeout 25m bash -c "cpu_tests"
timeout 30m bash -c "cpu_tests $CORE_RANGE $NUMA_NODE"
2 changes: 1 addition & 1 deletion .buildkite/run-hpu-test.sh
@@ -13,4 +13,4 @@ trap remove_docker_container EXIT
remove_docker_container

# Run the image and launch offline inference
docker run --runtime=habana --name=hpu-test --network=host -e VLLM_SKIP_WARMUP=true --entrypoint="" hpu-test-env python3 examples/offline_inference.py
docker run --runtime=habana --name=hpu-test --network=host -e HABANA_VISIBLE_DEVICES=all -e VLLM_SKIP_WARMUP=true --entrypoint="" hpu-test-env python3 examples/offline_inference.py