[CI/Build] Add nightly benchmarking for tgi, tensorrt-llm and lmdeploy #5362

Merged on Jul 11, 2024 · 152 commits (changes shown from all commits)

Commits
f813e2e
Kuntai: add tgi and trt benchmarking script (initial version)
KuntaiDu Jun 9, 2024
d6cba46
update initial benchmarking script for lmdeploy
KuntaiDu Jun 14, 2024
2cc1023
Merge branch 'vllm-project:main' into kuntai-benchmark-dev
KuntaiDu Jun 18, 2024
5d8292b
Add download tokenizer script for lmdeploy
KuntaiDu Jun 19, 2024
a2dd7c9
add one-click runnable script for lmdeploy, parse tests from json file
KuntaiDu Jun 19, 2024
8416ce6
add nightly test json file
KuntaiDu Jun 19, 2024
df4ba8f
bug fix on tokenizer directory
KuntaiDu Jun 19, 2024
b974495
bug fix on getting model
KuntaiDu Jun 19, 2024
80d1c77
update test cases
KuntaiDu Jun 19, 2024
b3f3b0e
update parameter name
KuntaiDu Jun 19, 2024
9483acf
typo fix
KuntaiDu Jun 19, 2024
d72ae51
add wait_for_server
KuntaiDu Jun 19, 2024
0e819f0
update summarization script
KuntaiDu Jun 19, 2024
9181a1d
use pkill to kill lmdeploy
KuntaiDu Jun 19, 2024
6e1936c
update script for tgi
KuntaiDu Jun 19, 2024
c6aded9
add install jq
KuntaiDu Jun 19, 2024
ccbcd18
reduce 7b llama tp to 1
KuntaiDu Jun 19, 2024
38cc38a
update lmdeploy tp
KuntaiDu Jun 19, 2024
832891e
bug fix
KuntaiDu Jun 19, 2024
5877806
update tensorrt script
KuntaiDu Jun 20, 2024
9972aba
update nightly suite
KuntaiDu Jun 20, 2024
6493679
add double quote
KuntaiDu Jun 20, 2024
e62cae6
add trt llm version
KuntaiDu Jun 20, 2024
9ce3589
update trt
KuntaiDu Jun 20, 2024
f634dee
update on how to kill the server
KuntaiDu Jun 20, 2024
b47e30b
update vllm nightly test
KuntaiDu Jun 20, 2024
ec8b295
disable vllm server log
KuntaiDu Jun 20, 2024
792ef7f
adjust how to kill processes in vllm
KuntaiDu Jun 20, 2024
4f67c96
update nightly tests
KuntaiDu Jun 20, 2024
f0fe30c
update summary results
KuntaiDu Jun 20, 2024
d097843
update summary results
KuntaiDu Jun 20, 2024
2c30f38
add upload_to_buildkite utility
KuntaiDu Jun 20, 2024
7304668
update kickoff pipeline to initiate nightly benchmark
KuntaiDu Jun 20, 2024
f811ef0
update kickoff pipeline
KuntaiDu Jun 20, 2024
1876048
update the label name
KuntaiDu Jun 20, 2024
0a4518d
bug fix: exit benchmarking script after finish benchmarking one appli…
KuntaiDu Jun 21, 2024
f47db88
make yapf, ruff and isort happy
KuntaiDu Jun 21, 2024
8f4da1b
give nightly pipeline higher priority
KuntaiDu Jun 21, 2024
2cbdac3
fix new bugs in latest lmdeploy docker
KuntaiDu Jun 22, 2024
c24b963
try llama 70B with tp 4
KuntaiDu Jun 24, 2024
f79b6c4
rebuild
KuntaiDu Jun 24, 2024
dfb77f4
use mixtral model to prevent disk quota exceeded
KuntaiDu Jun 24, 2024
d2e4171
remove wait
KuntaiDu Jun 24, 2024
dc52195
temporarily remove trt pipeline --- disk quota exceeded
KuntaiDu Jun 24, 2024
8ffd8b1
fall back to 70B and test the storage required
KuntaiDu Jun 24, 2024
313f54f
use llama-2 as I do not have llama3 access...)
KuntaiDu Jun 24, 2024
62b2407
fix model name
KuntaiDu Jun 24, 2024
785d246
try llama 70B
KuntaiDu Jun 25, 2024
7fd891e
check file system size
KuntaiDu Jun 25, 2024
52cf795
update code for removing cache
KuntaiDu Jun 25, 2024
b8d1c94
merge common parameters
KuntaiDu Jun 25, 2024
14fb650
fix typo
KuntaiDu Jun 25, 2024
3aba28a
reduce qps to 8, just for testing
KuntaiDu Jun 25, 2024
733ac33
append to the same context
KuntaiDu Jun 25, 2024
5e1ec4b
optimize for buildkite annotation
KuntaiDu Jun 25, 2024
2423106
add double enter for md table
KuntaiDu Jun 25, 2024
d3f9701
format adjust for markdown presentation
KuntaiDu Jun 25, 2024
c4d651b
move header to the description file
KuntaiDu Jun 25, 2024
25c5a2f
separate annotation to a new step
KuntaiDu Jun 25, 2024
5183fea
add extra step to annotate pipeline
KuntaiDu Jun 25, 2024
f0684af
add wait
KuntaiDu Jun 25, 2024
6c7ddf8
bring back the full test case
KuntaiDu Jun 25, 2024
13f5d99
fix syntax error
KuntaiDu Jun 25, 2024
11079c7
break when the server failed to start --- so that the buildkite uploa…
KuntaiDu Jun 25, 2024
0ed8131
make yapf happy
KuntaiDu Jun 25, 2024
fae306e
test vllm code
KuntaiDu Jun 26, 2024
5098e10
check if mounting is successful
KuntaiDu Jun 26, 2024
b0e7667
add pwd
KuntaiDu Jun 26, 2024
4034f5f
add ls
KuntaiDu Jun 26, 2024
4ab2eca
update to read code from the docker, instead of running wget
KuntaiDu Jun 26, 2024
d6a34d3
remove raid
KuntaiDu Jun 26, 2024
c57ac0a
try Roger's fix
KuntaiDu Jun 26, 2024
d75d45b
remove nvme raid
KuntaiDu Jun 27, 2024
5dc8c8c
raise the priority of benchmarking development jobs
KuntaiDu Jun 28, 2024
8b91927
reduce the # of test from 1000 to 500, for faster testing
KuntaiDu Jun 28, 2024
8539874
trt won't run all the test. Just run llama-3 70B. Fix this bug tomorrow
KuntaiDu Jun 28, 2024
144328b
debug tensorrt
KuntaiDu Jun 30, 2024
64e9518
bug fix: avoid reassigning params during the for loop
KuntaiDu Jul 1, 2024
a94c140
bring lmdeploy back for testing
KuntaiDu Jul 1, 2024
1f0ccb0
change test name
KuntaiDu Jul 1, 2024
2f53b96
separating run server command from the bash file
KuntaiDu Jul 1, 2024
51d679e
clean up
KuntaiDu Jul 1, 2024
21c986d
run lmdeploy server in a separate process
KuntaiDu Jul 1, 2024
96bc249
bring back the full test suite
KuntaiDu Jul 1, 2024
6c566cb
bug fix: need to use llama checkpoint converter for mixtral model
KuntaiDu Jul 1, 2024
162700f
reduce test case to only mixtral, debug lmdeploy + mixtral
KuntaiDu Jul 2, 2024
b0d74cd
developing fp8 + tensorrt-llm
KuntaiDu Jul 2, 2024
f1a7955
move fp8 quantization to common parameters
KuntaiDu Jul 2, 2024
459fb2f
add fp8 for vllm
KuntaiDu Jul 2, 2024
79b295c
remove unused parameters
KuntaiDu Jul 2, 2024
019802a
use llama2 for local debugging
KuntaiDu Jul 2, 2024
3d20f92
move kv cache dtype inside vllm
KuntaiDu Jul 2, 2024
44e2d97
change model
KuntaiDu Jul 2, 2024
b8dbd8a
test fp8 performance
KuntaiDu Jul 3, 2024
0313c19
reduce calib size
KuntaiDu Jul 3, 2024
7b483a1
freeze fp16 benchmark
KuntaiDu Jul 3, 2024
2e577b3
Merge branch 'vllm-project:main' into kuntai-benchmark-dev
KuntaiDu Jul 3, 2024
c5e6662
add standard deviation for each metric -- to plot confidence interval
KuntaiDu Jul 3, 2024
22e78b5
remove annotation inside the job --- run the annotation at the last.
KuntaiDu Jul 3, 2024
59072ed
reduce nightly pipeline length
KuntaiDu Jul 3, 2024
a3e4355
remove headers in result
KuntaiDu Jul 3, 2024
e27677a
add visualization step
KuntaiDu Jul 4, 2024
7c845ae
support figure visualization
KuntaiDu Jul 4, 2024
3a70a60
adjust visualization
KuntaiDu Jul 4, 2024
8260d38
visual adjustment
KuntaiDu Jul 4, 2024
4643749
remove text annotation
KuntaiDu Jul 4, 2024
3146a96
add padding
KuntaiDu Jul 4, 2024
6da59d1
add hyperlink
KuntaiDu Jul 4, 2024
0802f9f
bring back the full suite of test
KuntaiDu Jul 4, 2024
8e6fca2
adjust test order
KuntaiDu Jul 4, 2024
a77fcbd
bring back full benchmark suite
KuntaiDu Jul 4, 2024
8b51f45
add more pad
KuntaiDu Jul 4, 2024
4427b06
mount huggingface cache
KuntaiDu Jul 4, 2024
329efe6
mount huggingface cache
KuntaiDu Jul 4, 2024
a174d26
add even more padding
KuntaiDu Jul 4, 2024
3f49b0c
add illustration
KuntaiDu Jul 4, 2024
5c3a7d0
make yapf and ruff happy
KuntaiDu Jul 4, 2024
ec6f42d
add datetime to filename and make yapf happy
KuntaiDu Jul 4, 2024
0090577
debug mixtral and llama70B
KuntaiDu Jul 4, 2024
f76a04a
pin lmdeploy transformers to 4.41.2
KuntaiDu Jul 4, 2024
859d6f3
skip the test case instead of exiting the whole test suite
KuntaiDu Jul 4, 2024
3ce4f5f
update transformers for mixtral model
KuntaiDu Jul 4, 2024
646114d
move transformers update
KuntaiDu Jul 4, 2024
b6058aa
typo fix
KuntaiDu Jul 4, 2024
ac4d137
bring back the full test suite
KuntaiDu Jul 4, 2024
11731f4
remove wrongfully-added results
KuntaiDu Jul 5, 2024
b012f71
adjust plotting & provide more details in nightly description
KuntaiDu Jul 5, 2024
4def302
adjust figure -- add grid, bar plot, color, +throughput
KuntaiDu Jul 7, 2024
6b77d2b
typo fix
KuntaiDu Jul 7, 2024
8c0259c
bug fix: set color using attribute
KuntaiDu Jul 7, 2024
2ee07df
mute curl output --- it's getting toooo long
KuntaiDu Jul 7, 2024
9547066
adjust coloring
KuntaiDu Jul 7, 2024
a3085a1
increase font size, adjust coloring
KuntaiDu Jul 7, 2024
0a554ae
adjust font size
KuntaiDu Jul 7, 2024
c6c9292
adjust spacing
KuntaiDu Jul 7, 2024
4788d27
increase font size
KuntaiDu Jul 7, 2024
ccc160c
increase cap size
KuntaiDu Jul 7, 2024
b6c5572
make yapf and ruff happy
KuntaiDu Jul 7, 2024
13d8c04
allow running performance benchmark & nightly benchmark simultaneously
KuntaiDu Jul 9, 2024
4d77e8f
adjust the annotation context for nightly benchmark so that it does n…
KuntaiDu Jul 9, 2024
da41c53
cut redundant lines in nightly-pipeline.yaml using yaml anchor
KuntaiDu Jul 9, 2024
c108084
add dpi=400
KuntaiDu Jul 9, 2024
57e6783
adjust pipeline upload order
KuntaiDu Jul 10, 2024
b057b4b
merge two pipelines using yq
KuntaiDu Jul 10, 2024
1053900
adjust merging logic
KuntaiDu Jul 10, 2024
5ef7e8a
put blocking step as the first step
KuntaiDu Jul 10, 2024
bbe115d
this file has been moved to vllm-project/buildkite-ci. Remove it.
KuntaiDu Jul 10, 2024
fb1e392
add warning message
KuntaiDu Jul 10, 2024
50ed6b7
add a wait at the end, essential when merging multiple yaml files
KuntaiDu Jul 10, 2024
9758f94
adjust pipeline.yaml
KuntaiDu Jul 11, 2024
8608d17
adjust pipeline.yaml
KuntaiDu Jul 11, 2024
37c4c11
fix pipeline yaml
KuntaiDu Jul 11, 2024
1 change: 1 addition & 0 deletions .buildkite/nightly-benchmarks/README.md
@@ -1,5 +1,6 @@
# vLLM benchmark suite


## Introduction

This directory contains the performance benchmarking CI for vllm.
27 changes: 0 additions & 27 deletions .buildkite/nightly-benchmarks/kickoff-pipeline.sh

This file was deleted.

45 changes: 45 additions & 0 deletions .buildkite/nightly-benchmarks/nightly-descriptions.md
@@ -0,0 +1,45 @@

# Nightly benchmark

The main goal of this benchmarking is two-fold:
- Performance clarity: make it clear which engine (vllm, tensorrt-llm, lmdeploy, or tgi) leads in performance under which workload.
- Reproducibility: anyone can run the exact same set of benchmarking commands inside the exact same Docker image by following the reproduction instructions in [reproduce.md]().


## Docker images

We benchmark vllm, tensorrt-llm, lmdeploy and tgi using the following docker images:
- vllm/vllm-openai:v0.5.0.post1
- nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3
- openmmlab/lmdeploy:v0.5.0
- ghcr.io/huggingface/text-generation-inference:2.1

<!-- Please check <a href="artifact://workspace/build/buildkite/vllm/performance-benchmark/.buildkite/nightly-benchmarks/nightly-pipeline.yaml">nightly-pipeline.yaml</a> artifact for more details on how we deploy the docker images. -->


## Hardware

One AWS node with 8x NVIDIA A100 GPUs.


## Workload description

We benchmark vllm, tensorrt-llm, lmdeploy and tgi using the following workload:

- Input length: 500 prompts randomly sampled from the ShareGPT dataset (with a fixed random seed).
- Output length: the output lengths corresponding to these 500 prompts.
- Models: llama-3 8B, llama-3 70B, mixtral 8x7B.
- Average QPS (queries per second): 4 for the small model (llama-3 8B) and 2 for the other two models. For each QPS, the arrival time of each query is drawn from a Poisson process (with a fixed random seed); see the sketch below.
- Evaluation metrics: throughput (higher is better), TTFT (time to first token, lower is better), and ITL (inter-token latency, lower is better).

<!-- Check <a href="artifact://workspace/build/buildkite/vllm/performance-benchmark/.buildkite/nightly-benchmarks/tests/nightly-tests.json">nightly-tests.json</a> artifact for more details. -->
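To make the arrival-time sampling concrete, here is a minimal sketch of drawing Poisson-process arrivals at a target average QPS with a fixed seed. It is not the exact benchmarking code; the function and variable names are illustrative.

```python
import numpy as np


# A Poisson process at rate `qps` has exponentially distributed inter-arrival
# gaps with mean 1/qps; fixing the seed keeps the workload identical across
# engines and across nightly runs.
def sample_arrival_times(num_requests: int, qps: float, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    gaps = rng.exponential(scale=1.0 / qps, size=num_requests)
    return np.cumsum(gaps)  # absolute send times, in seconds


print(sample_arrival_times(num_requests=5, qps=4.0))
```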

## Plots

In the following plots, each dot shows the mean and the error bar shows the standard error of the mean. A value of 0 means that the corresponding benchmark crashed.

<img src="artifact://nightly_results.png" alt="Benchmarking results" height=250 >
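As a rough illustration of the aggregation behind these error bars (a sketch only; the actual summarization script may aggregate differently), the mean and standard error of the mean for a metric measured over repeated runs can be computed as:

```python
import numpy as np

# Example TTFT measurements in milliseconds (made-up values, one per run).
ttft_ms = np.array([105.2, 98.7, 110.4, 101.9])
mean = ttft_ms.mean()
sem = ttft_ms.std(ddof=1) / np.sqrt(len(ttft_ms))  # error-bar half-width
print(f"TTFT: {mean:.1f} ms +/- {sem:.1f} ms")
```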

## Results

{nightly_results_benchmarking_table}
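The `{nightly_results_benchmarking_table}` placeholder above is presumably filled in by the annotation step. A minimal sketch of that substitution is shown below; the file names and the table string are assumptions, not taken from the actual script.

```python
from pathlib import Path

# Hypothetical example: render the nightly description by substituting a
# markdown results table into the placeholder defined above.
table_md = ("| Test | Tput (req/s) | Mean TTFT (ms) | Mean ITL (ms) |\n"
            "|---|---|---|---|\n"
            "| ... | ... | ... | ... |")
description = Path("nightly-descriptions.md").read_text()
Path("nightly_annotation.md").write_text(
    description.format(nightly_results_benchmarking_table=table_md))
```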
120 changes: 120 additions & 0 deletions .buildkite/nightly-benchmarks/nightly-pipeline.yaml
@@ -0,0 +1,120 @@
common_pod_spec: &common_pod_spec
  priorityClassName: perf-benchmark
  nodeSelector:
    nvidia.com/gpu.product: NVIDIA-A100-SXM4-80GB
  volumes:
    - name: devshm
      emptyDir:
        medium: Memory
    - name: hf-cache
      hostPath:
        path: /root/.cache/huggingface
        type: Directory

common_container_settings: &common_container_settings
  command:
    - bash .buildkite/nightly-benchmarks/run-nightly-suite.sh
  resources:
    limits:
      nvidia.com/gpu: 8
  volumeMounts:
    - name: devshm
      mountPath: /dev/shm
    - name: hf-cache
      mountPath: /root/.cache/huggingface
  env:
    - name: VLLM_USAGE_SOURCE
      value: ci-test
    - name: HF_HOME
      value: /root/.cache/huggingface
    - name: VLLM_SOURCE_CODE_LOC
      value: /workspace/build/buildkite/vllm/performance-benchmark
    - name: HF_TOKEN
      valueFrom:
        secretKeyRef:
          name: hf-token-secret
          key: token

steps:
  - block: ":rocket: Ready for comparing vllm against alternatives? This will take 4 hours."
  - label: "A100 trt benchmark"
    priority: 100
    agents:
      queue: A100
    plugins:
      - kubernetes:
          podSpec:
            <<: *common_pod_spec
            containers:
              - image: nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3
                <<: *common_container_settings

  - label: "A100 lmdeploy benchmark"
    priority: 100
    agents:
      queue: A100
    plugins:
      - kubernetes:
          podSpec:
            <<: *common_pod_spec
            containers:
              - image: openmmlab/lmdeploy:v0.5.0
                <<: *common_container_settings


  - label: "A100 vllm benchmark"
    priority: 100
    agents:
      queue: A100
    plugins:
      - kubernetes:
          podSpec:
            <<: *common_pod_spec
            containers:
              - image: vllm/vllm-openai:latest
                <<: *common_container_settings

  - label: "A100 tgi benchmark"
    priority: 100
    agents:
      queue: A100
    plugins:
      - kubernetes:
          podSpec:
            <<: *common_pod_spec
            containers:
              - image: ghcr.io/huggingface/text-generation-inference:2.1
                <<: *common_container_settings

  - wait

  - label: "Plot"
    priority: 100
    agents:
      queue: A100
    plugins:
      - kubernetes:
          podSpec:
            <<: *common_pod_spec
            containers:
              - image: vllm/vllm-openai:v0.5.0.post1
                command:
                  - bash .buildkite/nightly-benchmarks/scripts/nightly-annotate.sh
                resources:
                  limits:
                    nvidia.com/gpu: 8
                volumeMounts:
                  - name: devshm
                    mountPath: /dev/shm
                env:
                  - name: VLLM_USAGE_SOURCE
                    value: ci-test
                  - name: VLLM_SOURCE_CODE_LOC
                    value: /workspace/build/buildkite/vllm/performance-benchmark
                  - name: HF_TOKEN
                    valueFrom:
                      secretKeyRef:
                        name: hf-token-secret
                        key: token

  - wait
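As a quick local sanity check on the anchors and merge keys used above, the pipeline file can be loaded and inspected. This is a sketch only (not part of this PR); it assumes PyYAML is installed and that the file path below is correct relative to the working directory.

```python
import yaml  # PyYAML resolves YAML 1.1 merge keys ("<<") while loading

with open(".buildkite/nightly-benchmarks/nightly-pipeline.yaml") as f:
    pipeline = yaml.safe_load(f)

# steps[0] is the block step; steps[1] is the first benchmark step.
pod_spec = pipeline["steps"][1]["plugins"][0]["kubernetes"]["podSpec"]
print(pod_spec["priorityClassName"])       # expected: perf-benchmark (merged in)
print(pod_spec["containers"][0]["image"])  # expected: the trt image
```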
76 changes: 76 additions & 0 deletions .buildkite/nightly-benchmarks/run-nightly-suite.sh
@@ -0,0 +1,76 @@
#!/bin/bash

set -o pipefail
set -x

check_gpus() {
  # check the number of GPUs and GPU type.
  declare -g gpu_count=$(nvidia-smi --list-gpus | wc -l)
  if [[ $gpu_count -gt 0 ]]; then
    echo "GPU found."
  else
    echo "Need at least 1 GPU to run benchmarking."
    exit 1
  fi
  declare -g gpu_type=$(echo $(nvidia-smi --query-gpu=name --format=csv,noheader) | awk '{print $2}')
  echo "GPU type is $gpu_type"
}

check_hf_token() {
  # check if HF_TOKEN is available and valid
  if [[ -z "$HF_TOKEN" ]]; then
    echo "Error: HF_TOKEN is not set."
    exit 1
  elif [[ ! "$HF_TOKEN" =~ ^hf_ ]]; then
    echo "Error: HF_TOKEN does not start with 'hf_'."
    exit 1
  else
    echo "HF_TOKEN is set and valid."
  fi
}

main() {

  check_gpus
  check_hf_token

  df -h

  (which wget && which curl) || (apt-get update && apt-get install -y wget curl)
  (which jq) || (apt-get update && apt-get -y install jq)

  cd $VLLM_SOURCE_CODE_LOC/benchmarks
  wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json


  # run lmdeploy
  if which lmdeploy >/dev/null; then
    echo "lmdeploy is available, redirect to run-lmdeploy-nightly.sh"
    bash ../.buildkite/nightly-benchmarks/scripts/run-lmdeploy-nightly.sh
    exit 0
  fi

  # run tgi
  if [ -e /tgi-entrypoint.sh ]; then
    echo "tgi is available, redirect to run-tgi-nightly.sh"
    bash ../.buildkite/nightly-benchmarks/scripts/run-tgi-nightly.sh
    exit 0
  fi

  # run trt
  if which trtllm-build >/dev/null; then
    echo "trtllm is available, redirect to run-trt-nightly.sh"
    bash ../.buildkite/nightly-benchmarks/scripts/run-trt-nightly.sh
    exit 0
  fi

  # run vllm
  if [ -e /vllm-workspace ]; then
    echo "vllm is available, redirect to run-vllm-nightly.sh"
    bash ../.buildkite/nightly-benchmarks/scripts/run-vllm-nightly.sh
    exit 0
  fi

}

main "$@"
26 changes: 26 additions & 0 deletions .buildkite/nightly-benchmarks/scripts/download-tokenizer.py
@@ -0,0 +1,26 @@
import argparse

from transformers import AutoTokenizer


def main(model, cachedir):
    # Load the tokenizer and save it to the specified directory
    tokenizer = AutoTokenizer.from_pretrained(model)
    tokenizer.save_pretrained(cachedir)
    print(f"Tokenizer saved to {cachedir}")


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Download and save Hugging Face tokenizer")
    parser.add_argument("--model",
                        type=str,
                        required=True,
                        help="Name of the model")
    parser.add_argument("--cachedir",
                        type=str,
                        required=True,
                        help="Directory to save the tokenizer")

    args = parser.parse_args()
    main(args.model, args.cachedir)
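For reference, the helper above might be invoked like this from one of the serving scripts. This is a hypothetical usage sketch; the model name and cache directory are placeholders, not values taken from this PR.

```python
import subprocess

# Download the tokenizer for an example model into an example local directory.
subprocess.run(
    [
        "python3", ".buildkite/nightly-benchmarks/scripts/download-tokenizer.py",
        "--model", "meta-llama/Meta-Llama-3-8B",  # placeholder model
        "--cachedir", "/tokenizer_cache",          # placeholder directory
    ],
    check=True,
)
```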
@@ -0,0 +1,6 @@
from lmdeploy.serve.openai.api_client import APIClient

# Query the locally running lmdeploy server for the name of the model it is
# serving, so the benchmarking client can pass the right model name.
api_client = APIClient("http://localhost:8000")
model_name = api_client.available_models[0]

print(model_name)