Multi node vLLM #530

Open · wants to merge 6 commits into base: main
112 changes: 112 additions & 0 deletions docs/source/use-vllm-as-backend.mdx
@@ -87,3 +87,115 @@
An optional key `metric_options` can be passed in the yaml file,
using the name of the metric or metrics, as defined in the `Metric.metric_name`.
In this case, the `codegen_pass@1:16` metric defined in our tasks will have the `num_samples` updated to 16,
independently of the number defined by default.


## Multi-node vLLM

vLLM can be used in a multi-node setting; for this, we rely on Ray.
In these examples, we assume a Slurm cluster whose nodes do not have internet access.
The scripts are heavily inspired by [https://github.com/NERSC/slurm-ray-cluster/](https://github.com/NERSC/slurm-ray-cluster/).

First, you need to start the Ray cluster. It has one master node, and you need to find an
available port on that node:

```bash
function find_available_port {
printf $(python -c 'import socket; s=socket.socket(); s.bind(("", 0)); print(s.getsockname()[1]); s.close()')
}

PORT=$(find_available_port)
NODELIST=($(scontrol show hostnames $SLURM_JOB_NODELIST))
MASTER=${NODELIST[0]} # Name of master node
MASTER_IP=$(hostname --ip-address) # IP address of master node

function set_VLLM_HOST_IP {
export VLLM_HOST_IP=$(hostname --ip-address);
}
export -f set_VLLM_HOST_IP;

# Start the master
srun -N1 -n1 -c $(( SLURM_CPUS_PER_TASK/2 )) -w $MASTER bash -c "set_VLLM_HOST_IP; ray start --head --port=$PORT --block" &
sleep 5

# Start all other nodes :)
if [[ $SLURM_NNODES -gt 1 ]]; then
srun -N $(( SLURM_NNODES-1 )) --ntasks-per-node=1 -c $(( SLURM_CPUS_PER_TASK/2 )) -x $MASTER bash -c "set_VLLM_HOST_IP; ray start --address=$MASTER_IP:$PORT --block" &
sleep 5
fi
```
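
Optionally, you can check that every node has joined the cluster before going further. This is a minimal sketch, reusing the variables defined above; `ray status` should report one head node plus `SLURM_NNODES - 1` workers.

```bash
# Optional sanity check: the cluster should list every allocated node.
# --overlap lets this share the master node with the blocking `ray start` task above.
srun -N1 -n1 --overlap -w $MASTER ray status --address=$MASTER_IP:$PORT
```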

Then, once the Ray cluster is running, you can launch vLLM through lighteval.

```bash
set_VLLM_HOST_IP

MODEL_ARGS="pretrained=$MODEL_DIRECTORY,gpu_memory_utilisation=0.5,trust_remote_code=False,dtype=bfloat16,max_model_length=16384,tensor_parallel_size=<gpus_per_node * nnodes>"
TASK_ARGS="community|gpqa-fr|0|0,community|ifeval-fr|0|0"

srun -N1 -n1 -c $(( SLURM_CPUS_PER_TASK/2 )) --overlap -w $MASTER lighteval vllm "$MODEL_ARGS" "$TASK_ARGS" --custom-tasks $TASK_FILE
```
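
The `tensor_parallel_size` placeholder must equal the total number of GPUs in the allocation. Below is a minimal sketch for deriving it from Slurm's environment, assuming `SLURM_GPUS_ON_NODE` is set (Slurm exports it when GPUs are requested with `--gres`); `TP_SIZE` is just an illustrative variable name.

```bash
# Total GPU count across the allocation; assumes SLURM_GPUS_ON_NODE is set by Slurm.
TP_SIZE=$(( SLURM_GPUS_ON_NODE * SLURM_NNODES ))
MODEL_ARGS="pretrained=$MODEL_DIRECTORY,gpu_memory_utilisation=0.5,trust_remote_code=False,dtype=bfloat16,max_model_length=16384,tensor_parallel_size=$TP_SIZE"
```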

The full script is available here: [https://github.com/huggingface/lighteval/blob/main/examples/slurm/multi_node_vllm.slurm](https://github.com/huggingface/lighteval/blob/main/examples/slurm/multi_node_vllm.slurm)

Review suggestion (Member):
- The full script is available here: [https://github.com/huggingface/lighteval/blob/main/examples/slurm/multi_node_vllm.slurm](https://github.com/huggingface/lighteval/blob/main/examples/slurm/multi_node_vllm.slurm)
+ The full script is available [here](https://github.com/huggingface/lighteval/blob/main/examples/slurm/multi_node_vllm.slurm)


## With vLLM Serve

It is also possible to use the `vllm serve` command to achieve a similar result, with the following benefits:
the server can be queried by multiple jobs, it only needs to be launched once when running multiple evaluations,
and it has a lower peak memory footprint on rank 0.
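
For example, since `vllm serve` exposes a standard OpenAI-compatible endpoint, several evaluation runs can share it. The sketch below is purely illustrative and assumes the server and the `OPENAI_API_KEY`, `OPENAI_BASE_URL`, and `TOKENIZER_PATH` variables set up later in this section.

```bash
# Sketch: two evaluation runs sharing a single vllm serve endpoint.
# Assumes OPENAI_API_KEY, OPENAI_BASE_URL and TOKENIZER_PATH are exported as shown below.
lighteval endpoint openai "$MODEL_NAME" "community|gpqa-fr|0|0" --custom-tasks $TASK_FILE &
lighteval endpoint openai "$MODEL_NAME" "community|ifeval-fr|0|0" --custom-tasks $TASK_FILE &
wait
```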

We also need to start the Ray cluster, exactly as before. This time, however,
we need to start the vLLM server before calling lighteval.

```bash
MODEL_NAME="Llama-3.2-1B-Instruct"
SERVER_PORT=$(find_available_port)
export OPENAI_API_KEY="I-love-vLLM"
export OPENAI_BASE_URL="http://localhost:$SERVER_PORT/v1"

vllm serve $MODEL_DIRECTORY \
--served-model-name $MODEL_NAME \
--api-key $OPENAI_API_KEY \
--enforce-eager \
--port $SERVER_PORT \
--tensor-parallel-size <gpus_per_node * nnodes> \
--dtype bfloat16 \
--max-model-len 16384 \
--gpu-memory-utilization 0.8 \
--disable-custom-all-reduce \
1>vllm.stdout 2>vllm.stderr &
```

In this example, the evaluation runs within the same job, which is why `vllm serve`
is put in the background. We therefore need to wait until the server is up and running
before launching lighteval:

```bash
ATTEMPT=0
DELAY=5
MAX_ATTEMPTS=60 # May need to be increased for very large models
until curl -s -o /dev/null -w "%{http_code}" $OPENAI_BASE_URL/models -H "Authorization: Bearer $OPENAI_API_KEY" | grep -E "^2[0-9]{2}$"; do
ATTEMPT=$((ATTEMPT + 1))
echo "$ATTEMPT attempts"
if [ "$ATTEMPT" -ge "$MAX_ATTEMPTS" ]; then
echo "Failed: the server did not respond to any of the $MAX_ATTEMPTS requests."
exit 1
fi
echo "vllm serve is not ready yet"
sleep $DELAY
done
```

Since the loop above only finishes once `vllm serve` is ready, we can then launch
the evaluation.

```bash
export TOKENIZER_PATH="$MODEL_DIRECTORY"

TASK_ARGS="community|gpqa-fr|0|0,community|ifeval-fr|0|0"

lighteval endpoint openai "$MODEL_NAME" "$TASK_ARGS" --custom-tasks $TASK_FILE
```
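
If the evaluation fails, it can help to query the server by hand first. Below is a minimal sketch of a chat-completion request against the OpenAI-compatible API; the prompt and token limit are placeholders.

```bash
# Manual check against the OpenAI-compatible endpoint (illustrative values).
curl "$OPENAI_BASE_URL/chat/completions" \
    -H "Authorization: Bearer $OPENAI_API_KEY" \
    -H "Content-Type: application/json" \
    -d "{\"model\": \"$MODEL_NAME\", \"messages\": [{\"role\": \"user\", \"content\": \"Hello\"}], \"max_tokens\": 16}"
```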

The full script is available here:

Review suggestion (Member):
- The full script is available here:
+ The full script is available [here](https://github.com/huggingface/lighteval/blob/main/examples/slurm/multi_node_vllm_serve.slurm).

[https://github.com/huggingface/lighteval/blob/main/examples/slurm/multi_node_vllm_serve.slurm](https://github.com/huggingface/lighteval/blob/main/examples/slurm/multi_node_vllm_serve.slurm)

Review suggestion (@NathanHB, Feb 20, 2025): remove the raw link line above.
- [https://github.com/huggingface/lighteval/blob/main/examples/slurm/multi_node_vllm_serve.slurm](https://github.com/huggingface/lighteval/blob/main/examples/slurm/multi_node_vllm_serve.slurm)

57 changes: 57 additions & 0 deletions examples/slurm/multi_node_vllm.slurm
@@ -0,0 +1,57 @@
#! /bin/bash

#SBATCH --job-name=EVALUATE_Llama-3.2-1B-Instruct
#SBATCH --account=brb@h100

Review suggestion (@NathanHB, Feb 20, 2025): remove the cluster-specific account line.
- #SBATCH --account=brb@h100

#SBATCH --output=evaluation.log
#SBATCH --error=evaluation.log
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=96
#SBATCH --gres=gpu:4
#SBATCH --hint=nomultithread
#SBATCH --constraint=h100
#SBATCH --time=02:00:00
#SBATCH --exclusive
#SBATCH --parsable

set -e
set -x

module purge
module load miniforge/24.9.0
conda activate $WORKDIR/lighteval-h100

function find_available_port {
printf $(python -c 'import socket; s=socket.socket(); s.bind(("", 0)); print(s.getsockname()[1]); s.close()')
}

PORT=$(find_available_port)
NODELIST=($(scontrol show hostnames $SLURM_JOB_NODELIST))
MASTER=${NODELIST[0]}
MASTER_IP=$(hostname --ip-address)

export HF_HOME=$WORKDIR/HF_HOME
export MODEL_DIRECTORY=$WORKDIR/HuggingFace_Models/meta-llama/Llama-3.2-1B-Instruct
export TASK_FILE=$WORKDIR/community_tasks/french_eval.py
export HF_HUB_OFFLINE=1
export VLLM_WORKER_MULTIPROC_METHOD=spawn

function set_VLLM_HOST_IP {
export VLLM_HOST_IP=$(hostname --ip-address);
}
export -f set_VLLM_HOST_IP;

srun -N1 -n1 -c $(( SLURM_CPUS_PER_TASK/2 )) -w $MASTER bash -c "set_VLLM_HOST_IP; ray start --head --port=$PORT --block" &
sleep 5

if [[ $SLURM_NNODES -gt 1 ]]; then
srun -N $(( SLURM_NNODES-1 )) --ntasks-per-node=1 -c $(( SLURM_CPUS_PER_TASK/2 )) -x $MASTER bash -c "set_VLLM_HOST_IP; ray start --address=$MASTER_IP:$PORT --block" &
sleep 5
fi

set_VLLM_HOST_IP

MODEL_ARGS="pretrained=$MODEL_DIRECTORY,gpu_memory_utilisation=0.5,trust_remote_code=False,dtype=bfloat16,max_model_length=8192,tensor_parallel_size=8"
TASK_ARGS="community|gpqa-fr|0|0,community|ifeval-fr|0|0"

srun -N1 -n1 -c $(( SLURM_CPUS_PER_TASK/2 )) --overlap -w $MASTER lighteval vllm "$MODEL_ARGS" "$TASK_ARGS" --custom-tasks $TASK_FILE

Review suggestion (Member): add --use-chat-template.
- srun -N1 -n1 -c $(( SLURM_CPUS_PER_TASK/2 )) --overlap -w $MASTER lighteval vllm "$MODEL_ARGS" "$TASK_ARGS" --custom-tasks $TASK_FILE
+ srun -N1 -n1 -c $(( SLURM_CPUS_PER_TASK/2 )) --overlap -w $MASTER lighteval vllm "$MODEL_ARGS" "$TASK_ARGS" --custom-tasks $TASK_FILE --use-chat-template

90 changes: 90 additions & 0 deletions examples/slurm/multi_node_vllm_serve.slurm
@@ -0,0 +1,90 @@
#! /bin/bash

#SBATCH --job-name=EVALUATE_Llama-3.2-1B-Instruct
#SBATCH --account=brb@h100
#SBATCH --output=evaluation.log
#SBATCH --error=evaluation.log
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=96
#SBATCH --gres=gpu:4
#SBATCH --hint=nomultithread
#SBATCH --constraint=h100
#SBATCH --time=02:00:00
#SBATCH --exclusive
#SBATCH --parsable

set -e
set -x

module purge
module load miniforge/24.9.0
conda activate $WORKDIR/lighteval-h100

function find_available_port {
printf $(python -c 'import socket; s=socket.socket(); s.bind(("", 0)); print(s.getsockname()[1]); s.close()')
}

PORT=$(find_available_port)
NODELIST=($(scontrol show hostnames $SLURM_JOB_NODELIST))
MASTER=${NODELIST[0]}
MASTER_IP=$(hostname --ip-address)

export HF_HOME=$WORKDIR/HF_HOME
export MODEL_DIRECTORY=$WORKDIR/HuggingFace_Models/meta-llama/Llama-3.2-1B-Instruct
export TASK_FILE=$WORKDIR/community_tasks/french_eval.py
export HF_HUB_OFFLINE=1
export VLLM_WORKER_MULTIPROC_METHOD=spawn

function set_VLLM_HOST_IP {
export VLLM_HOST_IP=$(hostname --ip-address);
}
export -f set_VLLM_HOST_IP;

srun -N1 -n1 -c $(( SLURM_CPUS_PER_TASK/2 )) -w $MASTER bash -c "set_VLLM_HOST_IP; ray start --head --port=$PORT --block" &
sleep 5

if [[ $SLURM_NNODES -gt 1 ]]; then
srun -N $(( SLURM_NNODES-1 )) --ntasks-per-node=1 -c $(( SLURM_CPUS_PER_TASK/2 )) -x $MASTER bash -c "set_VLLM_HOST_IP; ray start --address=$MASTER_IP:$PORT --block" &
sleep 5
fi

set_VLLM_HOST_IP

MODEL_NAME="Llama-3.2-1B-Instruct"
SERVER_PORT=$(find_available_port)
export OPENAI_API_KEY="I-love-vllm-serve"
export OPENAI_BASE_URL="http://localhost:$SERVER_PORT/v1"

Review comment (Member) on lines +56 to +57 (the OPENAI_API_KEY / OPENAI_BASE_URL exports above): not needed if using litellm.


vllm serve $MODEL_DIRECTORY \
--served-model-name $MODEL_NAME \
--api-key $OPENAI_API_KEY \
--enforce-eager \
--port $SERVER_PORT \
--tensor-parallel-size 8 \
--dtype bfloat16 \
--max-model-len 16384 \
--gpu-memory-utilization 0.8 \
--disable-custom-all-reduce \
1>vllm.stdout 2>vllm.stderr &


ATTEMPT=0
DELAY=5
MAX_ATTEMPTS=60
until curl -s -o /dev/null -w "%{http_code}" $OPENAI_BASE_URL/models -H "Authorization: Bearer $OPENAI_API_KEY" | grep -E "^2[0-9]{2}$"; do
ATTEMPT=$((ATTEMPT + 1))
echo "$ATTEMPT attempts"
if [ "$ATTEMPT" -ge "$MAX_ATTEMPTS" ]; then
echo "Failed: the server did not respond to any of the $MAX_ATTEMPTS requests."
exit 1
fi
echo "vllm serve is not ready yet"
sleep $DELAY
done

export TOKENIZER_PATH="$MODEL_DIRECTORY"

TASK_ARGS="community|gpqa-fr|0|0,community|ifeval-fr|0|0"

lighteval endpoint openai "$MODEL_NAME" "$TASK_ARGS" --custom-tasks $TASK_FILE

Review suggestion (@NathanHB, Feb 20, 2025): better to use litellm as it is better maintained in lighteval.
- lighteval endpoint openai "$MODEL_NAME" "$TASK_ARGS" --custom-tasks $TASK_FILE
+ lighteval endpoint litellm "model_name=${MODEL_NAME},base_url=http://localhost:$SERVER_PORT/v1,provider=openai" "$TASK_ARGS" --custom-tasks $TASK_FILE --use-chat-template

Review comment (Member): you also need to add --use-chat-template with litellm and openai endpoints.

3 changes: 2 additions & 1 deletion src/lighteval/models/endpoints/openai_model.py
@@ -107,7 +107,8 @@ def __init__(self, config: OpenAIModelConfig, env_config) -> None:
        try:
            self._tokenizer = tiktoken.encoding_for_model(self.model)
        except KeyError:
-           self._tokenizer = AutoTokenizer.from_pretrained(self.model)
+           tokenizer_path = os.environ.get("TOKENIZER_PATH", self.model)
+           self._tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)
        self.pairwise_tokenization = False

    def __call_api(self, prompt, return_logits, max_new_tokens, num_samples, logit_bias):
12 changes: 10 additions & 2 deletions src/lighteval/models/vllm/vllm_model.py
@@ -93,6 +93,7 @@ class VLLMModelConfig:
    )
    pairwise_tokenization: bool = False  # whether to tokenize the context and continuation separately or together.
    generation_parameters: GenerationParameters = None  # sampling parameters to use for generation
+   enforce_eager: bool = False  # whether or not to disable cuda graphs with vllm

    subfolder: Optional[str] = None

@@ -137,13 +138,19 @@ def tokenizer(self):
        return self._tokenizer

    def cleanup(self):
-       destroy_model_parallel()
+       if ray is not None:
+           ray.get(ray.remote(destroy_model_parallel).remote())
+       else:
+           destroy_model_parallel()
        if self.model is not None:
            del self.model.llm_engine.model_executor.driver_worker
        self.model = None
        gc.collect()
        ray.shutdown()
-       destroy_distributed_environment()
+       if ray is not None:
+           ray.get(ray.remote(destroy_distributed_environment).remote())
+       else:
+           destroy_distributed_environment()
        torch.cuda.empty_cache()

    @property
@@ -183,6 +190,7 @@ def _create_auto_model(self, config: VLLMModelConfig, env_config: EnvConfig) ->
            "max_model_len": self._max_length,
            "swap_space": 4,
            "seed": 1234,
+           "enforce_eager": config.enforce_eager,
        }
        if int(config.data_parallel_size) > 1:
            self.model_args["distributed_executor_backend"] = "ray"
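
As a usage illustration of the new `enforce_eager` field: a hedged sketch, assuming lighteval maps comma-separated `key=value` model arguments onto `VLLMModelConfig` fields as in the documentation examples above (the other argument values are taken from those examples).

```bash
# Sketch: disable CUDA graphs by passing the new flag through the model-args string.
# Assumes key=value pairs are parsed into VLLMModelConfig, as with the existing fields.
MODEL_ARGS="pretrained=$MODEL_DIRECTORY,dtype=bfloat16,max_model_length=16384,tensor_parallel_size=8,enforce_eager=True"
lighteval vllm "$MODEL_ARGS" "$TASK_ARGS" --custom-tasks $TASK_FILE
```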