Merge pull request #11 from ShangmingCai/new_mooncake_integration_doc
docs: add new vllm-integration guide.
stmatengss authored Dec 4, 2024
2 parents be7d7ca + 688a3c8 commit 1c71015
Showing 2 changed files with 179 additions and 1 deletion.
4 changes: 3 additions & 1 deletion README.md
@@ -67,11 +67,13 @@ Thanks to the high performance of Transfer Engine, P2P Stores can also distribut

![p2p-store.gif](image/p2p-store.gif)

### vLLM Integration ([Guide](doc/en/vllm-integration.md))
### vLLM Integration ([Guide v0.1](doc/en/vllm-integration.md), [v0.2-Nightly](doc/en/vllm-integration-v0.2-nightly.md))
To optimize LLM inference, the vLLM community is working on supporting [disaggregated prefilling (PR 8498)](https://github.com/vllm-project/vllm/pull/8498). This feature separates the **prefill** phase from the **decode** phase into different processes. vLLM uses `nccl` and `gloo` as the transport layer by default, but these currently cannot efficiently decouple the two phases across different machines.

We have implemented a vLLM integration that uses Transfer Engine as the network layer instead of `nccl` and `gloo` to support **inter-node KVCache transfer**. Transfer Engine provides a simpler interface and makes more efficient use of RDMA devices. In the future, we plan to build Mooncake Store on top of Transfer Engine to support pooled prefill/decode disaggregation.

**_Update [Dec 4, 2024]: A nightly vLLM integration ([Guide v0.2-Nightly](doc/en/vllm-integration-v0.2-nightly.md)) based on vLLM's main branch is now available._**

#### Performance
By supporting Topology Aware Path Selection and multi-card bandwidth aggregation, the TTFT of vLLM with Transfer Engine is up to 33% lower than with traditional TCP-based transports.
In the future, we will further improve TTFT through GPUDirect RDMA and zero-copy.
176 changes: 176 additions & 0 deletions doc/en/vllm-integration-v0.2-nightly.md
@@ -0,0 +1,176 @@
# vLLM Disaggregated Prefill/Decode Demo

## Overview
This is the nightly version of the mooncake-transfer-engine integration with the vLLM project, based on [PR 10502](https://github.com/vllm-project/vllm/pull/10502) (vLLM version: v0.6.4.post1/main), to accelerate KVCache transfer in the inter-node disaggregated Prefill/Decode scenario. Benchmark results will be added soon.

**_Please note that this is not a fully ready version and may be modified at any time based on feedback from the vLLM community._**

## Installation
### Prerequisite
Please install the Mooncake Transfer Engine according to the [instructions](build.md) first.

### Install an experimental version of vLLM
#### 1. Clone vLLM from the indicated repository
```bash
git clone [email protected]:kvcache-ai/vllm.git
```
#### 2. Build
##### 2.1 Build from source (includes C++ and CUDA code)
```bash
cd vllm
git checkout upstream-mooncake-integration
pip3 uninstall vllm -y
pip3 install -e .
```
- **If the build fails, try upgrading the version of cmake through `pip3 install cmake --upgrade`.**
- If you encounter any problems that you cannot solve, please refer to the [vLLM official compilation guide](https://docs.vllm.ai/en/v0.6.4.post1/getting_started/installation.html#install-the-latest-code).
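
After the build completes, a quick check (a minimal sketch, not an official step of this guide) can confirm that the experimental package imports and report its version:

```python
# Optional sanity check: confirm the freshly built vLLM package imports
# and print its version. Run inside the same Python environment used above.
import vllm

print("vLLM version:", vllm.__version__)
```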

## Configuration
### Prepare a configuration file to run the example over RDMA

- Prepare a _**mooncake.json**_ file for both the Prefill and Decode instances.
- **You do not need to change `prefill_url` and `decode_url` in the config file on the decode side; use the identical config file on both instances.**

```json
{
    "prefill_url": "192.168.0.137:13003",
    "decode_url": "192.168.0.139:13003",
    "metadata_server": "192.168.0.139:2379",
    "protocol": "rdma",
    "device_name": "erdma_0"
}
```
- "prefill_url": The IP address and port of the Prefill node.
- The port in the URL is used to communicate with etcd server for metadata.
- "decode_url": The IP address and port of the Decode node.
- The port in the URL is used to communicate with etcd server for metadata.
- **_If you want to run the prefill instance and decode instance on the same node, please set up a different port for the `decode_url`. To avoid port conflicts, ensure that the port number differs by at least 50 from the port number in `prefill_url`. For example, "decode_url": "192.168.0.137:13103"._**
- "metadata_server": The etcd server of mooncake transfer engine.
- "protocol": The protocol to be used for data transmission. ("rdma/tcp")
- "device_name": The device to be used for data transmission, required when "protocol" is set to "rdma". If multiple NIC devices are used, they can be separated by commas such as "erdma_0,erdma_1". Please note that there are no spaces between them.


### Prepare a configuration file to run the example over TCP

- Prepare a _**mooncake.json**_ file for both the Prefill and Decode instances.
```json
{
    "prefill_url": "192.168.0.137:13003",
    "decode_url": "192.168.0.139:13003",
    "metadata_server": "192.168.0.139:2379",
    "protocol": "tcp",
    "device_name": ""
}
```


## Run Example
- Please change the IP addresses and ports in the following guide according to your environment.
```bash
# Begin from `root` of your cloned repo!

# 1. Start the etcd server
etcd --listen-client-urls http://0.0.0.0:2379 --advertise-client-urls http://localhost:2379
# You may need to terminate other etcd processes before running the above command

# 2. Run on the prefilling side (producer role)
MOONCAKE_CONFIG_PATH=./mooncake.json VLLM_USE_MODELSCOPE=True python3 -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4 --port 8100 --max-model-len 10000 --gpu-memory-utilization 0.95 --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_producer","kv_rank":0,"kv_parallel_size":2,"kv_buffer_size":5e9, "kv_ip": "192.168.0.137", "kv_port": 51000 }'

# 3. Run on the decoding side (consumer role)
MOONCAKE_CONFIG_PATH=./mooncake.json VLLM_USE_MODELSCOPE=True python3 -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4 --port 8200 --max-model-len 10000 --gpu-memory-utilization 0.95 --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_consumer","kv_rank":1,"kv_parallel_size":2,"kv_buffer_size":5e9, "kv_ip": "192.168.0.137", "kv_port": 51000}'
```

- `MOONCAKE_CONFIG_PATH` is the path to the mooncake.json configuration file.
- `VLLM_USE_MODELSCOPE` is optional; if you have access to Hugging Face, remove it.
- The `--model` parameter specifies the model to use.
- The `--port` parameter specifies the port on which the vLLM service listens.
- The `--max-model-len` parameter specifies the maximum context length of the model.
- The `--tensor_parallel_size` / `-tp` option is now supported, but you need to pass `--enforce_eager` to disable CUDA graphs. For example, append `-tp 2 --enforce_eager` to the run command.
- If you want to run the prefill instance and decode instance on the same node, set different `CUDA_VISIBLE_DEVICES` values, e.g. `CUDA_VISIBLE_DEVICES=0,1` for the prefill instance and `CUDA_VISIBLE_DEVICES=2,3` for the decode instance.

- The `--kv-transfer-config` parameter specifies the connector to use and its configuration.
  - Set `kv_connector` to `MooncakeConnector`.
  - `kv_role` is the node's role, either 'kv_producer' or 'kv_consumer'.
  - `kv_rank` is the rank of the instance. Currently, the `kv_producer`'s rank is 0 and the `kv_consumer`'s rank is 1.
  - `kv_ip` and `kv_port` specify the IP address and port of the master node in a distributed setup for the disaggregated prefill feature. **_Be sure to set the same `kv_ip` and `kv_port` on each node._**
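
To make the producer/consumer differences explicit, the following sketch (not part of the guide; the addresses are the example values from the commands above) builds both `--kv-transfer-config` JSON strings:

```python
# Sketch: build the --kv-transfer-config JSON for both sides of the example.
# Only kv_role and kv_rank differ; everything else must match on both nodes.
import json

common = {
    "kv_connector": "MooncakeConnector",
    "kv_parallel_size": 2,
    "kv_buffer_size": 5e9,
    "kv_ip": "192.168.0.137",  # master node IP, identical on every node
    "kv_port": 51000,          # master node port, identical on every node
}

producer = dict(common, kv_role="kv_producer", kv_rank=0)  # prefill instance
consumer = dict(common, kv_role="kv_consumer", kv_rank=1)  # decode instance

print("prefill side: --kv-transfer-config '%s'" % json.dumps(producer))
print("decode side:  --kv-transfer-config '%s'" % json.dumps(consumer))
```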

```bash
# 4. Start the proxy server on one node (Let's take the prefill node as an example)
python3 proxy_server.py
```
The implementation of `proxy_server.py`:
```python
import os

import aiohttp
from quart import Quart, make_response, request

AIOHTTP_TIMEOUT = aiohttp.ClientTimeout(total=6 * 60 * 60)

app = Quart(__name__)


async def forward_request(url, data):
    """Forward a completion request to a vLLM instance and stream back its reply."""
    async with aiohttp.ClientSession(timeout=AIOHTTP_TIMEOUT) as session:
        headers = {
            "Authorization": f"Bearer {os.environ.get('OPENAI_API_KEY')}"
        }
        async with session.post(url=url, json=data,
                                headers=headers) as response:
            if response.status == 200:
                # Stream the response back in 1 KiB chunks.
                async for chunk_bytes in response.content.iter_chunked(1024):
                    yield chunk_bytes


@app.route('/v1/completions', methods=['POST'])
async def handle_request():
    try:
        original_request_data = await request.get_json()

        prefill_request = original_request_data.copy()
        # Change max_tokens to 1 so the prefill instance only does prefill.
        prefill_request['max_tokens'] = 1

        # Finish prefill on the producer instance; the KVCache is transferred
        # to the decode instance by the Mooncake connector.
        async for _ in forward_request('http://localhost:8100/v1/completions',
                                       prefill_request):
            continue

        # Return the decode instance's streamed output to the client.
        generator = forward_request(
            'http://192.168.0.139:8200/v1/completions',  # Be sure to change the IP address for your machine
            original_request_data)
        response = await make_response(generator)
        response.timeout = None

        return response

    except Exception as e:
        import sys
        import traceback
        exc_info = sys.exc_info()
        print("Error occurred in disagg prefill proxy server")
        print(e)
        print("".join(traceback.format_exception(*exc_info)))


if __name__ == '__main__':
    app.run(host="0.0.0.0", port=8000)
```

**_Be sure to change the IP address in the code._**


## Test with an OpenAI-compatible request
```bash
curl -s http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{
  "model": "Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4",
  "prompt": "San Francisco is a",
  "max_tokens": 1000
}'
```
- If you are not testing on the proxy server node, change `localhost` to the IP address of the proxy server.
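
If curl is unavailable, an equivalent request can be sent with the Python standard library (a minimal sketch; change the host if you are not running it on the proxy server node):

```python
# Python equivalent of the curl test above, using only the standard library.
import json
import urllib.request

payload = {
    "model": "Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4",
    "prompt": "San Francisco is a",
    "max_tokens": 1000,
}

req = urllib.request.Request(
    "http://localhost:8000/v1/completions",  # use the proxy server's address
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(req) as resp:
    print(resp.read().decode("utf-8"))
```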
