From 78bb8b8d5f7f6c4055248046eaf306779850c6e4 Mon Sep 17 00:00:00 2001
From: Shangming Cai
Date: Tue, 3 Dec 2024 19:24:50 +0800
Subject: [PATCH 1/6] docs: add new vllm-integration guide.

Signed-off-by: Shangming Cai
---
 doc/en/vllm-integration-new.md | 176 +++++++++++++++++++++++++++++++++
 1 file changed, 176 insertions(+)
 create mode 100644 doc/en/vllm-integration-new.md

diff --git a/doc/en/vllm-integration-new.md b/doc/en/vllm-integration-new.md
new file mode 100644
index 0000000..330eba7
--- /dev/null
+++ b/doc/en/vllm-integration-new.md
@@ -0,0 +1,176 @@
+# vLLM Disaggregated Prefill/Decode Demo
+
+## Overview
+This is the nightly version of the mooncake-transfer-engine integration with the vLLM project, based on [PR 10502](https://github.com/vllm-project/vllm/pull/10502) (vLLM version: v0.6.4.post1), to accelerate KVCache transfer for the inter-node disaggregated Prefill/Decode scenario. Benchmark results will be added soon.
+
+![vllm-integration-demo](../../image/vllm-integration-demo.gif)
+
+## Installation
+### Prerequisite
+Please install the Mooncake Transfer Engine according to the [instructions](build.md) first.
+
+### Install an experimental version of vLLM
+#### 1. Clone vLLM from the indicated repo
+```bash
+git clone git@github.com:kvcache-ai/vllm.git
+```
+#### 2. Build
+##### 2.1 Build from source (includes C++ and CUDA code)
+```bash
+cd vllm
+git checkout upstream-mooncake-integration
+pip3 uninstall vllm -y
+pip3 install -e .
+```
+ - **If the build fails, try upgrading cmake with `pip3 install cmake --upgrade`.**
+ - If you encounter any problems that you cannot solve, please refer to the [vLLM official compilation guide](https://docs.vllm.ai/en/v0.6.4.post1/getting_started/installation.html#install-the-latest-code).
+
+## Configuration
+### Prepare a configuration file to run the example over RDMA
+
+- Prepare a _**mooncake.json**_ file for both the Prefill and Decode instances.
+- **You do not need to change `prefill_url` and `decode_url` in the config file on the decode side; please use an identical config file on both sides.**
+
+```json
+{
+    "prefill_url": "192.168.0.137:13003",
+    "decode_url": "192.168.0.139:13003",
+    "metadata_server": "192.168.0.139:2379",
+    "protocol": "rdma",
+    "device_name": "erdma_0"
+}
+```
+- "prefill_url": The IP address and port of the Prefill node.
+  - The port in the URL is used to communicate with the etcd server for metadata.
+- "decode_url": The IP address and port of the Decode node.
+  - The port in the URL is used to communicate with the etcd server for metadata.
+  - **_If you want to run the prefill instance and the decode instance on the same node, set a different port for `decode_url`. To avoid port conflicts, ensure that the port number differs by at least 50 from the port in `prefill_url`, e.g., "decode_url": "192.168.0.137:13103"._**
+- "metadata_server": The address of the etcd server used by the Mooncake Transfer Engine for metadata.
+- "protocol": The protocol to be used for data transmission ("rdma" or "tcp").
+- "device_name": The device to be used for data transmission; required when "protocol" is set to "rdma". Multiple NIC devices can be separated by commas, such as "erdma_0,erdma_1". Please note that there must be no spaces between them.
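+
+For example, here is a sketch of a config that aggregates two RDMA NICs on each node (the device names are illustrative; use the ones reported on your own machines):
+
+```json
+{
+    "prefill_url": "192.168.0.137:13003",
+    "decode_url": "192.168.0.139:13003",
+    "metadata_server": "192.168.0.139:2379",
+    "protocol": "rdma",
+    "device_name": "erdma_0,erdma_1"
+}
+```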
+
+### Prepare a configuration file to run the example over TCP
+
+- Prepare a _**mooncake.json**_ file for both the Prefill and Decode instances.
+```json
+{
+    "prefill_url": "192.168.0.137:13003",
+    "decode_url": "192.168.0.139:13003",
+    "metadata_server": "192.168.0.139:2379",
+    "protocol": "tcp",
+    "device_name": ""
+}
+```
+
+## Run Example
+ - Please change the IP addresses and ports in the following guide according to your environment.
+```bash
+# Begin from the root of your cloned repo!
+
+# 1. Start the etcd server
+etcd --listen-client-urls http://0.0.0.0:2379 --advertise-client-urls http://localhost:2379
+# You may need to terminate other etcd processes before running the above command
+
+# 2. Run on the prefill side (producer role)
+MOONCAKE_CONFIG_PATH=./mooncake.json VLLM_USE_MODELSCOPE=True python3 -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4 --port 8100 --max-model-len 10000 --gpu-memory-utilization 0.95 --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_producer","kv_rank":0,"kv_parallel_size":2,"kv_buffer_size":5e9,"kv_ip":"192.168.0.137","kv_port":51000}'
+
+# 3. Run on the decode side (consumer role)
+MOONCAKE_CONFIG_PATH=./mooncake.json VLLM_USE_MODELSCOPE=True python3 -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4 --port 8200 --max-model-len 10000 --gpu-memory-utilization 0.95 --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_consumer","kv_rank":1,"kv_parallel_size":2,"kv_buffer_size":5e9,"kv_ip":"192.168.0.137","kv_port":51000}'
+```
+
+- `MOONCAKE_CONFIG_PATH` is the path to the mooncake.json configuration file.
+- `VLLM_USE_MODELSCOPE` is optional; if you have access to Hugging Face, you can remove it.
+- The `--model` parameter specifies the model to use.
+- The `--port` parameter specifies the port on which the vLLM service listens.
+- The `--max-model-len` parameter specifies the maximum context length of the model.
+- The `--tensor_parallel_size` / `-tp` option is now supported, but you need to pass `--enforce_eager` to disable CUDA graph. Example: append `-tp 2 --enforce_eager` to the run command.
+  - If you want to run the prefill instance and the decode instance on the same node, set different `CUDA_VISIBLE_DEVICES` values, e.g., `CUDA_VISIBLE_DEVICES=0,1` for the prefill instance and `CUDA_VISIBLE_DEVICES=2,3` for the decode instance.
+- The `--kv-transfer-config` parameter specifies the connector and its configuration; a formatted example follows this list.
+  - Set `kv_connector` to `MooncakeConnector`.
+  - `kv_role` is the node's role, either `kv_producer` or `kv_consumer`.
+  - `kv_rank` is the rank of the instance. Currently, the `kv_producer`'s rank is 0 and the `kv_consumer`'s rank is 1.
+  - `kv_ip` and `kv_port` specify the IP address and port of the master node in a distributed setup for the disaggregated prefill feature. **_Be sure to set the same `kv_ip` and `kv_port` on every node._**
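+
+For reference, this is the producer-side `--kv-transfer-config` value from step 2, formatted for readability; the consumer side in step 3 differs only in `kv_role` and `kv_rank`:
+
+```json
+{
+    "kv_connector": "MooncakeConnector",
+    "kv_role": "kv_producer",
+    "kv_rank": 0,
+    "kv_parallel_size": 2,
+    "kv_buffer_size": 5e9,
+    "kv_ip": "192.168.0.137",
+    "kv_port": 51000
+}
+```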
+
+```bash
+# 4. Start the proxy server on one node (let's take the prefill node as an example)
+python3 proxy_server.py
+```
+The implementation of `proxy_server.py`:
+```python
+import os
+
+import aiohttp
+from quart import Quart, make_response, request
+
+AIOHTTP_TIMEOUT = aiohttp.ClientTimeout(total=6 * 60 * 60)
+
+app = Quart(__name__)
+
+
+async def forward_request(url, data):
+    # Forward the JSON body to a vLLM instance and stream the response back.
+    async with aiohttp.ClientSession(timeout=AIOHTTP_TIMEOUT) as session:
+        headers = {
+            "Authorization": f"Bearer {os.environ.get('OPENAI_API_KEY')}"
+        }
+        async with session.post(url=url, json=data,
+                                headers=headers) as response:
+            if response.status == 200:
+                async for chunk_bytes in response.content.iter_chunked(1024):
+                    yield chunk_bytes
+
+
+@app.route('/v1/completions', methods=['POST'])
+async def handle_request():
+    try:
+        original_request_data = await request.get_json()
+
+        prefill_request = original_request_data.copy()
+        # Set max_tokens to 1 so the prefill instance only performs prefill
+        prefill_request['max_tokens'] = 1
+
+        # Run the prefill stage and drain its single-token response
+        async for _ in forward_request('http://localhost:8100/v1/completions',
+                                       prefill_request):
+            continue
+
+        # Stream the decode stage back to the client
+        generator = forward_request('http://192.168.0.139:8200/v1/completions',  # Be sure to change the IP address for your machine
+                                    original_request_data)
+        response = await make_response(generator)
+        response.timeout = None
+
+        return response
+
+    except Exception as e:
+        import sys
+        import traceback
+        exc_info = sys.exc_info()
+        print("Error occurred in disagg prefill proxy server")
+        print(e)
+        print("".join(traceback.format_exception(*exc_info)))
+        return "Error occurred in disagg prefill proxy server", 500
+
+
+if __name__ == '__main__':
+    app.run(host="0.0.0.0", port=8000)
+```
+
+**_Be sure to change the IP address in the code._**
+
+
+## Test with an OpenAI-compatible request
+```bash
+curl -s http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{
+  "model": "Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4",
+  "prompt": "San Francisco is a",
+  "max_tokens": 1000
+}'
+```
+- If you are not testing on the proxy server itself, change `localhost` to the IP address of the proxy server.
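+
+Alternatively, you can issue the same request from Python. Below is a minimal sketch that assumes the official `openai` client package (v1+) is installed; since the proxy exposes an OpenAI-compatible endpoint, a placeholder API key is sufficient:
+
+```python
+from openai import OpenAI
+
+# Point the client at the proxy server; the API key is a placeholder.
+client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
+
+completion = client.completions.create(
+    model="Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4",
+    prompt="San Francisco is a",
+    max_tokens=1000,
+)
+print(completion.choices[0].text)
+```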

From 3c42ec6145e5b174f7f0e8fbae6e7cfd67c23f5c Mon Sep 17 00:00:00 2001
From: Shangming Cai
Date: Tue, 3 Dec 2024 20:09:41 +0800
Subject: [PATCH 2/6] docs: update README to add new vllm integration guide.

Signed-off-by: Shangming Cai
---
 README.md | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 2fb0bd7..415d386 100644
--- a/README.md
+++ b/README.md
@@ -67,11 +67,13 @@ Thanks to the high performance of Transfer Engine, P2P Stores can also distribut
 ![p2p-store.gif](image/p2p-store.gif)
 
-### vLLM Integration ([Guide](doc/en/vllm-integration.md))
+### vLLM Integration ([Guide v0.1](doc/en/vllm-integration.md), [v0.2-Nightly](doc/en/vllm-integration-new.md))
 To optimize LLM inference, the vLLM community is working on supporting [disaggregated prefilling (PR 8498)](https://github.com/vllm-project/vllm/pull/8498). This feature allows separating the **prefill** phase from the **decode** phase into different processes. vLLM uses `nccl` and `gloo` as the transport layer by default, but currently it cannot efficiently decouple the two phases across different machines.
 
 We have implemented vLLM integration, which uses Transfer Engine as the network layer instead of `nccl` and `gloo`, to support **inter-node KVCache transfer**. Transfer Engine provides a simpler interface and more efficient use of RDMA devices.
 
 In the future, we plan to build Mooncake Store on top of Transfer Engine to support pooled prefill/decode disaggregation.
 
+**_Update [Dec 3, 2024]: Here is the nightly vLLM integration ([Guide v0.2-Nightly](doc/en/vllm-integration-new.md)), which is based on vLLM's main branch._**
+
 #### Performance
 By supporting Topology Aware Path Selection and multi-card bandwidth aggregation, the TTFT of vLLM with Transfer Engine is up to 33% lower than with traditional TCP-based transports.
 In the future, we will further improve TTFT through GPUDirect RDMA and zero-copy.

From 0009f8227fea794f86110f7400bd849c3ea48f5d Mon Sep 17 00:00:00 2001
From: Shangming Cai
Date: Wed, 4 Dec 2024 10:12:40 +0800
Subject: [PATCH 3/6] docs: update filename of vllm integration guide v2.

Signed-off-by: Shangming Cai
---
 README.md | 4 ++--
 ...lm-integration-new.md => vllm-integration-v0.2-nightly.md} | 0
 2 files changed, 2 insertions(+), 2 deletions(-)
 rename doc/en/{vllm-integration-new.md => vllm-integration-v0.2-nightly.md} (100%)

diff --git a/README.md b/README.md
index 415d386..a6c7410 100644
--- a/README.md
+++ b/README.md
@@ -67,12 +67,12 @@ Thanks to the high performance of Transfer Engine, P2P Stores can also distribut
 ![p2p-store.gif](image/p2p-store.gif)
 
-### vLLM Integration ([Guide v0.1](doc/en/vllm-integration.md), [v0.2-Nightly](doc/en/vllm-integration-new.md))
+### vLLM Integration ([Guide v0.1](doc/en/vllm-integration.md), [v0.2-Nightly](doc/en/vllm-integration-v0.2-nightly.md))
 To optimize LLM inference, the vLLM community is working on supporting [disaggregated prefilling (PR 8498)](https://github.com/vllm-project/vllm/pull/8498). This feature allows separating the **prefill** phase from the **decode** phase into different processes. vLLM uses `nccl` and `gloo` as the transport layer by default, but currently it cannot efficiently decouple the two phases across different machines.
 
 We have implemented vLLM integration, which uses Transfer Engine as the network layer instead of `nccl` and `gloo`, to support **inter-node KVCache transfer**. Transfer Engine provides a simpler interface and more efficient use of RDMA devices.
 
 In the future, we plan to build Mooncake Store on top of Transfer Engine to support pooled prefill/decode disaggregation.
 
-**_Update [Dec 3, 2024]: Here is the nightly vLLM integration ([Guide v0.2-Nightly](doc/en/vllm-integration-new.md)), which is based on vLLM's main branch._**
+**_Update [Dec 3, 2024]: Here is the nightly vLLM integration ([Guide v0.2-Nightly](doc/en/vllm-integration-v0.2-nightly.md)), which is based on vLLM's main branch._**
 
 #### Performance
 By supporting Topology Aware Path Selection and multi-card bandwidth aggregation, the TTFT of vLLM with Transfer Engine is up to 33% lower than with traditional TCP-based transports.

diff --git a/doc/en/vllm-integration-new.md b/doc/en/vllm-integration-v0.2-nightly.md
similarity index 100%
rename from doc/en/vllm-integration-new.md
rename to doc/en/vllm-integration-v0.2-nightly.md

From 8d4249e1a5ffed200c882dabf86860e3a9bfc737 Mon Sep 17 00:00:00 2001
From: Shangming Cai
Date: Wed, 4 Dec 2024 10:16:32 +0800
Subject: [PATCH 4/6] docs: remove stale gif in vllm integration guide v2.

Signed-off-by: Shangming Cai
---
 doc/en/vllm-integration-v0.2-nightly.md | 2 --
 1 file changed, 2 deletions(-)

diff --git a/doc/en/vllm-integration-v0.2-nightly.md b/doc/en/vllm-integration-v0.2-nightly.md
index 330eba7..6eb51a7 100644
--- a/doc/en/vllm-integration-v0.2-nightly.md
+++ b/doc/en/vllm-integration-v0.2-nightly.md
@@ -3,8 +3,6 @@
 ## Overview
 This is the nightly version of the mooncake-transfer-engine integration with the vLLM project, based on [PR 10502](https://github.com/vllm-project/vllm/pull/10502) (vLLM version: v0.6.4.post1), to accelerate KVCache transfer for the inter-node disaggregated Prefill/Decode scenario. Benchmark results will be added soon.
 
-![vllm-integration-demo](../../image/vllm-integration-demo.gif)
-
 ## Installation
 ### Prerequisite
 Please install the Mooncake Transfer Engine according to the [instructions](build.md) first.

From 53383d301d8950349562ff5e8b8ea9165ea9651f Mon Sep 17 00:00:00 2001
From: Shangming Cai
Date: Wed, 4 Dec 2024 10:21:57 +0800
Subject: [PATCH 5/6] docs: add notes in vllm integration guide v2.

Signed-off-by: Shangming Cai
---
 doc/en/vllm-integration-v0.2-nightly.md | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/doc/en/vllm-integration-v0.2-nightly.md b/doc/en/vllm-integration-v0.2-nightly.md
index 6eb51a7..085ad8f 100644
--- a/doc/en/vllm-integration-v0.2-nightly.md
+++ b/doc/en/vllm-integration-v0.2-nightly.md
@@ -1,7 +1,9 @@
 # vLLM Disaggregated Prefill/Decode Demo
 
 ## Overview
-This is the nightly version of the mooncake-transfer-engine integration with the vLLM project, based on [PR 10502](https://github.com/vllm-project/vllm/pull/10502) (vLLM version: v0.6.4.post1), to accelerate KVCache transfer for the inter-node disaggregated Prefill/Decode scenario. Benchmark results will be added soon.
+This is the nightly version of the mooncake-transfer-engine integration with the vLLM project, based on [PR 10502](https://github.com/vllm-project/vllm/pull/10502) (vLLM version: v0.6.4.post1/main), to accelerate KVCache transfer for the inter-node disaggregated Prefill/Decode scenario. Benchmark results will be added soon.
+
+**_Please note that this is not a fully-ready version; it may be modified at any time based on feedback from the vLLM community._**
 
 ## Installation
 ### Prerequisite
 Please install the Mooncake Transfer Engine according to the [instructions](build.md) first.

From 688a3c83f40550e07249939b53a24da0969d01b1 Mon Sep 17 00:00:00 2001
From: Shangming Cai
Date: Wed, 4 Dec 2024 10:31:52 +0800
Subject: [PATCH 6/6] docs: change date of update note for vllm integration v2.

Signed-off-by: Shangming Cai
---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index a6c7410..b41629c 100644
--- a/README.md
+++ b/README.md
@@ -72,7 +72,7 @@ To optimize LLM inference, the vLLM community is working on supporting [disagg
 We have implemented vLLM integration, which uses Transfer Engine as the network layer instead of `nccl` and `gloo`, to support **inter-node KVCache transfer**. Transfer Engine provides a simpler interface and more efficient use of RDMA devices.
 
 In the future, we plan to build Mooncake Store on top of Transfer Engine to support pooled prefill/decode disaggregation.
 
-**_Update [Dec 3, 2024]: Here is the nightly vLLM integration ([Guide v0.2-Nightly](doc/en/vllm-integration-v0.2-nightly.md)), which is based on vLLM's main branch._**
+**_Update [Dec 4, 2024]: Here is the nightly vLLM integration ([Guide v0.2-Nightly](doc/en/vllm-integration-v0.2-nightly.md)), which is based on vLLM's main branch._**
 
 #### Performance
 By supporting Topology Aware Path Selection and multi-card bandwidth aggregation, the TTFT of vLLM with Transfer Engine is up to 33% lower than with traditional TCP-based transports.