
docs: add instructions for Langchain #1162

Merged
merged 2 commits into vllm-project:main on Nov 30, 2023

Conversation

mspronesti
Contributor

Hi! A month ago I made a few contributions to Langchain to support vLLM and its OpenAI-compatible server.

This PR updates vLLM's documentation to showcase how to use vLLM with Langchain and to point to the tutorial I wrote there for further details. I hope you find this contribution meaningful.
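
For reference, the LangChain-side usage the new docs describe looks roughly like the sketch below; the model name, parameter values, and server URL are illustrative rather than taken from this PR.

from langchain.llms import VLLM, VLLMOpenAI

# Offline inference through vLLM:
llm = VLLM(
    model="mosaicml/mpt-7b",   # any Hugging Face model supported by vLLM
    trust_remote_code=True,    # needed for some Hugging Face models
    max_new_tokens=128,
    temperature=0.8,
)
print(llm("What is the capital of France?"))

# Against an already-running OpenAI-compatible vLLM server:
client = VLLMOpenAI(
    openai_api_key="EMPTY",
    openai_api_base="http://localhost:8000/v1",
    model_name="mosaicml/mpt-7b",
)
print(client("What is the capital of France?"))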

mspronesti changed the title from "docs: add instruction for langchain" to "docs: add instructions for Langchain" on Sep 23, 2023
@pvtoan

pvtoan commented Nov 2, 2023

Hi @mspronesti, does this LangChain-VLLM support quantized models?

Because the vllm-project already supports quantized models (AWQ format), as shown in #1032.

However, when I do the same and just pass quantization='awq' to your LangChain-VLLM, it does not seem to work and just shows OOM.

model_path = "/home/quadrep/toan/projects/LLMs/weights/vicuna-33B-AWQ"
model = VLLM(model=model_path, tensor_parallel_size=2, trust_remote_code=True, max_new_tokens=512, quantization='awq')
--> Error: torch.cuda.OutOfMemoryError: CUDA out of memory.

@mspronesti
Contributor Author

Hi @pvtoan, quantization is not an explicit parameter of LangChain's VLLM wrapper. You need to pass it as follows:

model = VLLM(
  model=model_path, 
  tensor_parallel_size=2, 
  trust_remote_code=True, 
  max_new_tokens=512, 
  vllm_kwargs={"quantization": "awq"}
)

Also, notice that tensor_parallel_size=2 implies you want to serve the model in a distributed manner, with 2 GPUs. Hope this helps :)
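
For reference, vllm_kwargs is forwarded to the underlying vLLM engine, so the wrapper call above is roughly equivalent to constructing the engine directly. A minimal sketch (not the wrapper's exact code):

from vllm import LLM

# Extra keys from vllm_kwargs end up here alongside the explicit arguments;
# sampling-related options such as max_new_tokens are handled separately.
engine = LLM(
    model=model_path,
    tensor_parallel_size=2,
    trust_remote_code=True,
    quantization="awq",   # i.e. vllm_kwargs={"quantization": "awq"}
)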

@mspronesti
Contributor Author

Btw @WoosukKwon, I have also resolved the conflicts in this PR, in case you are interested in merging it.

@pvtoan

pvtoan commented Nov 3, 2023

Hi @mspronesti, yes, it works now and it indeed uses 2 GPUs.

Thank you so much for your help!

Besides, I'd like to ask whether something is wrong when I load the vicuna-33B-AWQ model (around 18GB in total, split across 2 shards) and one embedding model (about 1.3GB) onto two RTX 4090 GPUs (24GB each), because the loaded models occupy almost all 48GB of the two GPUs.

In particular, both GPUs' memory usage increased simultaneously from 0GB to around 9GB, then grew to 14GB, and finally to 23.9GB.

My DRAM usage also increased from 6GB to around 28GB. My OS is Ubuntu 22.04.

Do you think this occupied memory, both VRAM and DRAM, is normal?
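
As a side note, vLLM pre-allocates GPU memory for its KV cache (the gpu_memory_utilization engine argument, which defaults to 0.9 of each GPU), so near-full VRAM usage is expected. A sketch of lowering that cap through vllm_kwargs to leave headroom for the embedding model; the 0.7 value is illustrative:

model = VLLM(
    model=model_path,
    tensor_parallel_size=2,
    trust_remote_code=True,
    max_new_tokens=512,
    vllm_kwargs={
        "quantization": "awq",
        "gpu_memory_utilization": 0.7,  # default is 0.9; leaves VRAM for other models
    },
)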

zhuohan123 added the documentation (Improvements or additions to documentation) label on Nov 21, 2023
simon-mo merged commit 05a3861 into vllm-project:main on Nov 30, 2023
@simon-mo
Collaborator

Thank you for your contribution!

xjpang pushed a commit to xjpang/vllm that referenced this pull request on Dec 4, 2023
hongxiayang pushed a commit to hongxiayang/vllm that referenced this pull request on Feb 13, 2024