
docs: add instructions for Langchain #1162

Merged
merged 2 commits into vllm-project:main on Nov 30, 2023

Conversation

mspronesti
Contributor

Hi! A month ago I made a few contributions to Langchain to support vLLM and its OpenAI-compatible server.

This PR updates vLLM's documentation to showcase how to use vLLM with Langchain and to point to the tutorial I wrote there for further details. I hope you find this contribution meaningful.
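
For reference, the LangChain-side usage the new docs describe looks roughly like the sketch below; the model name, parameter values, and server URL are illustrative rather than taken from this PR.

from langchain.llms import VLLM, VLLMOpenAI

# Offline inference through vLLM:
llm = VLLM(
    model="mosaicml/mpt-7b",   # any Hugging Face model supported by vLLM
    trust_remote_code=True,    # needed for some Hugging Face models
    max_new_tokens=128,
    temperature=0.8,
)
print(llm("What is the capital of France?"))

# Against an already-running OpenAI-compatible vLLM server:
client = VLLMOpenAI(
    openai_api_key="EMPTY",
    openai_api_base="http://localhost:8000/v1",
    model_name="mosaicml/mpt-7b",
)
print(client("What is the capital of France?"))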

mspronesti changed the title from "docs: add instruction for langchain" to "docs: add instructions for Langchain" on Sep 23, 2023
@pvtoan

pvtoan commented Nov 2, 2023

Hi @mspronesti, does this LangChain-VLLM support quantized models?

Because the vllm-project already supports quantized models (AWQ format), as shown in #1032.

However, when I do the same and just pass quantization='awq' to your LangChain-VLLM, it does not seem to work and just shows OOM.

model_path = "/home/quadrep/toan/projects/LLMs/weights/vicuna-33B-AWQ"
model = VLLM(model=model_path, tensor_parallel_size=2, trust_remote_code=True, max_new_tokens=512, quantization='awq')
--> Error: torch.cuda.OutOfMemoryError: CUDA out of memory.

@mspronesti
Contributor Author

Hi @pvtoan, quantization is not an explicit parameter of LangChain's VLLM wrapper. You need to pass it as follows:

model = VLLM(
  model=model_path, 
  tensor_parallel_size=2, 
  trust_remote_code=True, 
  max_new_tokens=512, 
  vllm_kwargs={"quantization": "awq"}
)

Also, notice that tensor_parallel_size=2 implies you want to serve the model in a distributed manner, with 2 GPUs. Hope this helps :)
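
For reference, vllm_kwargs is forwarded to the underlying vLLM engine, so the wrapper call above is roughly equivalent to constructing the engine directly. A minimal sketch (not the wrapper's exact code):

from vllm import LLM

# Extra keys from vllm_kwargs end up here alongside the explicit arguments;
# sampling-related options such as max_new_tokens are handled separately.
engine = LLM(
    model=model_path,
    tensor_parallel_size=2,
    trust_remote_code=True,
    quantization="awq",   # i.e. vllm_kwargs={"quantization": "awq"}
)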

@mspronesti
Contributor Author

Btw @WoosukKwon, I have also resolved the conflicts in this PR, in case you are interested in merging it.

@pvtoan

pvtoan commented Nov 3, 2023

Hi @mspronesti, yes, it works now and it indeed uses 2 GPUs.

Thank you so much for your help!

Besides, I'd like to ask whether something is wrong when I load the vicuna-33B-AWQ model (around 18GB in total, split across 2 shards) and one embedding model (about 1.3GB) onto two RTX 4090 GPUs (24GB each), because the loaded models occupy almost all 48GB of the two GPUs.

In particular, both GPUs' memory usage increased simultaneously from 0GB to around 9GB, then grew to 14GB, and finally to 23.9GB.

My DRAM usage also increased from 6GB to around 28GB. My OS is Ubuntu 22.04.

Do you think this occupied memory, both VRAM and DRAM, is normal?
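
As a side note, vLLM pre-allocates GPU memory for its KV cache (the gpu_memory_utilization engine argument, which defaults to 0.9 of each GPU), so near-full VRAM usage is expected. A sketch of lowering that cap through vllm_kwargs to leave headroom for the embedding model; the 0.7 value is illustrative:

model = VLLM(
    model=model_path,
    tensor_parallel_size=2,
    trust_remote_code=True,
    max_new_tokens=512,
    vllm_kwargs={
        "quantization": "awq",
        "gpu_memory_utilization": 0.7,  # default is 0.9; leaves VRAM for other models
    },
)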

zhuohan123 added the documentation (Improvements or additions to documentation) label on Nov 21, 2023
simon-mo merged commit 05a3861 into vllm-project:main on Nov 30, 2023
@simon-mo
Collaborator

Thank you for your contribution!

xjpang pushed a commit to xjpang/vllm that referenced this pull request on Dec 4, 2023
hongxiayang pushed a commit to hongxiayang/vllm that referenced this pull request on Feb 13, 2024