[Model] Enable optional prefix when loading embedding models #10639
Conversation
Signed-off-by: DarkLight1337 <[email protected]>
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. 🚀
LGTM!
Is there any code example to follow to get embeddings?
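(A minimal offline sketch, assuming a vLLM build where embedding models are served through `LLM.encode`; the model name below is only an example.)

```python
# Minimal sketch of offline embedding extraction with vLLM. Assumes a vLLM
# version that supports embedding models via LLM.encode; the model name is
# only an example.
from vllm import LLM

llm = LLM(model="intfloat/e5-mistral-7b-instruct", enforce_eager=True)

# encode() returns one EmbeddingRequestOutput per prompt.
outputs = llm.encode(["Hello, my name is", "The capital of France is"])
for output in outputs:
    embedding = output.outputs.embedding  # list of floats
    print(len(embedding))
```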
@DarkLight1337 It gives me the following error: `RuntimeError: stack expects each tensor to be equal size, but got [6, 32000] at entry 0 and [8, 32000] at entry 1`
Can you open a new issue and provide more details there?
I opened a new issue, #10673. Please check.
Hi @DarkLight1337
You can try the Dockerfile in this section |
docker compose: [collapsed details not captured]
error output: [collapsed details not captured]
I had to use the latest ✅ commit from main, since yours (fe25236) failed to download. I also had to adjust the commands. To my knowledge, the embedding model I use (https://huggingface.co/HIT-TMG/KaLM-embedding-multilingual-mini-instruct-v1) is trained from Qwen/Qwen2-0.5B.
Can you go into the Docker container and show the output?
Output: [collapsed details not captured]
I did not add a GPU on purpose, since the goal is CPU embedding serving.
@Isotr0py can you help look into this?
@githebs It seems that the Docker image is not a CPU installation; you need to use a CPU installation image built from `Dockerfile.cpu`.
I have no idea where to get that for the latest releases, since the docs state "vLLM provides wheels for Linux running on a x86 platform with CUDA 12", so CPU is out of the equation there (unless doing a full build). Is there a timeline for a new release with that fix? And if so, is this embedding model, derived from Qwen, supported? Thanks
Hmmm, we only release pre-built wheels for GPU currently... So if you want to serve models with CPU, you need to use the CPU docker image (this is always synced with the latest commit) or build the CPU backend from source manually (this should be fast, within ~5 min).
Some embedding models use the checkpoint of `*ForCausalLM`, while others use `*Model`, yet their architecture names might not always match the expected weights. To improve flexibility, this PR enables loading embedding models (`*EmbeddingModel` in vLLM) using the weights of either checkpoint format.

FIX #10193 (comment)
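A rough illustration of the idea (not the PR's actual implementation): tolerate an optional `model.` prefix and skip `lm_head.*` weights when mapping a checkpoint onto the embedding model. The helper names below are hypothetical.

```python
from typing import Iterable, Optional, Tuple

import torch
import torch.nn as nn


def _normalize(name: str, prefix: str = "model.") -> Optional[str]:
    """Map a checkpoint weight name onto the embedding model's namespace."""
    # *ForCausalLM checkpoints include a language-model head that an
    # embedding model does not need; skip it.
    if name.startswith("lm_head."):
        return None
    # *ForCausalLM checkpoints nest the transformer under "model.", while
    # *Model checkpoints store the same tensors without that prefix.
    if name.startswith(prefix):
        return name[len(prefix):]
    return name


def load_weights(model: nn.Module,
                 weights: Iterable[Tuple[str, torch.Tensor]]) -> None:
    """Load either checkpoint format into the same embedding model."""
    params = dict(model.named_parameters())
    for name, tensor in weights:
        target = _normalize(name)
        if target is None or target not in params:
            continue
        params[target].data.copy_(tensor)
```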