[Model] Add multi-image support for minicpmv offline inference #7122
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge). To run full CI, you can do one of these:
@DarkLight1337 Do you have time to review this? See if the modifications are fine for
Some comments to clean up the code.
Co-authored-by: Cyrus Leung <[email protected]>
We can merge this after you have added tests that check the model's behaviour for multi-image input.
/ready
Can you move
There seem to be slight differences compared to the HF version, but overall the results are reasonable.
I'm going to change the title to indicate that this only works for offline inference. Thank you for the PR!
Thanks for implementing this! Does this support interleaved text-image reasoning?
Of course! Just don't use chat_template; construct the corresponding prompt yourself.
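As a rough illustration of constructing such a prompt by hand, the sketch below interleaves text segments with image slots in order. The `IMAGE` sentinel and `build_interleaved_prompt` helper are hypothetical names, and the `(<image>./</image>)` placeholder string is an assumption about what MiniCPM-V expects; check it against the model's own documentation before relying on it.

```python
IMAGE = object()  # hypothetical sentinel marking where an image goes
IMAGE_PLACEHOLDER = "(<image>./</image>)"  # assumed MiniCPM-V placeholder text


def build_interleaved_prompt(segments):
    """Join text segments and image slots, in order, into one prompt string."""
    parts = [IMAGE_PLACEHOLDER if seg is IMAGE else seg for seg in segments]
    return "\n".join(parts)


prompt = build_interleaved_prompt([
    "Here is the first chart:", IMAGE,
    "and here is the second:", IMAGE,
    "Which one shows faster growth?",
])
print(prompt)
```

The images themselves would still be supplied separately, in the same order as the placeholders appear in the prompt.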
Co-authored-by: hezhihui <[email protected]> Co-authored-by: Cyrus Leung <[email protected]>
Does this support video inference with the vLLM OpenAI server? If so, could you show an example?
Not supported yet. It will be addressed in another PR.
For now you can pass in a video via multi-image input. We already have an example of this in the docs about VLMs.
Co-authored-by: hezhihui <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> Signed-off-by: Alvant <[email protected]>
Since it needs only a few changes outside minicpmv.py, I added multi-image support for it first, and I'm running more tests on it. You can use multi-image inputs as follows:
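The original snippet is not preserved here, so what follows is only a minimal sketch of the general shape of a multi-image request, not the author's code. It assumes vLLM's offline prompt format (a dict with `prompt` and `multi_modal_data` keys, with a list of images under `"image"`) and assumes `(<image>./</image>)` as the MiniCPM-V placeholder. `build_multi_image_request` is a hypothetical helper, and the strings stand in for real PIL images so the sketch runs without vLLM or a model.

```python
# Sketch only: builds the request dict; it does not import vLLM or load a model.
IMAGE_PLACEHOLDER = "(<image>./</image>)"  # assumed MiniCPM-V placeholder text


def build_multi_image_request(question, images):
    """Return a prompt dict with one image placeholder per supplied image."""
    placeholders = "\n".join(IMAGE_PLACEHOLDER for _ in images)
    return {
        "prompt": f"{placeholders}\n{question}",
        "multi_modal_data": {"image": list(images)},
    }


# Stand-ins for PIL.Image objects (hypothetical values for illustration).
request = build_multi_image_request(
    "What differs between these images?", ["img_a", "img_b"]
)
# With vLLM installed, this dict would be passed along the lines of:
#   llm.generate(request, sampling_params)
print(request["prompt"])
```

In a real run the prompt would normally also need the model's chat formatting applied, and the engine would need to be configured to accept more than one image per prompt.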