
[Core][Frontend][Doc] Initial support for LLaVA-NeXT and GPT-4V Chat Completions API #3978

Closed
DarkLight1337 wants to merge 60 commits into main from openai-vision-api

Conversation

DarkLight1337 (Member) commented on Apr 10, 2024

To combat scope creep, this PR has been split into smaller ones.

The branch associated with this PR has been frozen (except for critical fixes). Once all dependencies have been merged, I will compare this branch against the merged (main) branch to verify that I didn't miss any changes.

- Refactor `OpenAIServingChat` and add a function for loading images
- Move `pillow` dev dependency to common
- Add example chat template for LLaVA model
- Add general guide for using VLMs
- Add LLaVA to list of supported models
- Move `ServerRunner` to common file
DarkLight1337 changed the title from "[Doc][Frontend] Extexnd OpenAI-compatible server to support GPT-4V Chat Completions API" to "[Doc][Frontend] Support GPT-4V Chat Completions API" on Apr 10, 2024
ywang96 self-assigned this on Apr 10, 2024
- Remove channel conversion and resizing from OpenAI server preprocessing since the image processor in HuggingFace should be able to handle that
- `MultiModalData` is now an abstract class that outputs additional kwargs to be input into the model. This was initially done to support LLaVA-NeXT's `image_size` parameter but can be extended to other models as well.
- The application of image processor is now defined inside `MultiModalData` so that there is no need to extensively edit the engine to support other types of data
- New `MultiModalData` subclasses: `ImagePixelData` and `ImageFeatureData` to better differentiate the two cases of image input (see the sketch after this list)
- Refactored the LLaVA-1.5 model to make it easier to inherit from when defining the LLaVA-NeXT model
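
A rough sketch of the shape this refactor takes (the class names come from this PR, but the method names and signatures below are illustrative assumptions, not the actual vLLM code):

```python
from abc import ABC, abstractmethod
from typing import Any, Dict

import torch
from PIL import Image


class MultiModalData(ABC):
    """Abstract container for multi-modal input.

    Each subclass defines how its data is preprocessed and which extra
    keyword arguments are forwarded into the model (e.g. LLaVA-NeXT's
    ``image_size``), so the engine does not need model-specific edits.
    """

    @abstractmethod
    def get_model_kwargs(self, hf_processor: Any) -> Dict[str, torch.Tensor]:
        """Return the extra kwargs to unpack into the model's forward()."""
        raise NotImplementedError


class ImagePixelData(MultiModalData):
    """Raw image pixels; the HuggingFace image processor is applied here,
    so the server no longer needs to resize or convert channels itself."""

    def __init__(self, image: Image.Image) -> None:
        self.image = image

    def get_model_kwargs(self, hf_processor: Any) -> Dict[str, torch.Tensor]:
        return dict(hf_processor(self.image, return_tensors="pt"))


class ImageFeatureData(MultiModalData):
    """Precomputed image features that bypass the image processor."""

    def __init__(self, image_features: torch.Tensor) -> None:
        self.image_features = image_features

    def get_model_kwargs(self, hf_processor: Any) -> Dict[str, torch.Tensor]:
        return {"image_features": self.image_features}
```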
DarkLight1337 (Member, Author) commented on Apr 18, 2024

@ywang96 Regarding your latest comment in #3042:

I'm working on a RFC to share some thoughts for refactoring and will send out tomorrow.

Actually, I have been working on supporting LLaVA-NeXT as well. As part of that effort, I have further refactored the image processing pipeline to output a dictionary whose entries are passed as kwargs into the model, in order to accept `image_size`. This preserves the contract between the output of the HuggingFace processor and the input to the HuggingFace model. As long as those keyword arguments do not conflict with the ones we already have in vLLM, I think this is a good way to make the framework flexible enough to support other multi-modal architectures.
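
For reference, a minimal (non-vLLM) sketch of the HuggingFace contract being preserved here: whatever dictionary the processor produces is unpacked directly into the model call, so model-specific keys such as LLaVA-NeXT's `image_sizes` flow through without the engine needing to know about them. The checkpoint name is just an example, and dtype/device handling is omitted for brevity:

```python
from PIL import Image
from transformers import LlavaNextForConditionalGeneration, LlavaNextProcessor

model_id = "llava-hf/llava-v1.6-mistral-7b-hf"  # example checkpoint
processor = LlavaNextProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(model_id)

image = Image.new("RGB", (672, 672))  # stand-in for a real image
prompt = "[INST] <image>\nWhat is shown in this image? [/INST]"

# The processor returns a dict (input_ids, attention_mask, pixel_values,
# image_sizes, ...); unpacking it as **kwargs is the contract we preserve.
inputs = processor(text=prompt, images=image, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=32)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```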

ywang96 (Member) commented on Apr 18, 2024

As part of that effort, I have further refactored the image processing pipeline to output a dictionary whose entries are passed as kwargs into the model, in order to accept `image_size`.

Yep - this is exactly what I had in mind as well, but I think there are more issues beyond it that we may want to address. For example, do we want to support the prompt format the same way as HuggingFace to make the user experience easier, or at least keep it the same at the interface level? I do think these are worth discussing (and the refactoring can happen later, after we merge model support and API server support, depending on how much work it will be).

- Now, `ImagePixelData` only accepts `PIL.Image` input
- Also move the `torch` import out of `TYPE_CHECKING` as it is loaded anyway when importing `SamplingParams`
- Note the patch in `ImagePixelData`. To fully leverage the potential of LLaVA-NeXT, we should allow images of any size, but the feature size would then be variable.
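
In terms of the sketch above, usage after this change might look like the following (the file name is just a placeholder):

```python
from PIL import Image

# Before: callers had to supply preprocessed tensors (e.g. loaded from .pt files).
# Now: pass the PIL image itself; the image processor is applied inside
# ImagePixelData when the engine asks for the model kwargs.
image = Image.open("example.jpg").convert("RGB")
data = ImagePixelData(image)
```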
DarkLight1337 changed the title from "[Core][Frontend][Doc] Support image processing for VLMs and GPT-4V Chat Completions API" to "[Core][Frontend][Doc] Improved support for VLMs and add GPT-4V Chat Completions API" on Apr 18, 2024
DarkLight1337 changed the title from "[Core][Frontend][Doc] Improved support for VLMs and add GPT-4V Chat Completions API" to "[Core][Frontend][Doc] Initial support for LLaVA-NeXT and add GPT-4V Chat Completions API" on Apr 18, 2024
DarkLight1337 (Member, Author) commented on Apr 18, 2024

I have just added support for LLaVA-NeXT, with one big caveat: the size of the input image is fixed; otherwise, the feature size (i.e. the number of `<image>` tokens to duplicate) would vary depending on the runtime input. This prevents us from taking full advantage of the extra resolution. Still, it gives us access to a 34B model, which should improve over the 7B and 13B LLaVA-1.5 models.
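
To illustrate the caveat, here is a toy sketch of why a fixed input size matters: the prompt must contain exactly as many `<image>` placeholder tokens as the vision tower produces features, and that count is only known ahead of time when the image size is fixed. The 576 below is the LLaVA-1.5 figure for a 336x336 input with 14x14 patches; for LLaVA-NeXT the count would otherwise depend on the runtime image.

```python
IMAGE_TOKEN = "<image>"

def expand_image_tokens(prompt: str, feature_size: int) -> str:
    """Duplicate the single <image> placeholder `feature_size` times so the
    token sequence lines up with the image features injected by the model."""
    return prompt.replace(IMAGE_TOKEN, IMAGE_TOKEN * feature_size, 1)

# With a fixed 336x336 input and 14x14 patches: (336 // 14) ** 2 == 576 tokens.
expanded = expand_image_tokens("USER: <image>\nWhat is in the picture? ASSISTANT:", 576)
assert expanded.count(IMAGE_TOKEN) == 576
```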

DarkLight1337 changed the title from "[Core][Frontend][Doc] Initial support for LLaVA-NeXT and add GPT-4V Chat Completions API" to "[Core][Frontend][Doc] Initial support for LLaVA-NeXT and GPT-4V Chat Completions API" on Apr 18, 2024
DarkLight1337 (Member, Author) commented on Apr 19, 2024

These force pushes consolidate the fixes to the LLaVA test and example code.

- Note that we now load the images directly instead of from `.pt` files
DarkLight1337 (Member, Author) commented on Apr 19, 2024

@ywang96 I think that this PR is suffering from scope creep. Perhaps I should break apart the changes into smaller segments to facilitate the conversation in #4194? I could split the changes as follows, with each item being its own PR:

  1. VLM backend
    a. Refactor MultiModalData to support image processing; refactor LLaVA-1.5 accordingly. ([Core] Support image processor #4197)
    b. Introduce LLaVA-NeXT along with the refactored LLaVA-1.5 ([Model] Initial support for LLaVA-NeXT #4199) [depends on 1(a)]
  2. OpenAI API server
    a. Refactor the OpenAI backend ([Frontend] Refactor prompt processing #4028)
    b. Add GPT-4V support and provide a LLaVA chat template ([Frontend] Support GPT-4V Chat Completions API #4200) [depends on 1(a) and 2(a)]

Edit: Added links to the child PRs.

ywang96 (Member) commented on Apr 19, 2024

@ywang96 I think that this PR is suffering from scope creep. Perhaps I should break apart the changes into smaller segments to facilitate the conversation in #4194? I could split the changes as follows (listed in the form of a dependency tree):

  1. VLM backend
    a. Refactor MultiModalData to support image processing.
    b. Introduce LLaVA-NeXT along with the refactored LLaVA-1.5
  2. OpenAI API server
    a. Refactor the OpenAI backend (i.e. Support VLM model and GPT4V API #2058)
    b. Add GPT-4V support [also depends on 1(a)]

I agree - I think the OpenAI API server will be a good starting point since the interface should agree with the OpenAI protocol anyway, and I'm sorry that this PR suffered :/

One suggestion I have for a big change like this: it's probably good to have a series of PRs anyway. Take a look at Speculative Decoding or Chunked Prefill - those are great examples.

DarkLight1337 (Member, Author) commented on Apr 19, 2024

I have created the child PRs.

- These changes have been propagated to the child PRs
DarkLight1337 (Member, Author) commented:
All of the child PRs have been completed, so I'm closing this now.

DarkLight1337 deleted the openai-vision-api branch on June 10, 2024 at 15:42