
[Core][Frontend][Doc] Initial support for LLaVA-NeXT and GPT-4V Chat Completions API #3978

Closed
DarkLight1337 wants to merge 60 commits into main from openai-vision-api

Conversation

DarkLight1337 (Member) commented on Apr 10, 2024

To combat scope creep, this PR has been split into smaller ones.

The branch associated with this PR has been frozen (except for critical fixes). Once all dependencies have been merged, I will compare this branch against the merged (main) branch to verify that I didn't miss any changes.

- Refactor `OpenAIServingChat` and add a function for loading images
- Move `pillow` dev dependency to common
- Add example chat template for LLaVA model
- Add general guide for using VLMs
- Add LLaVA to list of supported models
- Move `ServerRunner` to common file
DarkLight1337 changed the title from "[Doc][Frontend] Extexnd OpenAI-compatible server to support GPT-4V Chat Completions API" to "[Doc][Frontend] Support GPT-4V Chat Completions API" on Apr 10, 2024
ywang96 self-assigned this on Apr 10, 2024
- Remove channel conversion and resizing from OpenAI server preprocessing since the image processor in HuggingFace should be able to handle that
- `MultiModalData` is now an abstract class that outputs additional kwargs to be input into the model. This was initially done to support LLaVA-NeXT's `image_size` parameter but can be extended to other models as well.
- The application of image processor is now defined inside `MultiModalData` so that there is no need to extensively edit the engine to support other types of data
- New `MultiModalData` subclasses: `ImagePixelData` and `ImageFeatureData` to better differentiate the two cases of image input (see the sketch after this list)
- Refactored the LLaVA-1.5 model to make it easier to inherit from when defining the LLaVA-NeXT model
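
A rough sketch of the shape this refactor takes (the class names come from this PR, but the method names and signatures below are illustrative assumptions, not the actual vLLM code):

```python
from abc import ABC, abstractmethod
from typing import Any, Dict

import torch
from PIL import Image


class MultiModalData(ABC):
    """Abstract container for multi-modal input.

    Each subclass defines how its data is preprocessed and which extra
    keyword arguments are forwarded into the model (e.g. LLaVA-NeXT's
    ``image_size``), so the engine does not need model-specific edits.
    """

    @abstractmethod
    def get_model_kwargs(self, hf_processor: Any) -> Dict[str, torch.Tensor]:
        """Return the extra kwargs to unpack into the model's forward()."""
        raise NotImplementedError


class ImagePixelData(MultiModalData):
    """Raw image pixels; the HuggingFace image processor is applied here,
    so the server no longer needs to resize or convert channels itself."""

    def __init__(self, image: Image.Image) -> None:
        self.image = image

    def get_model_kwargs(self, hf_processor: Any) -> Dict[str, torch.Tensor]:
        return dict(hf_processor(self.image, return_tensors="pt"))


class ImageFeatureData(MultiModalData):
    """Precomputed image features that bypass the image processor."""

    def __init__(self, image_features: torch.Tensor) -> None:
        self.image_features = image_features

    def get_model_kwargs(self, hf_processor: Any) -> Dict[str, torch.Tensor]:
        return {"image_features": self.image_features}
```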
DarkLight1337 (Member, Author) commented on Apr 18, 2024

@ywang96 Regarding your latest comment in #3042:

I'm working on a RFC to share some thoughts for refactoring and will send out tomorrow.

Actually, I have been working on supporting LLaVA-NeXT as well. As part of that effort, I have further refactored the image processing pipeline to output a dictionary whose entries are passed as kwargs into the model, in order to accept `image_size`. This preserves the contract between the output of the HuggingFace processor and the input to the HuggingFace model. As long as those keyword arguments do not conflict with the ones we already have in vLLM, I think this is a good way to make the framework flexible enough to support other multi-modal architectures.
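
For reference, a minimal (non-vLLM) sketch of the HuggingFace contract being preserved here: whatever dictionary the processor produces is unpacked directly into the model call, so model-specific keys such as LLaVA-NeXT's `image_sizes` flow through without the engine needing to know about them. The checkpoint name is just an example, and dtype/device handling is omitted for brevity:

```python
from PIL import Image
from transformers import LlavaNextForConditionalGeneration, LlavaNextProcessor

model_id = "llava-hf/llava-v1.6-mistral-7b-hf"  # example checkpoint
processor = LlavaNextProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(model_id)

image = Image.new("RGB", (672, 672))  # stand-in for a real image
prompt = "[INST] <image>\nWhat is shown in this image? [/INST]"

# The processor returns a dict (input_ids, attention_mask, pixel_values,
# image_sizes, ...); unpacking it as **kwargs is the contract we preserve.
inputs = processor(text=prompt, images=image, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=32)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```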

ywang96 (Member) commented on Apr 18, 2024

As part of that effort, I have further refactored the image processing pipeline to output a dictionary whose entries are passed as kwargs into the model, in order to accept `image_size`.

Yep - this is exactly what I had in mind as well, but I think there are more issues beyond it that we may want to address. For example, do we want to support the prompt format the same way as HuggingFace to make the user experience easier, or at least keep it the same at the interface level? I do think these are worth discussing (and the refactoring can happen later, after we merge model support and API server support, depending on how much work it will be).

- Now, `ImagePixelData` only accepts `PIL.Image` input
- Also move the `torch` import out of `TYPE_CHECKING` as it is loaded anyway when importing `SamplingParams`
- Note the patch in `ImagePixelData`. To fully leverage the potential of LLaVA-NeXT, we should allow images of any size, but the feature size would then be variable.
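
In terms of the sketch above, usage after this change might look like the following (the file name is just a placeholder):

```python
from PIL import Image

# Before: callers had to supply preprocessed tensors (e.g. loaded from .pt files).
# Now: pass the PIL image itself; the image processor is applied inside
# ImagePixelData when the engine asks for the model kwargs.
image = Image.open("example.jpg").convert("RGB")
data = ImagePixelData(image)
```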
DarkLight1337 changed the title from "[Core][Frontend][Doc] Support image processing for VLMs and GPT-4V Chat Completions API" to "[Core][Frontend][Doc] Improved support for VLMs and add GPT-4V Chat Completions API" on Apr 18, 2024
DarkLight1337 changed the title from "[Core][Frontend][Doc] Improved support for VLMs and add GPT-4V Chat Completions API" to "[Core][Frontend][Doc] Initial support for LLaVA-NeXT and add GPT-4V Chat Completions API" on Apr 18, 2024
DarkLight1337 (Member, Author) commented on Apr 18, 2024

I have just added support for LLaVA-NeXT, with one big caveat: the size of the input image is fixed; otherwise, the feature size (i.e. the number of `<image>` tokens to duplicate) would vary depending on the runtime input. This prevents us from taking full advantage of the extra resolution. Still, it gives us access to a 34B model, which should improve over the 7B and 13B LLaVA-1.5 models.
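
To illustrate the caveat, here is a toy sketch of why a fixed input size matters: the prompt must contain exactly as many `<image>` placeholder tokens as the vision tower produces features, and that count is only known ahead of time when the image size is fixed. The 576 below is the LLaVA-1.5 figure for a 336x336 input with 14x14 patches; for LLaVA-NeXT the count would otherwise depend on the runtime image.

```python
IMAGE_TOKEN = "<image>"

def expand_image_tokens(prompt: str, feature_size: int) -> str:
    """Duplicate the single <image> placeholder `feature_size` times so the
    token sequence lines up with the image features injected by the model."""
    return prompt.replace(IMAGE_TOKEN, IMAGE_TOKEN * feature_size, 1)

# With a fixed 336x336 input and 14x14 patches: (336 // 14) ** 2 == 576 tokens.
expanded = expand_image_tokens("USER: <image>\nWhat is in the picture? ASSISTANT:", 576)
assert expanded.count(IMAGE_TOKEN) == 576
```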

DarkLight1337 changed the title from "[Core][Frontend][Doc] Initial support for LLaVA-NeXT and add GPT-4V Chat Completions API" to "[Core][Frontend][Doc] Initial support for LLaVA-NeXT and GPT-4V Chat Completions API" on Apr 18, 2024
DarkLight1337 (Member, Author) commented on Apr 19, 2024

These force pushes consolidate the fixes to the LLaVA test and example code.

- Note that we now load the images directly instead of from `.pt` files
DarkLight1337 (Member, Author) commented on Apr 19, 2024

@ywang96 I think that this PR is suffering from scope creep. Perhaps I should break apart the changes into smaller segments to facilitate the conversation in #4194? I could split the changes as follows, with each item being its own PR:

  1. VLM backend
    a. Refactor MultiModalData to support image processing; refactor LLaVA-1.5 accordingly. ([Core] Support image processor #4197)
    b. Introduce LLaVA-NeXT along with the refactored LLaVA-1.5 ([Model] Initial support for LLaVA-NeXT #4199) [depends on 1(a)]
  2. OpenAI API server
    a. Refactor the OpenAI backend ([Frontend] Refactor prompt processing #4028)
    b. Add GPT-4V support and provide a LLaVA chat template ([Frontend] Support GPT-4V Chat Completions API #4200) [depends on 1(a) and 2(a)]

Edit: Added links to the child PRs.

ywang96 (Member) commented on Apr 19, 2024

@ywang96 I think that this PR is suffering from scope creep. Perhaps I should break apart the changes into smaller segments to facilitate the conversation in #4194? I could split the changes as follows (listed in the form of a dependency tree):

  1. VLM backend
    a. Refactor MultiModalData to support image processing.
    b. Introduce LLaVA-NeXT along with the refactored LLaVA-1.5
  2. OpenAI API server
    a. Refactor the OpenAI backend (i.e. Support VLM model and GPT4V API #2058)
    b. Add GPT-4V support [also depends on 1(a)]

I agree - I think the OpenAI API server will be a good starting point since the interface should agree with the OpenAI protocol anyway, and I'm sorry that this PR suffered :/

One suggestion I have for a big change like this: it's probably good to have a series of PRs anyway. Take a look at Speculative Decoding or Chunked Prefill - those are great examples.

DarkLight1337 (Member, Author) commented on Apr 19, 2024

I have created the child PRs.

- These changes have been propagated to the child PRs
DarkLight1337 (Member, Author) commented:
All of the child PRs have been completed, so I'm closing this now.

DarkLight1337 deleted the openai-vision-api branch on June 10, 2024 at 15:42