
Llama3.2 vision model support #1551

Merged: 48 commits merged from llama-3.2 into main on Oct 21, 2024

Conversation

hnyls2002 (Collaborator) commented on Oct 1, 2024

Motivation

  • Support the encoder-decoder architecture in SGLang.
  • Support the Llama vision model.
  • Support CUDA graph and prefix cache for the Llama vision model.

Note that to support CUDA graph for an encoder-decoder architecture like Llama vision (mllama), encoder_lens must be part of the CUDA graph, because full_text_row_masked_out_mask is derived from encoder_lens to skip the text-only requests in a mixed batch.

However, the current CUDA graph backend (flashinfer) seems to have trouble handling mixed batches, so for now we only accept pure image decoding batches.
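
For illustration, a minimal sketch (hypothetical helper, not the actual SGLang code) of how a mask like full_text_row_masked_out_mask could be derived from encoder_lens, so that text-only requests in a mixed batch have their cross-attention output rows zeroed out:

```python
# Hedged sketch, assuming encoder_lens[i] == 0 marks a text-only request.
import torch

def build_full_text_row_mask(encoder_lens: torch.Tensor,
                             seq_lens: torch.Tensor) -> torch.Tensor:
    """Return a (total_tokens, 1) mask: 1.0 keeps the cross-attention
    output for that token's request, 0.0 masks it out (text-only)."""
    per_req = (encoder_lens > 0).to(torch.float32)        # (batch,)
    mask = torch.repeat_interleave(per_req, seq_lens)     # (sum(seq_lens),)
    return mask.unsqueeze(-1)                             # (tokens, 1)

# Example: a mixed decode batch of 3 requests, the middle one text-only.
encoder_lens = torch.tensor([6404, 0, 6404])
seq_lens = torch.tensor([1, 1, 1])  # one token per request at decode time
print(build_full_text_row_mask(encoder_lens, seq_lens))
# tensor([[1.], [0.], [1.]])
```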

To do in follow-up PRs:

  • Split the attention backends: sliding_window, single_attention, cross_attention.
  • Optimize encoder cache location indexing and reduce memory usage.

Modifications

Checklist

  • Format your code according to the Contributor Guide.
  • Add unit tests as outlined in the Contributor Guide.
  • Update documentation as needed, including docstrings or example tutorials.

@hnyls2002 hnyls2002 marked this pull request as draft October 1, 2024 06:46
@hnyls2002 hnyls2002 force-pushed the llama-3.2 branch 2 times, most recently from 00cd46a to 2aebd9f on October 1, 2024 08:12
python/pyproject.toml: review comment (outdated, resolved)
@hnyls2002 hnyls2002 marked this pull request as ready for review October 21, 2024 03:52
@hnyls2002 hnyls2002 merged commit 94cde10 into main Oct 21, 2024
11 checks passed
@hnyls2002 hnyls2002 deleted the llama-3.2 branch October 21, 2024 22:01
@@ -154,7 +160,7 @@ def get_kv_buffer(self, layer_id: int) -> Tuple[torch.Tensor, torch.Tensor]:

     def set_kv_buffer(
         self,
-        layer_id: int,
+        layer: RadixAttention,
Review comment: Why is it necessary to change the data type from int to RadixAttention here?
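
For context only, a simplified hypothetical sketch (not sglang's actual implementation): one plausible benefit of passing the layer object is that the KV pool can read per-layer attributes, such as layer_id and potentially per-layer quantization scales, from the RadixAttention instance instead of receiving only a bare integer.

```python
# Hedged sketch with hypothetical names; RadixAttention and KVPool below
# are toy stand-ins, not the real sglang classes.
import torch

class RadixAttention:
    """Stand-in for sglang's RadixAttention; only the fields used below."""
    def __init__(self, layer_id: int):
        self.layer_id = layer_id
        self.k_scale = 1.0  # hypothetical per-layer scale (e.g. FP8 KV cache)
        self.v_scale = 1.0

class KVPool:
    """Toy KV cache pool with one K/V buffer per layer."""
    def __init__(self, num_layers: int, size: int, head_num: int, head_dim: int):
        self.k_buffer = [torch.zeros(size, head_num, head_dim) for _ in range(num_layers)]
        self.v_buffer = [torch.zeros(size, head_num, head_dim) for _ in range(num_layers)]

    # New-style signature: takes the layer object, not a bare layer_id.
    def set_kv_buffer(self, layer: RadixAttention, loc: torch.Tensor,
                      cache_k: torch.Tensor, cache_v: torch.Tensor) -> None:
        i = layer.layer_id
        self.k_buffer[i][loc] = cache_k * layer.k_scale
        self.v_buffer[i][loc] = cache_v * layer.v_scale

# Usage example
pool = KVPool(num_layers=2, size=16, head_num=8, head_dim=64)
layer0 = RadixAttention(layer_id=0)
loc = torch.tensor([3, 4])
pool.set_kv_buffer(layer0, loc, torch.randn(2, 8, 64), torch.randn(2, 8, 64))
```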
