
Llama3.2 vision model support #1551

Merged: 48 commits merged from llama-3.2 into main on Oct 21, 2024

Conversation

hnyls2002 (Collaborator) commented on Oct 1, 2024

Motivation

  • Support the encoder-decoder architecture in SGLang.
  • Support the Llama vision model.
  • Support CUDA graph and prefix cache for the Llama vision model.

Note that to support CUDA graph for an encoder-decoder architecture like Llama vision (mllama), encoder_lens must be part of the CUDA graph, because full_text_row_masked_out_mask is derived from encoder_lens to skip the text-only requests in a mixed batch.

However, the current CUDA graph backend (flashinfer) seems to have trouble handling mixed batches, so for now we only accept pure image decoding batches.
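
For illustration, a minimal sketch (hypothetical helper, not the actual SGLang code) of how a mask like full_text_row_masked_out_mask could be derived from encoder_lens, so that text-only requests in a mixed batch have their cross-attention output rows zeroed out:

```python
# Hedged sketch, assuming encoder_lens[i] == 0 marks a text-only request.
import torch

def build_full_text_row_mask(encoder_lens: torch.Tensor,
                             seq_lens: torch.Tensor) -> torch.Tensor:
    """Return a (total_tokens, 1) mask: 1.0 keeps the cross-attention
    output for that token's request, 0.0 masks it out (text-only)."""
    per_req = (encoder_lens > 0).to(torch.float32)        # (batch,)
    mask = torch.repeat_interleave(per_req, seq_lens)     # (sum(seq_lens),)
    return mask.unsqueeze(-1)                             # (tokens, 1)

# Example: a mixed decode batch of 3 requests, the middle one text-only.
encoder_lens = torch.tensor([6404, 0, 6404])
seq_lens = torch.tensor([1, 1, 1])  # one token per request at decode time
print(build_full_text_row_mask(encoder_lens, seq_lens))
# tensor([[1.], [0.], [1.]])
```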

To do in follow-up PRs:

  • Split the attention backends: sliding_window, single_attention, cross_attention.
  • Optimize encoder cache location indexing and reduce memory usage.

Modifications

Checklist

  • Format your code according to the Contributor Guide.
  • Add unit tests as outlined in the Contributor Guide.
  • Update documentation as needed, including docstrings or example tutorials.

@hnyls2002 hnyls2002 marked this pull request as draft October 1, 2024 06:46
@hnyls2002 hnyls2002 force-pushed the llama-3.2 branch 2 times, most recently from 00cd46a to 2aebd9f on October 1, 2024 08:12
python/pyproject.toml: review comment (outdated, resolved)
@hnyls2002 hnyls2002 marked this pull request as ready for review October 21, 2024 03:52
@hnyls2002 hnyls2002 merged commit 94cde10 into main Oct 21, 2024
11 checks passed
@hnyls2002 hnyls2002 deleted the llama-3.2 branch October 21, 2024 22:01
@@ -154,7 +160,7 @@ def get_kv_buffer(self, layer_id: int) -> Tuple[torch.Tensor, torch.Tensor]:

     def set_kv_buffer(
         self,
-        layer_id: int,
+        layer: RadixAttention,
Review comment: Why is it necessary to change the data type from int to RadixAttention here?
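
For context only, a simplified hypothetical sketch (not sglang's actual implementation): one plausible benefit of passing the layer object is that the KV pool can read per-layer attributes, such as layer_id and potentially per-layer quantization scales, from the RadixAttention instance instead of receiving only a bare integer.

```python
# Hedged sketch with hypothetical names; RadixAttention and KVPool below
# are toy stand-ins, not the real sglang classes.
import torch

class RadixAttention:
    """Stand-in for sglang's RadixAttention; only the fields used below."""
    def __init__(self, layer_id: int):
        self.layer_id = layer_id
        self.k_scale = 1.0  # hypothetical per-layer scale (e.g. FP8 KV cache)
        self.v_scale = 1.0

class KVPool:
    """Toy KV cache pool with one K/V buffer per layer."""
    def __init__(self, num_layers: int, size: int, head_num: int, head_dim: int):
        self.k_buffer = [torch.zeros(size, head_num, head_dim) for _ in range(num_layers)]
        self.v_buffer = [torch.zeros(size, head_num, head_dim) for _ in range(num_layers)]

    # New-style signature: takes the layer object, not a bare layer_id.
    def set_kv_buffer(self, layer: RadixAttention, loc: torch.Tensor,
                      cache_k: torch.Tensor, cache_v: torch.Tensor) -> None:
        i = layer.layer_id
        self.k_buffer[i][loc] = cache_k * layer.k_scale
        self.v_buffer[i][loc] = cache_v * layer.v_scale

# Usage example
pool = KVPool(num_layers=2, size=16, head_num=8, head_dim=64)
layer0 = RadixAttention(layer_id=0)
loc = torch.tensor([3, 4])
pool.set_kv_buffer(layer0, loc, torch.randn(2, 8, 64), torch.randn(2, 8, 64))
```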
