Sgl project main #10

Merged 99 commits on Oct 23, 2024
Commits
01fdb2f
Fix test_vision_openai_server on CI (#1620)
ByronHsu Oct 10, 2024
e11ab79
[Performance, hardware] MoE tuning update to AMD MI300x GPUs (#1619)
HaiShaw Oct 11, 2024
c9e6658
Update README.md (#1625)
kushal34712 Oct 11, 2024
b040ed7
Update README.md (#1629)
merrymercy Oct 11, 2024
5476cca
Update README.md
merrymercy Oct 11, 2024
8275049
Add device support (#1607)
liangan1 Oct 11, 2024
58093b8
Nit about the decorator of `PortArgs.init_new` (#1611)
glen-amd Oct 11, 2024
b503881
[Bug] Fix the Image Input of Batch Generation (#1579)
OBJECT907 Oct 11, 2024
bbd72bf
Add the ability to enable and disable the Profiler via HTTP API. (#1626)
Abatom Oct 11, 2024
aba9eae
Fix the correctness test in bench_latency.py when tp > 1 and test_gen…
merrymercy Oct 11, 2024
f13d86f
Add image_token in conversation.py (#1632)
merrymercy Oct 11, 2024
81c3327
Added a "Back To Top" Button (#1633)
JanumalaAkhilendra Oct 11, 2024
5d09ca5
Fix constrained decoding (#1634)
merrymercy Oct 11, 2024
23cc66f
Add back data parallelism (#1635)
merrymercy Oct 11, 2024
00c7e63
Release v0.3.3.post1 (#1636)
merrymercy Oct 11, 2024
862cd26
[engine] support async and streaming (#1614)
ByronHsu Oct 11, 2024
dafb6a5
[Fix] Fix the style of test_large_max_new_tokens.py (#1638)
merrymercy Oct 11, 2024
1d9deea
fix missing ignore_eos in v1/chat/completions (#1642)
learninmou Oct 12, 2024
e37cdab
Fix ignore_eos (#1645)
merrymercy Oct 12, 2024
5d638c9
[Feature, Hardware] Enable SGLang on XPU GPUs via PyTorch (#1480)
liangan1 Oct 12, 2024
69aa937
Fix unit tests and type annotations (#1648)
merrymercy Oct 12, 2024
9da5a60
Add an option to disable penalizer (#1651)
merrymercy Oct 13, 2024
31fad29
Add get_tokenizer function for Engine class (#1653)
pjyi2147 Oct 13, 2024
9610fcd
Fix the batch_is_full check for jump-forward decoding (#1654)
merrymercy Oct 13, 2024
7ee6c25
Simplify the event loop and expose `--num-continuous-decode-steps` as…
merrymercy Oct 13, 2024
c3f2fc5
[doc] Add engine section in backend.md (#1656)
ByronHsu Oct 13, 2024
4876117
[Fix] fix eos trim inconsistency (#1650)
Ying1123 Oct 13, 2024
da1ffed
Add output_ids into ScheduleBatch (#1659)
merrymercy Oct 14, 2024
2725f8d
[Minor] Rename no_eos_trim to no_stop_trim (#1661)
Ying1123 Oct 14, 2024
869f1c0
Add a test case to test retract (#1662)
merrymercy Oct 14, 2024
0c1e879
Move filter_batch out of stream_output (#1663)
merrymercy Oct 14, 2024
061e546
Support double sparsity (#1459)
andy-yang-1 Oct 14, 2024
6790240
Fix unit test order to balance the tasks in CI (#1665)
merrymercy Oct 14, 2024
24f3e15
[Minor] Improve style (#1666)
merrymercy Oct 14, 2024
02bc957
Simplify chunked prefill (#1667)
merrymercy Oct 14, 2024
56503d9
[1/N] Remove `CacheConfig` import in all model files (#1658)
ByronHsu Oct 14, 2024
cd0be74
[doc] improve engine doc and add to readme (#1670)
ByronHsu Oct 15, 2024
4a292f6
[Minor] Add some utility functions (#1671)
merrymercy Oct 15, 2024
175afed
Improve benchmark scripts (#1672)
merrymercy Oct 15, 2024
f1088e0
Fix memory leak during abort (#1674)
merrymercy Oct 15, 2024
b6b4094
Fix filter_batch function call (#1681)
hnyls2002 Oct 16, 2024
a5114b6
Add OLMo model (#1676)
janimo Oct 16, 2024
9116b28
Add a new event loop (#1677)
merrymercy Oct 16, 2024
d10b933
Fix srt dependency (#1685)
ispobock Oct 16, 2024
e4b367b
[Event] Add online meetup meeting link (#1686)
Ying1123 Oct 16, 2024
dbec2f1
Launch a thread to overlap CPU and GPU (#1687)
merrymercy Oct 16, 2024
ecb8bad
Returning a per request metric for number of cached_tokens read (#1599)
havetc Oct 16, 2024
b0facb3
add orjson for jsonresponse (#1688)
michaelfeil Oct 17, 2024
d19cc0b
Update README.md (#1689)
merrymercy Oct 17, 2024
2782132
Add date to logging messages (#1623) (#1679)
zeng-zc Oct 17, 2024
02f7f3e
Update the transformers version in CI (#1690)
merrymercy Oct 17, 2024
5ab20cc
Use SGLang imports for linear layer (#1696)
janimo Oct 17, 2024
b170930
feat: radix tree code optimize (#1697)
wxsms Oct 17, 2024
e5db40d
ORJson. Faster Json serialization (#1694)
michaelfeil Oct 17, 2024
30ee363
Fix the failed unit tests (#1699)
merrymercy Oct 17, 2024
7feba41
Fix failed ci tests on long prompts; Better error messages for embedd…
merrymercy Oct 17, 2024
dd3809f
Fix engine unit test (#1701)
merrymercy Oct 17, 2024
d17d19e
Fix mixed batch for multi modal models (#1702)
merrymercy Oct 17, 2024
a95d558
Add matched_stop token or str to distinguish between eos or stop str …
g-drozdov Oct 17, 2024
9e0dac1
Fix regex and logprob conflicts when chunked prefilling (#1703)
hnyls2002 Oct 18, 2024
6d0fa73
Simplify flashinfer utilities (#1704)
merrymercy Oct 18, 2024
392f286
Add dtype for more operations (#1705)
merrymercy Oct 18, 2024
bc12d40
Add grouped free operations (#1706)
merrymercy Oct 18, 2024
2bcfba1
Skip unnecessary penalizer (#1707)
merrymercy Oct 19, 2024
f0f8a76
Simplify the nan detection and greedy check in sampler (#1709)
merrymercy Oct 19, 2024
3db43d1
Fix `is_all_ready` for overlap copy (#1710)
merrymercy Oct 19, 2024
769bf11
Fix the race condition in overlap mode (#1712)
merrymercy Oct 19, 2024
736f040
Update README.md (#1713)
merrymercy Oct 19, 2024
087257e
Release v0.3.4 (#1714)
merrymercy Oct 19, 2024
b6cd903
Update readme and workflow (#1716)
merrymercy Oct 19, 2024
12cad0f
Simplify the interface of tp_worker (#1718)
merrymercy Oct 20, 2024
8bee20f
Update vllm to 0.6.3 (#1711) (#1720)
zhyncs Oct 20, 2024
cbbc82b
Support qwen2 vl model (#1721)
zhyncs Oct 20, 2024
5c4ce65
Update README.md (#1722)
Ying1123 Oct 20, 2024
9594627
Update README.md
Ying1123 Oct 20, 2024
59cbf47
Unify the memory pool api and tp worker API (#1724)
merrymercy Oct 20, 2024
593b19f
Temporarily skip this test_mixed_batch for QWen2VL (#1725)
merrymercy Oct 20, 2024
b48edff
Split the overlapped version of TpModelWorkerClient into a separate f…
merrymercy Oct 20, 2024
554fbf9
[Bugfix] qwen2vl forward_extend (#1727)
yizhang2077 Oct 20, 2024
e12358d
Simplify the usage of device (#1734)
merrymercy Oct 21, 2024
b121bc0
Simplify batch result resolution (#1735)
merrymercy Oct 21, 2024
45d5af2
Add GLM-4 TextGeneration Model support for SGLang (#1736)
sixsixcoder Oct 21, 2024
cf470fe
Make token mapping non-blocking in the overlapped mode (#1740)
merrymercy Oct 21, 2024
09603c6
Maintain seq_lens_sum to make more FlashInfer operations non-blocking…
merrymercy Oct 21, 2024
efb099c
Fix prefill oom (#1743)
hnyls2002 Oct 21, 2024
7ce3606
Faster overlap mode scheduler (#1738)
merrymercy Oct 21, 2024
e68b9e7
misc: add CODEOWNERS (#1737)
zhyncs Oct 21, 2024
0061128
Fix sliding window attention and gemma-2 unit tests in CI (#1746)
merrymercy Oct 21, 2024
94cde10
Llama3.2 vision model support (#1551)
hnyls2002 Oct 21, 2024
5e1558f
Update `max_req_len` and `max_req_input_len` (#1748)
hnyls2002 Oct 21, 2024
1f26e8b
Release v0.3.4.post1 (#1749)
merrymercy Oct 22, 2024
17536e7
Fix edge case for truncated (#1747)
ByronHsu Oct 23, 2024
ad4125d
Fuse more ops & Simplify token mapping (#1758)
merrymercy Oct 23, 2024
2fce449
[API] add get memory pool size (#1760)
Ying1123 Oct 23, 2024
fbcbb26
Fix perf regression for set_kv_buffer (#1765)
merrymercy Oct 23, 2024
9af7b88
[Fix] Fix abort in dp (#1767)
merrymercy Oct 23, 2024
80a9054
Fix stop condition for <|eom_id|> (#1766)
merrymercy Oct 23, 2024
b7d0559
Update docs (#1768)
merrymercy Oct 23, 2024
9441d92
update
81549361 Oct 23, 2024
13 changes: 13 additions & 0 deletions .github/CODEOWNERS
@@ -0,0 +1,13 @@
/python/sglang/lang @merrymercy @Ying1123 @hnyls2002 @ByronHsu
/python/sglang/srt @merrymercy @Ying1123 @hnyls2002 @zhyncs @ispobock @ByronHsu
/python/sglang/srt/constrained @hnyls2002
/python/sglang/srt/layers @merrymercy @Ying1123 @zhyncs @ispobock
/python/sglang/srt/lora @Ying1123
/python/sglang/srt/managers @merrymercy @Ying1123 @hnyls2002
/python/sglang/srt/mem_cache @merrymercy @Ying1123 @hnyls2002
/python/sglang/srt/model_executor @merrymercy @Ying1123 @hnyls2002 @zhyncs @ispobock
/python/sglang/srt/models @merrymercy @Ying1123 @hnyls2002 @zhyncs @ispobock @ByronHsu
/python/sglang/srt/openai_api @merrymercy @Ying1123 @hnyls2002 @zhyncs @ispobock @ByronHsu
/python/sglang/srt/sampling @merrymercy @hnyls2002
/test/lang @merrymercy @Ying1123 @hnyls2002 @ByronHsu
/test/srt @merrymercy @Ying1123 @hnyls2002 @zhyncs @ispobock @ByronHsu
5 changes: 3 additions & 2 deletions .github/workflows/pr-test-amd.yml
@@ -14,7 +14,7 @@ on:
workflow_dispatch:

concurrency:
group: pr-test-${{ github.ref }}
group: pr-test-amd-${{ github.ref }}
cancel-in-progress: true

jobs:
@@ -28,7 +28,8 @@
- name: Install dependencies
run: |
pip install --upgrade pip
pip install -e "python[all]" --no-deps
pip install -e "python[runtime_common, test]"
pip install -e "python" --no-deps

git clone https://github.com/merrymercy/human-eval.git
cd human-eval
39 changes: 19 additions & 20 deletions .github/workflows/pr-test.yml
@@ -29,7 +29,7 @@ jobs:
run: |
pip install --upgrade pip
pip install -e "python[dev]"
pip install transformers==4.44
pip install transformers==4.45.2
pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.4/ --force-reinstall

- name: Run test
@@ -49,14 +49,14 @@
run: |
pip install --upgrade pip
pip install -e "python[dev]"
pip install transformers==4.44
pip install transformers==4.45.2
pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.4/ --force-reinstall

- name: Run test
timeout-minutes: 20
run: |
cd test/srt
python3 run_suite.py --suite minimal --range-begin 0 --range-end 7
python3 run_suite.py --suite minimal --range-begin 0 --range-end 5

unit-test-backend-part-2:
if: github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request'
@@ -69,14 +69,14 @@
run: |
pip install --upgrade pip
pip install -e "python[dev]"
pip install transformers==4.44
pip install transformers==4.45.2
pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.4/ --force-reinstall

- name: Run test
timeout-minutes: 20
timeout-minutes: 30
run: |
cd test/srt
python3 run_suite.py --suite minimal --range-begin 7 --range-end 14
python3 run_suite.py --suite minimal --range-begin 5 --range-end 17

unit-test-backend-part-3:
if: github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request'
@@ -89,14 +89,14 @@
run: |
pip install --upgrade pip
pip install -e "python[dev]"
pip install transformers==4.44
pip install transformers==4.45.2
pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.4/ --force-reinstall

- name: Run test
timeout-minutes: 20
timeout-minutes: 30
run: |
cd test/srt
python3 run_suite.py --suite minimal --range-begin 14
python3 run_suite.py --suite minimal --range-begin 17

performance-test-1-gpu-part-1:
if: github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request'
@@ -109,7 +109,7 @@
run: |
pip install --upgrade pip
pip install -e "python[all]"
pip install transformers==4.44
pip install transformers==4.45.2
pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.4/ --force-reinstall

- name: Benchmark Single Latency
@@ -147,7 +147,7 @@
run: |
pip install --upgrade pip
pip install -e "python[all]"
pip install transformers==4.44
pip install transformers==4.45.2
pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.4/ --force-reinstall

- name: Benchmark Offline Throughput (w/o RadixAttention)
@@ -179,7 +179,7 @@
run: |
pip install --upgrade pip
pip install -e "python[all]"
pip install transformers==4.44
pip install transformers==4.45.2
pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.4/ --force-reinstall

- name: Benchmark Offline Throughput (TP=2)
@@ -211,7 +211,7 @@
run: |
pip install --upgrade pip
pip install -e "python[all]"
pip install transformers==4.44
pip install transformers==4.45.2
pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.4/ --force-reinstall

git clone https://github.com/merrymercy/human-eval.git
@@ -235,7 +235,7 @@
run: |
pip install --upgrade pip
pip install -e "python[all]"
pip install transformers==4.44
pip install transformers==4.45.2
pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.4/ --force-reinstall

git clone https://github.com/merrymercy/human-eval.git
@@ -255,12 +255,11 @@
python3 test_mla.py
python3 test_mla_fp8.py

# Temporarily disabled
#- name: Evaluate Data Parallelism Accuracy (TP=2)
# timeout-minutes: 10
# run: |
# cd test/srt
# python3 test_data_parallelism.py
- name: Evaluate Data Parallelism Accuracy (DP=2)
timeout-minutes: 10
run: |
cd test/srt
python3 test_data_parallelism.py

finish:
needs: [
74 changes: 55 additions & 19 deletions README.md
@@ -1,5 +1,5 @@
<div align="center">
<img src="https://raw.githubusercontent.com/sgl-project/sglang/main/assets/logo.png" alt="logo" width="400"></img>
<div align="center" id="sglangtop">
<img src="https://raw.githubusercontent.com/sgl-project/sglang/main/assets/logo.png" alt="logo" width="400" margin="10px"></img>

[![PyPI](https://img.shields.io/pypi/v/sglang)](https://pypi.org/project/sglang)
![PyPI - Downloads](https://img.shields.io/pypi/dm/sglang)
@@ -11,20 +11,18 @@

--------------------------------------------------------------------------------

| [**Blog**](https://lmsys.org/blog/2024-07-25-sglang-llama3/) | [**Paper**](https://arxiv.org/abs/2312.07104) | [**Join Slack**](https://join.slack.com/t/sgl-fru7574/shared_invite/zt-2ngly9muu-t37XiH87qvD~6rVBTkTEHw) | [**Join Bi-Weekly Development Meeting (Oct. 19)**](https://calendar.app.google/GYW7S8QGoanCuaxW6) |

## Upcoming Events
- [Oct. 11, 2024] Invited talks at [AMD Advancing AI](https://www.amd.com/en/corporate/events/advancing-ai.html) Developer Day.
- [Oct. 16, 2024] Online meetup for efficient LLM deployment and serving, co-hosted by SGLang, FlashInfer, and MLC LLM! Fill out the [Google form](https://forms.gle/B3YeedLxmrrhL1NM8) to receive the invite link.
| [**Blog**](https://lmsys.org/blog/2024-07-25-sglang-llama3/) | [**Paper**](https://arxiv.org/abs/2312.07104) | [**Slides**](https://github.com/sgl-project/sgl-learning-materials/blob/main/slides/amd_dev_day_v2.pdf) | [**Learn More**](https://github.com/sgl-project/sgl-learning-materials) | [**Join Slack**](https://join.slack.com/t/sgl-fru7574/shared_invite/zt-2ngly9muu-t37XiH87qvD~6rVBTkTEHw) |
[**Join Bi-Weekly Development Meeting**](https://docs.google.com/document/d/1xEow4eIM152xNcRxqZz9VEcOiTQo8-CEuuQ5qTmkt-E/edit?usp=sharing) |

## News
- [2024/09] 🔥 SGLang v0.3 Release: 7x Faster DeepSeek MLA, 1.5x Faster torch.compile, Multi-Image/Video LLaVA-OneVision ([blog](https://lmsys.org/blog/2024-09-04-sglang-v0-3/)).
- [2024/07] 🔥 Faster Llama3 Serving with SGLang Runtime (vs. TensorRT-LLM, vLLM) ([blog](https://lmsys.org/blog/2024-07-25-sglang-llama3/)).
- [2024/02] SGLang enables **3x faster JSON decoding** with compressed finite state machine ([blog](https://lmsys.org/blog/2024-02-05-compressed-fsm/)).
- [2024/10] 🔥 The First SGLang Online Meetup ([slides](https://github.com/sgl-project/sgl-learning-materials?tab=readme-ov-file#the-first-sglang-online-meetup)).
- [2024/09] SGLang v0.3 Release: 7x Faster DeepSeek MLA, 1.5x Faster torch.compile, Multi-Image/Video LLaVA-OneVision ([blog](https://lmsys.org/blog/2024-09-04-sglang-v0-3/)).
- [2024/07] Faster Llama3 Serving with SGLang Runtime (vs. TensorRT-LLM, vLLM) ([blog](https://lmsys.org/blog/2024-07-25-sglang-llama3/)).

<details>
<summary>More</summary>

- [2024/02] SGLang enables **3x faster JSON decoding** with compressed finite state machine ([blog](https://lmsys.org/blog/2024-02-05-compressed-fsm/)).
- [2024/04] SGLang is used by the official **LLaVA-NeXT (video)** release ([blog](https://llava-vl.github.io/blog/2024-04-30-llava-next-video/)).
- [2024/01] SGLang provides up to **5x faster inference** with RadixAttention ([blog](https://lmsys.org/blog/2024-01-17-sglang/)).
- [2024/01] SGLang powers the serving of the official **LLaVA v1.6** release demo ([usage](https://github.com/haotian-liu/LLaVA?tab=readme-ov-file#demo)).
@@ -58,23 +56,27 @@ You can install SGLang using any of the methods below.
pip install --upgrade pip
pip install "sglang[all]"

# Install FlashInfer CUDA kernels
# Install FlashInfer accelerated kernels
pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.4/
```

**Important: Please check the [FlashInfer installation doc](https://docs.flashinfer.ai/installation.html) to install the proper version according to your PyTorch and CUDA versions.**

### Method 2: From source
```
# Use the last release branch
git clone -b v0.3.3 https://github.com/sgl-project/sglang.git
git clone -b v0.3.4.post1 https://github.com/sgl-project/sglang.git
cd sglang

pip install --upgrade pip
pip install -e "python[all]"

# Install FlashInfer CUDA kernels
# Install FlashInfer accelerated kernels
pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.4/
```

**Important: Please check the [FlashInfer installation doc](https://docs.flashinfer.ai/installation.html) to install the proper version according to your PyTorch and CUDA versions.**

### Method 3: Using docker
The docker images are available on Docker Hub as [lmsysorg/sglang](https://hub.docker.com/r/lmsysorg/sglang/tags), built from [Dockerfile](https://github.com/sgl-project/sglang/tree/main/docker).
Replace `<secret>` below with your huggingface hub [token](https://huggingface.co/docs/hub/en/security-tokens).
@@ -228,7 +230,8 @@ python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct
```
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --chunked-prefill-size 4096
```
- To enable torch.compile acceleration, add `--enable-torch-compile`. It accelerates small models on small batch sizes.
- To enable the experimental overlapped scheduler, add `--enable-overlap-scheduler`. It overlaps the CPU scheduler with GPU computation and can accelerate almost all workloads. This does not currently work with constrained decoding.
- To enable torch.compile acceleration, add `--enable-torch-compile`. It accelerates small models on small batch sizes. This does not currently work with FP8.
- To enable torchao quantization, add `--torchao-config int4wo-128`. It supports various quantization strategies.
- To enable fp8 weight quantization, add `--quantization fp8` on a fp16 checkpoint or directly load a fp8 checkpoint without specifying any arguments.
- To enable fp8 kv cache quantization, add `--kv-cache-dtype fp8_e5m2`.
@@ -242,6 +245,35 @@ python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --tp 4 --nccl-init sgl-dev-0:50000 --nnodes 2 --node-rank 1
```

### Engine Without HTTP Server

We also provide an inference engine **without an HTTP server**. For example:

```python
import sglang as sgl

def main():
prompts = [
"Hello, my name is",
"The president of the United States is",
"The capital of France is",
"The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}
llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
print("===============================")
print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

if __name__ == "__main__":
main()
```

This can be used for offline batch inference and building custom servers.
You can view the full example [here](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine).
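
As a concrete illustration of the "building custom servers" use case, the sketch below wraps the engine in a small FastAPI app. This is not part of SGLang or this PR: FastAPI/uvicorn, the `/generate` route, and the request schema are illustrative assumptions; only `sgl.Engine` and `generate()` come from the example above.

```python
# Hypothetical sketch: serve sgl.Engine behind a minimal FastAPI app.
# FastAPI/uvicorn, the route name, and the request schema are assumptions,
# not part of SGLang.
import sglang as sgl
import uvicorn
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")

class GenerateRequest(BaseModel):
    prompt: str
    temperature: float = 0.8
    top_p: float = 0.95

@app.post("/generate")
def generate(req: GenerateRequest):
    sampling_params = {"temperature": req.temperature, "top_p": req.top_p}
    outputs = llm.generate([req.prompt], sampling_params)
    # Each output dict exposes the generated text under "text", as in the example above.
    return {"text": outputs[0]["text"]}

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
```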

### Supported Models

**Generative Models**
@@ -271,6 +303,7 @@ python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct
- MiniCPM / MiniCPM 3
- XVERSE / XVERSE MoE
- SmolLM
- GLM-4

**Embedding Models**

@@ -407,7 +440,6 @@ print(state["answer_1"])
```

#### More Examples

Anthropic and VertexAI (Gemini) models are also supported.
You can find more examples at [examples/quick_start](examples/frontend_language/quick_start).

@@ -578,14 +610,18 @@ def chat_example(s):
- The `regex` argument in `sgl.gen` is implemented through autoregressive decoding with logit bias masking, according to the constraints set by the regex. It is compatible with `temperature=0` and `temperature != 0`.

## Benchmark And Performance
![8b_throughput](https://lmsys.org/images/blog/sglang_llama3/8b_throughput.svg)
![70b_fp8_throughput](https://lmsys.org/images/blog/sglang_llama3/70b_fp8_throughput.svg)

Learn more at this [blog](https://lmsys.org/blog/2024-07-25-sglang-llama3/).
Learn more in our release blogs: [v0.2](https://lmsys.org/blog/2024-07-25-sglang-llama3/), [v0.3](https://lmsys.org/blog/2024-09-04-sglang-v0-3/).

## Roadmap
[Development Roadmap (2024 Q4)](https://github.com/sgl-project/sglang/issues/1487)

## Citation And Acknowledgment
Please cite our paper, [SGLang: Efficient Execution of Structured Language Model Programs](https://arxiv.org/abs/2312.07104), if you find the project useful.
We also learned from the design and reused code from the following projects: [Guidance](https://github.com/guidance-ai/guidance), [vLLM](https://github.com/vllm-project/vllm), [LightLLM](https://github.com/ModelTC/lightllm), [FlashInfer](https://github.com/flashinfer-ai/flashinfer), [Outlines](https://github.com/outlines-dev/outlines), and [LMQL](https://github.com/eth-sri/lmql).


<p align="center">
<a href="#sglangtop" target="_blank">
<bold>Back To Top </bold>
</a>
</p>
2 changes: 1 addition & 1 deletion benchmark/llava_bench/README.md
@@ -17,7 +17,7 @@ pip3 install "torch>=2.1.2" "transformers>=4.36" pillow
### Benchmark sglang
Launch a server
```
python3 -m sglang.launch_server --model-path liuhaotian/llava-v1.5-7b --tokenizer-path llava-hf/llava-1.5-7b-hf --port 30000
python3 -m sglang.launch_server --model-path liuhaotian/llava-v1.6-vicuna-7b --tokenizer-path llava-hf/llava-1.5-7b-hf --port 30000
```

Run benchmark
2 changes: 1 addition & 1 deletion benchmark/llava_bench/bench_sglang.py
@@ -20,7 +20,7 @@ def image_qa(s, image_file, question):


def main(args):
lines = read_jsonl(args.question_file)[: args.num_questions]
lines = list(read_jsonl(args.question_file))[: args.num_questions]
arguments = [
{
"image_file": os.path.abspath(args.image_folder + "/" + l["image"]),
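
The `list(...)` wrapper in this change matters if `read_jsonl` yields rows lazily: a generator cannot be sliced with `[: n]`. A minimal, SGLang-independent sketch of the failure mode and the fix (the lazy reader here is a stand-in, not SGLang's actual `read_jsonl`):

```python
# Stand-in for a lazy JSONL reader; not SGLang's implementation.
import itertools

def lazy_rows():
    yield from range(100)

try:
    lazy_rows()[:10]                 # TypeError: generators are not subscriptable
except TypeError as err:
    print(err)

first = list(lazy_rows())[:10]                        # the fix used here: materialize, then slice
also_first = list(itertools.islice(lazy_rows(), 10))  # alternative that avoids reading everything
print(first == also_first)                            # True
```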
31 changes: 30 additions & 1 deletion docs/en/backend.md
@@ -79,7 +79,7 @@ python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct
```
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --chunked-prefill-size 4096
```
- To enable torch.compile acceleration, add `--enable-torch-compile`. It accelerates small models on small batch sizes.
- To enable torch.compile acceleration, add `--enable-torch-compile`. It accelerates small models on small batch sizes. This does not currently work with FP8.
- To enable torchao quantization, add `--torchao-config int4wo-128`. It supports various quantization strategies.
- To enable fp8 weight quantization, add `--quantization fp8` on a fp16 checkpoint or directly load a fp8 checkpoint without specifying any arguments.
- To enable fp8 kv cache quantization, add `--kv-cache-dtype fp8_e5m2`.
@@ -93,6 +93,35 @@ python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --tp 4 --nccl-init sgl-dev-0:50000 --nnodes 2 --node-rank 1
```

### Engine Without HTTP Server

We also provide an inference engine **without an HTTP server**. For example:

```python
import sglang as sgl

def main():
prompts = [
"Hello, my name is",
"The president of the United States is",
"The capital of France is",
"The future of AI is",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95}
llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
print("===============================")
print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

if __name__ == "__main__":
main()
```

This can be used for offline batch inference and building custom servers.
You can view the full example [here](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine).
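
For the offline batch inference use case, a small sketch of driving the same engine from a JSONL file is shown below; the file names and the `{"prompt": ...}` record layout are placeholders, while `sgl.Engine` and `generate()` are taken from the example above.

```python
# Offline batch inference sketch; prompts.jsonl / outputs.jsonl and the record
# layout are placeholders, not part of SGLang.
import json
import sglang as sgl

llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")

with open("prompts.jsonl") as fin:
    prompts = [json.loads(line)["prompt"] for line in fin]

outputs = llm.generate(prompts, {"temperature": 0.8, "top_p": 0.95})

with open("outputs.jsonl", "w") as fout:
    for prompt, output in zip(prompts, outputs):
        fout.write(json.dumps({"prompt": prompt, "text": output["text"]}) + "\n")
```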

### Supported Models

**Generative Models**
6 changes: 5 additions & 1 deletion docs/en/benchmark_and_profiling.md
@@ -46,4 +46,8 @@ pip install nvtx
import nvtx
with nvtx.annotate("description", color="color"):
# some critical code
```
```

## Other tips

1. You can benchmark a model using dummy weights by providing only the `config.json` file. This allows quick testing of model variants without training them. To do so, add `--load-format dummy` to the commands above; then you only need a correct `config.json` under the checkpoint folder (a scripted sketch of this workflow is shown below).
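
A possible way to script this tip: download only `config.json` from the Hub and launch the server with `--load-format dummy`. The repo id, local folder, and port below are placeholders, and `huggingface_hub` is an assumed dependency rather than something this PR adds.

```python
# Sketch of dummy-weight benchmarking: fetch only config.json, then launch the
# server with --load-format dummy. Repo id, folder, and port are placeholders.
import subprocess
from huggingface_hub import hf_hub_download

ckpt_dir = "./llama3-8b-config-only"
hf_hub_download(
    repo_id="meta-llama/Meta-Llama-3-8B-Instruct",
    filename="config.json",
    local_dir=ckpt_dir,
)

subprocess.run([
    "python3", "-m", "sglang.launch_server",
    "--model-path", ckpt_dir,
    "--load-format", "dummy",
    "--port", "30000",
])
```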
1 change: 0 additions & 1 deletion docs/en/frontend.md
@@ -68,7 +68,6 @@ print(state["answer_1"])
```

#### More Examples

Anthropic and VertexAI (Gemini) models are also supported.
You can find more examples at [examples/quick_start](https://github.com/sgl-project/sglang/tree/main/examples/frontend_language/quick_start).
