Releases: sgl-project/sglang
Release v0.4.1
Highlights
- We're excited to announce SGLang v0.4.1, which now supports DeepSeek V3, currently the strongest open-source LLM, even surpassing GPT-4o.
  The SGLang and DeepSeek teams worked together to get DeepSeek V3 FP8 running on NVIDIA and AMD GPUs from day one. SGLang already supported MLA optimization and DP attention, making it one of the best open-source LLM engines for running DeepSeek models.
  Special thanks to Meituan's Search & Recommend Platform Team @ispobock @HandH1998 and Baseten's Model Performance Team for implementing the model, and DataCrunch for providing GPU resources.
- Various improvements to the cache-aware sglang router, torchao integration, and server termination.
- Added a standalone package, sgl-kernel, for supporting more custom kernels in the code base.
What's Changed
- Adding SGLang FP8 Utils by @HaiShaw in #2348
- docs: add SGLang v0.4 blog by @zhyncs in #2341
- MLA prefill w/o weight absorption by @ispobock in #2349
- Check gpu availability at server args creation by @MrAta in #2340
- minor: limit the range of vllm versions by @zhyncs in #2350
- Fix Docs CI When Compile Error by @zhaochenyang20 in #2323
- Add Docs For SGLang Native Router by @zhaochenyang20 in #2308
- Make torch TP composable with torch.compile by @kwen2501 in #2352
- move apply_torchao_config_ to model_runner by @jerryzh168 in #2342
- [Minor] Code style improvements by @merrymercy in #2355
- Fix AWQ with enable MLA by @ispobock in #2364
- MoE Expert Parallel by @xiaobochen123 in #2371
- Move FP8 to SGLang by @zhyncs in #2370
- optimize cuda graph max_bs_settings on low-end gpus by @BBuf in #2360
- Add more support for intel Gaudi accelerators by @YangQun1 in #2357
- [router] support `/add_worker` api by @ByronHsu in #2369
- docs: update adoption (Meituan) by @zhyncs in #2373
- Use proc.join instead of busy waiting by @merrymercy in #2374
- docs: Improve instructions for supporting new models by @vchzls in #2363
- Fix the overlap for xgrammar by @merrymercy in #2377
- Release v0.4.0.post1 by @merrymercy in #2375
- [Router] remove duplicate char count by @ByronHsu in #2378
- [router] add remove tenant method in the radix tree by @ByronHsu in #2379
- [router] Add remove worker api by @ByronHsu in #2380
- fix: resolve fp8 moe issue by @zhyncs in #2387
- fix: update xgrammar v0.1.6 by @zhyncs in #2390
- Fp8 MoE optimizations on AMD by @HaiShaw in #2388
- minor: update killall script by @zhyncs in #2391
- [router] Health check on worker before added to the router by @ByronHsu in #2392
- Fix shape error that occurred when loading lora weight of gemma2 model. by @upskyy in #2330
- nit: Remove busy waiting on scheduler by @rkooo567 in #2382
- Optimize Triton decoding kernel for long context by @ispobock in #2394
- Update killall_sglang.sh by @merrymercy in #2397
- Remove unused vars in the triton backend by @ispobock in #2401
- Fix a bug with logprob streaming + chunked prefill by @merrymercy in #2403
- fix: specify dtype with begin_forward aka plan by @zhyncs in #2404
- Fix recv_requests by @merrymercy in #2405
- minor: update correct measurement unit by @zhyncs in #2406
- feat: support custom task runner by @zhyncs in #2407
- minor: add random use case by @zhyncs in #2408
- minor: add random flashinfer vs triton use case by @zhyncs in #2409
- Simplify stream_output by @merrymercy in #2398
- [router] Improve cleanup logic by @ByronHsu in #2411
- [Router] fix interrupt from terminal by @ByronHsu in #2413
- [router] defer health checking to router init by @ByronHsu in #2393
- reduce watchdog interval to 5s by @ByronHsu in #2410
- Add a unittest for fused_moe by @BBuf in #2416
- [Minor] Improve code style by @merrymercy in #2419
- [Minor] Improve code style by @merrymercy in #2422
- [feat] Enable chunked prefill for llava-onevision by @Ying1123 in #2412
- Typo fix in router.md by @adarshxs in #2424
- feat: support sgl-kernel PyPI by @zhyncs in #2433
- fix: use manylinux2014_x86_64 tag by @zhyncs in #2434
- fix: compatible with PEP 440 by @zhyncs in #2435
- [router] Refactor: decouple select and send stage by @ByronHsu in #2440
- [router] Use borrow if possible to save cost by @ByronHsu in #2441
- Make torch TP composable with torchao by @kwen2501 in #2436
- chore: update ao v0.7.0 by @zhyncs in #2447
- decoding attention kernel benchmark by @bjmsong in #2425
- Fix model loader for more quantization formats by @merrymercy in #2448
- Fix warmup in bench_offline_throughput.py by @merrymercy in #2449
- Add support for IBM Granite 3.x models by @frreiss in #2437
- [router] Add retries based fault tolerance by @ByronHsu in #2452
- [router] remove main.rs because only lib.rs is used for py binding by @ByronHsu in #2453
- [Core] in batch prefix caching by delay scheduling by @rkooo567 in #2442
- [router] Update doc for dynamic scaling and fault tolerance by @ByronHsu in #2454
- [router] Release router 0.1.0 with dynamic scaling and fault tolerance by @ByronHsu in #2455
- Make request payload size configurable by @MrAta in #2444
- Include version info into the router package by @MrAta in #2456
- Bump sglang-router to 0.1.1 by @MrAta in #2459
- chore: bump v0.0.2 for sgl-kernel by @zhyncs in #2462
- minor: update pypi tag by @zhyncs in #2463
- fix: set runtime path by @zhyncs in #2466
- Rename rust folder to sgl-router by @MrAta in #2464
- feat: support dev image by @zhyncs in #2469
- [Minor] Fix grok model loader by @merrymercy in #2473
- Fix correctness issue for triton decoding kernel by @ispobock in #2479
- format: add clang-format for sgl-kernel by @zhyncs in #2483
- Remove cuda graph batch size adjustment for dp attention by @ispobock in #2484
- hotfix: checking for HIP by @zhyncs in #2485
- sgl-kernel adapt tensorrt llm custom allreduce by @yizhang2077 in #2481
- fix typo by @zhyncs in #2487
- [Benchmark] add a benchmark for hf/vllm/sglang rmsnorm by @BBuf in #2486
- fix moe-ep accuracy issue for fp8 by @xiaobochen123 in #2489
- minor: update flashinfer nightly by @zhyncs in #2490
- Small fixes for torchao quant by @jerryzh168 in #2476
- Simplify pytorch sampling kernel and logit processor by @merrymercy in #2491
- Temporarily disable unit test of torch native attention backe...
Release v0.4.0
Highlights
blog: https://lmsys.org/blog/2024-12-04-sglang-v0-4/
We’re excited to release SGLang v0.4, featuring significant performance improvements and new features:
- Zero-overhead batch scheduler: 1.1x increase in throughput.
- Cache-aware load balancer: up to 1.9x increase in throughput with 3.8x higher cache hit rate.
- Data parallelism attention for DeepSeek models: up to 1.9x decoding throughput improvement.
- Fast structured outputs with xgrammar: up to 10x faster.
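To illustrate the idea behind the cache-aware load balancer highlight above, here is a minimal sketch. It is not the SGLang router implementation: the `Worker` class, the `route` function, and the `min_match_ratio` threshold are hypothetical names made up for this example. The sketch sends a request to the worker that has already seen the longest matching prompt prefix, and falls back to the least-loaded worker when the match is too short to be worth it.

```python
from dataclasses import dataclass, field


def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading characters that a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


@dataclass
class Worker:
    """Hypothetical worker that remembers the prompts routed to it."""
    name: str
    seen: list = field(default_factory=list)

    def best_match(self, prompt: str) -> int:
        # Longest prefix shared with any prompt this worker has handled.
        return max((shared_prefix_len(prompt, p) for p in self.seen), default=0)


def route(workers: list, prompt: str, min_match_ratio: float = 0.5) -> str:
    """Pick a worker name for prompt (illustrative policy only)."""
    best = max(workers, key=lambda w: w.best_match(prompt))
    ratio = best.best_match(prompt) / max(len(prompt), 1)
    if ratio < min_match_ratio:
        # Weak cache match: balance load by request count instead.
        best = min(workers, key=lambda w: len(w.seen))
    best.seen.append(prompt)
    return best.name


workers = [Worker("a"), Worker("b")]
print(route(workers, "System: be brief. Q1"))      # a (no history, first worker)
print(route(workers, "Totally different prompt"))  # b (no match, least loaded)
print(route(workers, "System: be brief. Q2"))      # a (long shared prefix)
```

A production router would typically match on tokens with a tree structure rather than scanning strings, but the routing decision has the same shape: prefer cache reuse, fall back to load balancing.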
What's Changed
- fix: add xgrammar dependency by @zhyncs in #2126
- docs: fix module docstrings and copyright headers by @XuehaiPan in #2077
- feat(pre-commit): trim unnecessary notebook metadata from git history by @XuehaiPan in #2127
- Expose max total num tokens from Runtime & Engine API by @henryhmko in #2092
- Only stream output on tp rank 0 by @merrymercy in #2124
- Revert "Only stream output on tp rank 0" by @merrymercy in #2130
- Add initial support for intel Gaudi accelerators by @ankurneog in #2121
- Add simple CPU offloading support. by @janimo in #2081
- Fix grid size in Triton decoding kernel by @ispobock in #2134
- [CI] Fix test cases by @merrymercy in #2137
- Add concurrency option for benchmark by @cermeng in #2136
- Fix dp print message by @merrymercy in #2138
- fix: resolve bench_serving args by @zhyncs in #2139
- [router] cache-aware load-balancing router v1 by @ByronHsu in #2114
- Bump sglang-router to 0.0.5 by @ByronHsu in #2142
- update router doc by @ByronHsu in #2143
- fix dp_rank env by @ByronHsu in #2144
- Add more api routes (completion, health, etc) to the router by @ByronHsu in #2146
- add prefix match for certain tenant by @ByronHsu in #2147
- Improve sglang router by @ByronHsu in #2148
- Merged three native APIs into one: get_server_info by @henryhmko in #2152
- feat: remove the dependency on FusedMoE by @zhyncs in #2153
- feat: update gitignore and add tuning config for FusedMoE by @zhyncs in #2155
- fix: resolve end-of-file-fixer by @zhyncs in #2157
- Simplify `Scheduler.update_running_batch` by @merrymercy in #2154
- feat: update other MoE models deps by @zhyncs in #2156
- Update CI threshold & Improve code style by @merrymercy in #2159
- fix: use torch.sum for compatible by @zhyncs in #2161
- Fix mixed chunked prefill in overlap mode by @merrymercy in #2158
- Balance CI tests by @merrymercy in #2162
- Rename triton_fused_moe -> fused_moe_triton by @merrymercy in #2163
- Fix docs by @merrymercy in #2164
- [Fused moe] add tuning fused configs for qwen2 57b and mixtral 8x7b by @BBuf in #2167
- Allow overwrite flashinfer use_tensorcore by @merrymercy in #2169
- Replace prob based with threshold based load balancing by @ByronHsu in #2170
- feat: fused_moe fp8 monkey patch by @zhyncs in #2174
- [Fix] Avoid calling fill_vocab_mask for terminated requests by @Ubospica in #2175
- [CI] Split test cases in CI for better load balancing by @merrymercy in #2180
- Bump rustls from 0.23.16 to 0.23.18 in /rust by @dependabot in #2182
- [feat] Refactor session control interface and add CI by @Ying1123 in #2173
- [router] Replace print with logger by @ByronHsu in #2183
- Use custom allreduce w/ torch.compile by @merrymercy in #2185
- [Performance]: Process affinity to CPU cores with multiple sockets support by @HaiShaw in #2171
- Update CI threshold by @merrymercy in #2186
- Update XGrammar to the latest API by @Ubospica in #2176
- [router] Rust e2e test by @ByronHsu in #2184
- Input_embeds support by @RinRin-32 in #2052
- [CI] Minor fix for CI by @merrymercy in #2187
- Rename double sparsity config file by @merrymercy in #2188
- Release v0.3.6.post1 by @merrymercy in #2189
- Update sampler.py to skip the success check by @merrymercy in #2197
- remove unused imports by @WrRan in #2195
- Remove unresolved reference 'self' by @apemost in #2198
- using `is not` not `!=` to test `None` by @WrRan in #2196
- fix: add cuda-python for xgrammar by @zhyncs in #2199
- minor: update check_env by @zhyncs in #2201
- add sglang version to get_server_info by @binarycrayon in #2206
- docs: update adoption by @zhyncs in #2204
- Bump router to 0.0.9 with better logging by @ByronHsu in #2207
- Fix rust warning by @ByronHsu in #2208
- Fix flasky tests by @merrymercy in #2212
- [feat] Support session control for vision language models by @Ying1123 in #2210
- Use an env var SGLANG_SET_CPU_AFFINITY to set cpu affinity; turn it off by default by @merrymercy in #2217
- Revert "Use an env var SGLANG_SET_CPU_AFFINITY to set cpu affinity; turn it off by default" by @merrymercy in #2221
- Use an env var SGLANG_SET_CPU_AFFINITY to set cpu affinity; turn it off by default by @merrymercy in #2222
- Release v0.3.6.post2 by @merrymercy in #2214
- Rename DP_RANK to SGLANG_DP_RANK by @merrymercy in #2218
- [3rdparty, document] Updated Documentation that for triton fused_moe kernel tuning for AMD Instinct GPUs by @kkHuang-amd in #2191
- Bump sglang-router to 0.0.10 for env name change by @ByronHsu in #2226
- fix typo prompts by @qibaoyuan in #2224
- Remove fused_moe_grok by @merrymercy in #2223
- add profile in offline benchmark & update doc by @bjmsong in #2123
- Rename tuned MI300X config files for fused_moe_triton by @HaiShaw in #2228
- Update Install Method 2. From source by @HaiShaw in #2232
- Fix chunked prefill size for bench_offline_throughput by @merrymercy in #2234
- Disable overlap scheduler for multimodal models by @merrymercy in #2235
- Add OLMo2 model. by @janimo in #2233
- Crash the server correctly during error by @merrymercy in #2231
- Fix memory leak during abort by @merrymercy in #2238
- fix missing launch server import by @qeternity in #2242
- [fix] Fix prefix caching for multi-image/video by @Ying1123 in #2239
- Update backend.md by @merrymercy in #2250
- Update backend.md by @merrymercy in #2251
- Revert "Add simple CPU offloading support" by @Ying1123 in #2252
- Revert "Revert "Add simple CPU offloading support"" by @Ying1123 in #2253
- Simplify tokenizer manager by @merrymercy in #2254
- Fix hash collision for multi modal models by @merrymercy in #2256
- [Minor] fix the style for multimodal models by @merrymercy in #2257
- chore: bump v0.3.6.post3 by @zhyncs in https://github.com/sgl-project/sglang/pul...
Release v0.3.6
Highlights
- Reduce CPU overhead by enabling overlap scheduler by default. 1.1x higher throughput. (#2105, #2067, #2095)
- Support data parallelism for attention and MLA. 1.5x higher decoding throughput. (#1970, #2061)
- Cache-aware load balancer. 4x higher cache hit rate (#1934)
- Support xgrammar backend for grammar-guided decoding (#2056)
- Support Prometheus metrics (#1853, #1981)
- Support torch 2.5.1 (#2069) and torch-native tensor parallelism (#1876)
- Support graceful termination (#1838) and watchdog (#1816)
- Support notebook-style documentation (https://sgl-project.github.io/)
- Add an offline benchmark script (#1968)
- Bug, deadlock, NaN, and OOM fixes (#2083, #1850, #1800, #1779, #1789, #1858)
- New models: Phi3-small (#2062), Gemma-2 reward model (#1954), GPT-2 (#1833)
What's Changed
- Fix edge case for truncated by @ByronHsu in #1747
- Fuse more ops & Simplify token mapping by @merrymercy in #1758
- [API] add get memory pool size by @Ying1123 in #1760
- Fix perf regression for set_kv_buffer by @merrymercy in #1765
- [Fix] Fix abort in data parallelism by @merrymercy in #1767
- Fix stop condition for <|eom_id|> by @merrymercy in #1766
- Update docs by @merrymercy in #1768
- Fix missing additional_stop_token_ids by @merrymercy in #1769
- Fix out of memory message. by @hnyls2002 in #1771
- Crash the server on warnings in CI by @merrymercy in #1772
- Fix the perf regression due to additional_stop_token_ids by @merrymercy in #1773
- Fix MockTokenizer in the unit tests by @merrymercy in #1774
- [Bug] Catch any errors caused by parsing json schema by @zolinthecow in #1776
- [Fix] Fix NaN issues by fixing the cuda graph padding values for flashinfer by @merrymercy in #1779
- [Fix] Fix cuda graph padding for triton attention backend by @merrymercy in #1782
- check user-specified model_max_len with hf derived max_model_len by @BBuf in #1778
- Re-introduce `get_cuda_graph_seq_len_fill_value` by @merrymercy in #1783
- Enhance the test case for chunked prefill and check memory leak by @merrymercy in #1785
- Fix seq_lens_sum for cuda graph runner in padded cases by @merrymercy in #1789
- Qwen2vl support cuda graph and disable radix cache by @yizhang2077 in #1780
- Fix log parsing in the chunked prefill unit tests by @merrymercy in #1793
- Fix memory leak when doing chunked prefill by @hnyls2002 in #1787
- [Fix] Fix the log parsing in chunked prefill uni tests by @merrymercy in #1794
- Revert "Fix memory leak when doing chunked prefill" by @merrymercy in #1797
- Fix logprob in the overlapped mode by @merrymercy in #1795
- Release v0.3.4.post2 by @merrymercy in #1796
- [Performance] Support both xgrammar and outlines for constrained decoding by @DarkSharpness in #1752
- [Fix] Fix --skip-tokenizer-init by @merrymercy in #1798
- move max_position_embeddings to the last by @hliuca in #1799
- add support for ipynb by @zhaochenyang20 in #1786
- Fix possible ZMQ hanging by @hnyls2002 in #1800
- Set `ZMQ` buffer size heuristic by @hnyls2002 in #1801
- Allow consecutive ports when launching multiple sglang servers. by @hnyls2002 in #1802
- fix int conversion for `SGLANG_CPU_COUNT` by @ByronHsu in #1803
- Update ci workflows by @merrymercy in #1804
- Update links by @merrymercy in #1805
- Simplify our docs with complicated functions into utils by @zhaochenyang20 in #1807
- Fix docs ci by @zhaochenyang20 in #1808
- Provide an argument to set the maximum batch size for cuda graph by @merrymercy in #1809
- Improve the user control of new_token_ratio by @merrymercy in #1811
- Update hyperparameter_tuning.md by @merrymercy in #1813
- Add a watch dog thread by @merrymercy in #1816
- Fix unit tests by @merrymercy in #1817
- Add openAI compatible API by @zhaochenyang20 in #1810
- Fix Triton decode kernel & ut by @ispobock in #1819
- support token ids in `engine.generate` by @ByronHsu in #1820
- Fix docs deploy ci by @zhaochenyang20 in #1821
- [router] rust-based router by @ByronHsu in #1790
- Fix update_weights deadlock for DP by @ByronHsu in #1825
- fix get_memory_pool_size deadlock for DP by @ByronHsu in #1830
- Support setting `use_thread` in the `run_program` for easier debugging. by @liuyanyi in #1823
- [3rdparty, document] Add 3rdparty/amd, with profiling and tuning instructions to be added by @HaiShaw in #1822
- stop_str of qwen2-vl template should be a tuple not a str by @yizhang2077 in #1834
- [FP8 KV Cache, Mixtral] Avoid KeyError at loading pre-quantized FP8 m… by @HaiShaw in #1835
- Gpt2 by @DanielC12321 in #1833
- Imporve openai api documents by @zhaochenyang20 in #1827
- Update docs by @merrymercy in #1839
- Update README.md by @merrymercy in #1840
- [Production] Drain requests before exit when receive SIGTERM by @Ying1123 in #1838
- [Performance, Hardware] MoE weights padding to AMD MI300x GPUs by @HaiShaw in #1836
- Fix suggest edit by @zhaochenyang20 in #1842
- [Performance, Triton Kernel Args] _decode_grouped_softmax_reducev_fwd… by @HaiShaw in #1845
- Make decode log interval configurable by @ByronHsu in #1847
- Fix mixed chunked prefill by @merrymercy in #1850
- Refactor tokenizer manager by @ByronHsu in #1846
- Simplify documentation by @merrymercy in #1851
- Fix warnings in doc build by @merrymercy in #1852
- delete unused character by @geeker-smallwhite in #1855
- Fix memory leak for chunked prefill 2 by @merrymercy in #1858
- [Build, ROCm] Dockerfile.rocm for Instinct GPUs, with package updates by @HaiShaw in #1861
- Fix retraction + overlap by @hnyls2002 in #1860
- change file tree by @zhaochenyang20 in #1859
- Update vocab embedding deps and add TP switch by @ispobock in #1856
- minor: add human eval by @zhyncs in #1754
- Add vlm document by @zhaochenyang20 in #1866
- minor: update nightly eval by @zhyncs in #1867
- [3rdparty, document] Updated Documentation that covers performance tuning techniques for AMD Instinct GPUs. by @yichiche in #1871
- Improve docs and fix the broken links by @merrymercy in #1875
- Add a FAQ documentation by @merrymercy in #1877
- Update docs title by @merrymercy in #1879
- Update docs and workflow by @merrymercy in #1881
- Fix doc links by @merrymercy in #1882
- Fix incorrect context length for llama3.2-11b by @rchen19 in #1873
- add native api docs by @zhaochenyang20 in #1883
- Update index.rst to improve the order of docs by @merrymercy in #1885
- Native api by...
Release v0.3.4.post1
Highlights
- Hosted the first LMSYS online meetup: Efficient LLM Deployment and Serving.
- Covered CPU overhead hiding, faster constrained decoding, and DeepSeek MLA. Slides
- Added Engine API for offline inference with reduced overhead. Usage. #1614 #1567
- Added an overlap scheduler for reducing CPU overhead #1738
- New models: Llama 3.2 (#1551), QWen-VL2 (#1721), OLMo (#1676), GLM 4 (#1736).
- Added support for reward models #1525.
- Added support for Intel XPU #1480.
- Improved stability for greedy decoding #1589.
- Accelerated multi-LoRA serving #1587.
What's Changed
- [Fix] Ignore model import error by @merrymercy in #1513
- minor: fix config by @hnyls2002 in #1524
- [Event] Update meeting link by @Ying1123 in #1529
- [Feature] Support reward model LxzGordon/URM-LLaMa-3.1-8B by @Ying1123 in #1525
- Add float8 dynamic quant to torchao_utils by @jerryzh168 in #1528
- [FIX] Catch syntax error of Regex Guide to avoid crash by @du00cs in #1521
- [bugfix]Add modelscope package to avoid docker image without modelscope by @KylinMountain in #1520
- Fix RuntimeEndpoint.select method by @jeffrey-fong in #1495
- Multiple minor fixes by @merrymercy in #1530
- Make detokenizer_manager.py not asyncio by @merrymercy in #1532
- Organize image inputs by @hnyls2002 in #1531
- Improve process creation by @merrymercy in #1534
- fix ipv6 url when warm up model by @cauyxy in #1537
- Move scheduler code from tp_worker.py to scheduler.py by @merrymercy in #1538
- Process image in parallel by @hnyls2002 in #1539
- Let ModelRunner take InputMetadata as input, instead of ScheduleBatch by @merrymercy in #1541
- Rename InputMetadata -> ForwardBatch by @merrymercy in #1543
- Clean up batch data structures: Introducing ModelWorkerBatch by @merrymercy in #1544
- [Fix, LoRA] fix LoRA with updates in main by @Ying1123 in #1545
- Organize Attention Backends by @hnyls2002 in #1547
- Fix bugs of `logprobs_nums` by @hnyls2002 in #1548
- Dispatch flashinfer wrappers by @hnyls2002 in #1550
- Simplify flashinfer dispatch by @hnyls2002 in #1552
- [Refactor] Simplify io_struct and tokenizer_manager by @Ying1123 in #1549
- [Performance, Hardware] MoE tuning on AMD MI300x GPUs by @kkHuang-amd in #1554
- [Fix] Fix all the Huggingface paths by @tbarton16 in #1553
- [Fix] do not maintain regex_fsm in SamplingBatchInfo by @merrymercy in #1555
- [Fix] Move ScheduleBatch out of SamplingInfo by @merrymercy in #1556
- Move status check in the memory pool to CPU by @merrymercy in #1557
- [Fix] Fix AttributeError in Qwen2.5 LoRA: 'Qwen2ForCausalLM' object has no attribute 'get_hidden_dim' by @mssongit in #1536
- [FP8 KV Cache] Avoid KeyError at loading pre-quantized FP8 model with kv_scale by @HaiShaw in #1559
- Organize sampling batch info better by @merrymercy in #1562
- Use ipc instead of tcp in zmq by @merrymercy in #1566
- Make input_ids a torch.Tensor by @merrymercy in #1568
- [Minifix] Remove extra space in cot example by @FredericOdermatt in #1569
- [Fix] Fix major performance bug in certain cases by @Ying1123 in #1563
- Refine the add request reasons to avoid corner cases. by @hnyls2002 in #1574
- chore: update README.md by @eltociear in #1580
- [Easy] use .text() instead of .text by @ByronHsu in #1577
- [Event] Update README.md by @Ying1123 in #1572
- Add llama implementation with no tensor parallel linears by @jerryzh168 in #1561
- Backend method not found when SRT Runtime is used by @ByronHsu in #1576
- default sampling param should be deepcopied by @ByronHsu in #1581
- Fix styling by @ByronHsu in #1583
- Fix runtime.generate when sampling param is not passed by @ByronHsu in #1582
- Support min_tokens in sgl.gen by @ByronHsu in #1573
- [Minor] Improve the style and fix flaky tests by @merrymercy in #1584
- [Bug] Fix decode stats error on output_len 1 by @HaiShaw in #1585
- Clean up event loop by @merrymercy in #1586
- [LoRA, Performance] Speedup multi-LoRA serving - Step 1 by @Ying1123 in #1587
- [Minor, Performance] Use torch.argmax for greedy sampling by @Ying1123 in #1589
- Test consistency for single and batch seperately by @ByronHsu in #1590
- Update README.md by @merrymercy in #1591
- Fix modality for image inputs by @merrymercy in #1592
- Provide an offline engine API by @ByronHsu in #1567
- [Fix] Fix the case where prompt_len = 0 by @merrymercy in #1593
- Use `atexit` hook to implicitly shutdown `Runtime` by @ByronHsu in #1595
- Use is_flashinfer_available to replace is_hip for flashinfer check by @merrymercy in #1596
- Fix chunked prefill condition by @ispobock in #1594
- Fix the port_args in bench_latency by @merrymercy in #1597
- Remove references to squeezellm by @janimo in #1603
- [Profile] Add pytorch profiler by @Ying1123 in #1604
- [Engine] Fix generate hanging issue after the first call by @ByronHsu in #1606
- Release v0.3.3 by @merrymercy in #1605
- [Minor] Fix logging typo by @amosyou in #1615
- Fix test_vision_openai_server on CI by @ByronHsu in #1620
- [Performance, hardware] MoE tuning update to AMD MI300x GPUs by @HaiShaw in #1619
- Update README.md by @kushal34712 in #1625
- Update README.md by @merrymercy in #1629
- Add device support by @liangan1 in #1607
- Nit about the decorator of `PortArgs.init_new` by @glen-amd in #1611
- [Bug] Fix the Image Input of Batch Generation by @OBJECT907 in #1579
- Add the ability to enable and disable the Profiler via HTTP API. by @Abatom in #1626
- Fix the correctness test in bench_latency.py when tp > 1 and test_generation_models.py by @merrymercy in #1631
- Add image_token in conversation.py by @merrymercy in #1632
- Added a "Back To Top" Button by @JanumalaAkhilendra in #1633
- Fix constrained decoding by @merrymercy in #1634
- Add back data parallelism by @merrymercy in #1635
- Release v0.3.3.post1 by @merrymercy in #1636
- [engine] support async and streaming by @ByronHsu in #1614
- [Fix] Fix the style of test_large_max_new_tokens.py by @merrymercy in #1638
- fix missing ignore_eos in v1/chat/completions by @learninmou in #1642
- Fix ignore_eos in the OpenAI ChatCompletions API by @merrymercy in #1645
- [Feature, Hardware] Enable SGLang on XPU GPUs via PyTorch by @liangan1 in #1480
- Fix...
Release v0.3.2
Highlight
- Support torch.compile, cuda graph for triton attention backend and DeepSeek MLA #1442 #1422
- Initial support for multi-LoRA serving #1307
- Integrate torchao for quantization #1341
- Optimize the CPU scheduler overhead
- Multiple critical bug fixes for llama and llava (tokenizer, modality)
- Support AMD backend #1420
- New models: MiniCPM3, OLMoE
What's Changed
- Remove useless fields in global_config.py by @merrymercy in #1328
- docs: update README by @zhyncs in #1336
- docs: highlight ttft itl and throughput by @zhyncs in #1337
- docs: add conclusion by @zhyncs in #1340
- Optimize schedule by @hnyls2002 in #1339
- Fix some online scheduling delay by @hnyls2002 in #1345
- [triton] Support head_dim not 2^n in triton extend and decode attention by @ByronHsu in #1281
- [Feat] Add modalities for vision server when handling pixel values for llava by @kcz358 in #1346
- [server] Passing `model_override_args` to `launch_server` via the CLI. by @kevin85421 in #1298
- [Minor] Many cleanup by @merrymercy in #1357
- Add torchao quant (int4/int8/fp8) to llama models by @jerryzh168 in #1341
- [CI] Return output logprobs in unit test by @Ying1123 in #1361
- Unify forward mode by @hnyls2002 in #1360
- Support OpenAI API json_schema response format by @zifeitong in #1363
- Adding Documentation for installation by @zhaochenyang20 in #1300
- [Docs] Improve documentations by @merrymercy in #1368
- fix bug of `undefined is_single` in meth `create_abort_task` by @wcsjtu in #1370
- Support MiniCPM3 by @Achazwl in #1371
- Fix CORS compatibility with OpenAI, vLLM, TGI, LMDeploy by @josephrocca in #1373
- [Minor] improve kill scripts and torchao import by @merrymercy in #1375
- Fix vocab mask update bug by @hnyls2002 in #1376
- [Minor] move triton attention kernels into a separate folder by @merrymercy in #1379
- Deprecate --disable-flashinfer and introduce --attention-backend by @merrymercy in #1380
- Organize flashinfer indices update by @hnyls2002 in #1378
- remove assertion in triton attention and add an unit test by @ByronHsu in #1385
- BaiChuan2 Model by @blacker521 in #1367
- [Fix] Fix --disable-flashinfer by @merrymercy in #1389
- Improve error reporting during server launch by @merrymercy in #1390
- Refactor attention backend by @merrymercy in #1381
- Add no commit to main rule by @hnyls2002 in #1393
- Fix README format by @Achazwl in #1399
- Support cuda graph in the triton attention backend by @merrymercy in #1401
- kernel: use tensor cores for flashinfer gqa kernels by @yzh119 in #1403
- [Minor Fix] Fix llava modalities issue for single-image by @kcz358 in #1402
- Add Support for XVERSE Models (Dense and MoE) to sglang by @hxer7963 in #1397
- [Feature] Initial support for multi-LoRA serving by @Ying1123 in #1307
- [Minor, CI] remove lora test from minimal suite by @Ying1123 in #1406
- Make stop reason a dict instead of str by @merrymercy in #1407
- [CI] Include triton backend and online serving benchmark into CI by @merrymercy in #1408
- [Minor] Raise exception for wrong import by @Ying1123 in #1409
- Balance test in CI by @merrymercy in #1411
- Update pr-test.yml by @merrymercy in #1412
- ci: fix finish by @zhyncs in #1414
- Optimize conflicts between CUDA graph and vocab mask tensors by @hnyls2002 in #1392
- Add torchao quant for mixtral and qwen_moe by @jerryzh168 in #1418
- Add pytorch sampling backend ut by @ispobock in #1425
- fix: resolve nightly eval by @zhyncs in #1426
- Enable torch.compile for triton backend by @merrymercy in #1422
- Add libibverbs-dev to Dockerfile by @Aphoh in #1427
- Update backend.md by @merrymercy in #1429
- [Fix] Fix logprob and normalized_logprob by @merrymercy in #1428
- Release v0.3.1 by @merrymercy in #1430
- Remove deprecated configs by @merrymercy in #1431
- [Feature] Support LoRA path renaming and add LoRA serving benchmarks by @Ying1123 in #1433
- Revert "[Minor] Raise exception for wrong import (#1409)" by @Ying1123 in #1432
- Add constrained_json_whitespace_pattern to ServerArgs by @zifeitong in #1438
- Clean up model loader by @merrymercy in #1440
- Simplify sampler and its error handling by @merrymercy in #1441
- [Feature, Hardware] Enable SGLang on AMD GPUs via PyTorch for ROCm by @HaiShaw in #1420
- Fix torch compile for deepseek-v2 by @ispobock in #1442
- Add OLMoE model by @janimo in #1444
- Release 0.3.1.post1 by @merrymercy in #1445
- Enable MLA by default by @ispobock in #1447
- Fix attention backend by @ispobock in #1448
- fix schedule bug by @hnyls2002 in #1450
- Fix schedule bug by @hnyls2002 in #1451
- Fixed n>1 causing list index out of range with VLM by @jasonyux in #1449
- Add bench_server_latency.py by @merrymercy in #1452
- [Bugfix] Enable SGLang on AMD GPUs via PyTorch for ROCm (#1419) by @HaiShaw in #1453
- Fix oom issues with fp8 for llama by @merrymercy in #1454
- Fuse top_k and top_k in the sampler by @merrymercy in #1457
- [Event] Add public meeting invite to README by @Ying1123 in #1458
- fix: creat new dict everytime for putting new frame by @Luodian in #1464
- Fix padding in the cuda graph by @merrymercy in #1469
- Release v0.3.1.post2 by @merrymercy in #1470
- Fix env vars in bench_latency by @merrymercy in #1472
- feat: update linear deps 1/N by @zhyncs in #1305
- minor: add quant eval compared with base by @zhyncs in #1475
- Add OLMoE by @Muennighoff in #1476
- Fix triton head num by @ispobock in #1482
- Add MLA gsm8k eval by @ispobock in #1484
- chore: bump v0.3.1.post3 by @zhyncs in #1483
- fix incorrect links in documentation by @rchen19 in #1481
- doc: update backend by @zhyncs in #1486
- Better unit tests for adding a new model by @merrymercy in #1488
- Pr fix max workers by @wellhowtosay in #1456
- Add a unit test for data parallelism by @merrymercy in #1489
- Add AMD tests to CI by @Ying1123 in #1491
- Update dockerfile to include datamodel_code_generator by @merrymercy in #1492
- [API, Feature] Support response prefill for openai API by @Ying1123 in #1490
- minor: add mla fp8 test by @zhyncs in #1494
- Fix the overhead due to penalizer in bench_latency by @merrymercy i...
Release v0.3.0
Highlights
Check out the release blog post https://lmsys.org/blog/2024-09-04-sglang-v0-3/ for detailed instructions and descriptions of the items below.
- Up to 7x higher throughput for DeepSeek Multi-head Latent Attention (MLA)
- Up to 1.5x lower latency with torch.compile on small batch sizes
- Support for interleaved text and multi-image/video in LLaVA-OneVision
- Support for interleaved window attention and 2x longer context length in Gemma-2
- Chunked prefill is turned on by default (you can choose to separate or mix prefill and decode).
- Add multi-GPU accuracy, performance test, and nightly accuracy test for more models.
What's Changed
- update hyperparameter guide by @merrymercy in #1114
- ci: compatible with fork repo by @zhyncs in #1115
- fix: resolve Python.h header missing by @zhyncs in #1119
- Fix the deadlock in multi-node tp by @merrymercy in #1122
- Mixed style of chunked prefill by @hnyls2002 in #1013
- Fix port conflicts between local CI and runner CI. by @hnyls2002 in #1131
- Fix CI accuracy && time out limit by @hnyls2002 in #1133
- fix: use fp16 dtype for sm75 by @zhyncs in #1136
- Improve the code style: more comments and remove useless packages by @merrymercy in #1139
- Improve benchmark by @merrymercy in #1140
- Fix duplicated imports in hf_transformers_utils.py by @merrymercy in #1141
- fixed a typo by @min-xu-et in #1143
- [Docs] Add instruction for running on clouds and kubernetes with SkyPilot by @Michaelvll in #1144
- [Feat]Add support for optional start len of logprobs by @yichuan520030910320 in #1035
- Optimize MLA/GQA/MQA Triton decoding by @ispobock in #1138
- feat: allow streaming for multi-prompt and/or parallel sampling by @vhain in #1134
- Improve docs and warnings by @merrymercy in #1164
- [Feature] add disable-custom-all-reduce by @Xu-Chen in #1148
- misc: add hypervisor vendor by @zhyncs in #1165
- support /v1/health using a generation 1 token by @LucienShui in #1154
- fix: resolve README render by @zhyncs in #1166
- [Feat] Support update weights without restart server by @shanyu-sys in #1157
- Improve multi-node stability by @merrymercy in #1171
- fix: custom op fallback forward native when lower sm80 by @zhyncs in #1177
- [Feature] Add a function to convert sampling_params to kwargs by @gryffindor-rr in #1170
- Support min-p sampling by @intervitens in #1167
- [Docs] Fix rendering of details in README by @Michaelvll in #1179
- Improve code style of sampler by @hnyls2002 in #1168
- [Minor] Improve logging and rename the health check endpoint name by @merrymercy in #1180
- Fix broken penalty by @hnyls2002 in #1184
- Fix benchmark script by @Ying1123 in #1185
- [Feat] add llava-onevision, with support for (1) siglip encoder, (2) qwen2 decoder (3) openai api compatible server. by @kcz358 in #1123
- feat: use gelu_tanh_and_mul by @zhyncs in #1193
- Cleanup readme, llava examples, usage examples and nccl init by @merrymercy in #1194
- Update README.md by @merrymercy in #1198
- [CI] Fix the problem of hf runner too slow by @Ying1123 in #1202
- [Fix] the issue of random order when input is a list by @Ying1123 in #1199
- Relax the assert in moe throughput test to fix the flaky CI by @merrymercy in #1207
- [Fix] Fixing the multi-images error for llava-onevision by @kcz358 in #1205
- Support Alibaba-NLP/gte-Qwen2-7B-instruct embedding Model by @zhaochenyang20 in #1186
- [Minor] Improve the function organization in TokenizerManager & improve loggers by @merrymercy in #1208
- [Minor] Temporarily skip flaky test by @Ying1123 in #1209
- [CI] Fix the issue of unit test hanging by @Ying1123 in #1211
- Update CI workflows by @merrymercy in #1210
- Update CI runner docs by @merrymercy in #1213
- [Feature] Support fp8 e5m2 kv cache with flashinfer by @ispobock in #1204
- Update workflow files by @merrymercy in #1214
- improve the threshold and ports in tests by @wisclmy0611 in #1215
- [CI] Fix CI by @wisclmy0611 in #1217
- [Fix] Multi-images loading error by @kcz358 in #1218
- [Minor] improve CI and dependencies by @hnyls2002 in #1212
- [CI] Parallelize unit tests in CI by @wisclmy0611 in #1219
- Move sampler into CUDA graph by @hnyls2002 in #1201
- chore: bump v0.2.14 by @zhyncs in #1155
- [FEAT] JSON constrained support by @havetc in #1125
- Torch compile CI throughput test by @hnyls2002 in #1223
- [FEAT] Support batches cancel by @caiyueliang in #1222
- [Minor] add delete test and delete tmp file on ci server by @yichuan520030910320 in #1227
- [FIX] Wrong logger by @havetc in #1230
- feat: replace get_act_fn for gpt_bigcode by @zhyncs in #1231
- Fix readme by @ArtificialZeng in #1236
- Fix bench latency benchmark by @hnyls2002 in #1225
- [Minor] Add more type annotations by @merrymercy in #1237
- feat: support sm75 with FlashInfer v0.1.6 by @zhyncs in #1233
- Update README.md by @merrymercy in #1239
- hotfix: revert sampler CUDA Graph by @zhyncs in #1242
- Add sglang.bench_latency to CI by @merrymercy in #1243
- fix: increase max_new_tokens when testing generation models by @zhyncs in #1244
- feat: update GemmaRMSNorm by @zhyncs in #1232
- Fix llava on multi images by @merrymercy in #1247
- feat: replace GeluAndMul by @zhyncs in #1234
- fix: resolve qwen2 moe weight loader by @zhyncs in #1252
- chore: bump v0.2.14.post2 by @zhyncs in #1250
- make json_schema usable from gen by @qeternity in #1254
- fix data racing due to mutable reference using deepcopy by @xiezhq-hermann in #1255
- Sampler cudagraph by @hnyls2002 in #1253
- fix: multimodal_config in monkey_patch_vllm_dummy_weight_loader by @lxww302 in #1260
- Transpose mla weight offline by @ispobock in #1261
- EXAONE 3.0 Model Support by @Deepfocused in #1258
- Update README Support Exaone 3.0 by @Deepfocused in #1267
- Report median instead of mean in bench_latency.py by @merrymercy in #1269
- Allow more flexible assistant and system response by @BabyChouSr in #1256
- fix: resolve the fp8 bug introduced by vLLM 0.5.5 by @zhyncs in #1276
- [doc] fix quick start link by @ByronHsu in #1282
- Optimize the update flashinfer indices by @xiaobochen123 in #1262
- [CI] Add more multi-gpu tests by @merrymercy in #1280
- feat: fix fp8 for MLA and support bmm fp8 for DeepSeek V2 by @zhyncs in #1285
- [CI] merge all ci tests into one file by @merrymercy i...
Release v0.2.13
Highlights
- New Feature: Support window attention for Gemma-2 (#1056 #1090 #1112), enable chunked-prefill by default (#1040 #984), support all sampling penalties (#973)
- New Models: Support embedding model e5-mistral (#983 #987 #988 #997 #1014) and comprehensive OpenAI-compatible API.
- Performance: Accelerate Multi-head Latent Attention (MLA). Bring 2x end-to-end improvement on Deepseek v2 (#905).
- More CI Tests: Accuracy test (multiple benchmarks), unit test (APIs, model implementations), E2E test (high pressure test, performance test), MoE test
- Refactor and fix: More modular, better stability, use more kernels from flashinfer (#907)
What's Changed
- fix: set env in runner by @zhyncs in #891
- docs: update setup runner by @zhyncs in #884
- misc: update cuda graph capture exception log by @zhyncs in #894
- chore: add multipart dep for fastapi by @zhyncs in #895
- [minor] fixed code formatting doc by @min-xu-et in #896
- Bump version to 0.2.9.post1 by @Ying1123 in #899
- Update the base image of the docker by @Ying1123 in #900
- Reorder CI unit tests. by @hnyls2002 in #908
- fixed an error handling in bench_latency.py by @min-xu-et in #904
- Add model accuracy test - step 1 by @Ying1123 in #866
- latency test enhancement - part 1 by @min-xu-et in #909
- Improve the structure of CI by @Ying1123 in #911
- fix: use e2e and unit test only for original repo or pr by @zhyncs in #912
- misc: add triton in check_env PACKAGE_LIST by @zhyncs in #914
- Support MLA for DeepSeek-V2 with Triton - step 1 by @ispobock in #905
- enhance latency test - part 2 by @min-xu-et in #915
- Make API Key OpenAI-compatible by @Ying1123 in #917
- Update hyperparameter_tuning.md by @Ying1123 in #918
- Fix CI && python3.8 compatible by @hnyls2002 in #920
- Support more OpenAI API test by @yichuan520030910320 in #916
- Bump version to 0.2.10 by @Ying1123 in #923
- latency test enhancement - final part by @min-xu-et in #921
- Test openai vision api by @Ying1123 in #925
- Test regex in vision api by @Ying1123 in #926
- Update README.md by @Ying1123 in #927
- Fix prompt len in parallel sampling by @yichuan520030910320 in #928
- docs: update README by @zhyncs in #935
- Remove leftover auth_token by @AidanCooper in #934
- Feat: add alternative choices selection methods by @AidanCooper in #835
- Fix union operator by @ispobock in #940
- Support multiple args options by @yichuan520030910320 in #941
- Fix stuck in `get_new_prefill_batch` by @hnyls2002 in #948
- Organize code (rename, movement) by @hnyls2002 in #953
- fix nsys cannot profile cuda kernel by @mpjlu in #957
- Add support for Batch API test by @yichuan520030910320 in #936
- Show more error messages for warmup errors by @Ying1123 in #932
- misc: update issue template by @zhyncs in #963
- misc: simplify test by @yichuan520030910320 in #964
- misc: add compute capability in check_env by @zhyncs in #965
- Make `req_pool_indices` on CPU by @hnyls2002 in #960
- misc: fix the req_to_token member change by @hnyls2002 in #967
- chore: update vllm to 0.5.4 by @zhyncs in #966
- chore: bump v0.2.11 by @zhyncs in #970
- Purge self-runner's pip cache weekly by @hnyls2002 in #975
- Run purge-cache only in sgl-project by @hnyls2002 in #976
- misc: correct the int data type for token ids and indices by @xiezhq-hermann in #969
- PrefillAdder abstraction by @hnyls2002 in #968
- RadixCache method adjust by @hnyls2002 in #977
- Adjust max prefix len by @hnyls2002 in #980
- #590 Increase default , track changes in examples and documentation by @foszto in #971
- [minor] Update type annotation in tokenizer_manager.py by @Ying1123 in #982
- Fix chunked prefill by @hnyls2002 in #984
- Add llama embedding modules [unreachable code] - step 1/3 by @Ying1123 in #983
- Add io struct for embedding models [unreachable code] - step 2/3 by @Ying1123 in #987
- Adjust `InputeMetadata` and `ScheduleBatch` by @hnyls2002 in #981
- support more optioin about usage in stream mode by @yichuan520030910320 in #985
- Create contributor_guide.md by @Ying1123 in #992
- feat: frequency, min_new_tokens, presence, and repetition penalties by @vhain in #973
- Move torch.compile configs into cuda_graph_runner.py by @Ying1123 in #993
- Add e5-mistral embedding model - step 3/3 by @Ying1123 in #988
- test: negative value testing for frequency, presence penalizers by @vhain in #995
- support models from www.modelscope.cn by @liuyhwangyh in #994
- bugfix: penalizers to be merged before reqs by @vhain in #1001
- fix: resolve correctness_test issue by @zhyncs in #1002
- Minor bugfix on benchmark serving by @ywang96 in #1005
- Add openai embedding API by @Ying1123 in #997
- Add skip_tokenizer_init args. by @gryffindor-rr in #959
- Fix benchmark latency by @wisclmy0611 in #1007
- Some warnings to crash when CI by @hnyls2002 in #1009
- Reduce the overhead when cache is disabled by @hnyls2002 in #1010
- Support embedding input as a list by @Ying1123 in #1014
- misc: update test config by @zhyncs in #990
- fix: force max new tokens to be 1 for embedding request by @Ying1123 in #1019
- Clean up unit tests by @merrymercy in #1020
- Fix `input_ids` && rename to `fill_ids` by @hnyls2002 in #1021
- feat: use FlashInfer rmsnorm and silu by @zhyncs in #907
- misc: update issue template by @zhyncs in #1024
- Clean up readme and arguments of chunked prefill by @merrymercy in #1022
- Fix wrong assert by @hnyls2002 in #1028
- Improve type annotation by @merrymercy in #1029
- hotfix: add CustomOp abstraction by @zhyncs in #1027
- Fix the case where r.prefix_indices is None by @merrymercy in #1031
- Fix triton args init by @hnyls2002 in #1034
- Fix the case when max_new_tokens is too large by @merrymercy in #1025
- Test the case when max_new_tokens is very large by @merrymercy in #1038
- Fix the prefix indices by @hnyls2002 in #1037
- Improve end-to-end throughput test and its coverage by @merrymercy in #1039
- Delete the useless test/srt/test_throughput.py by @merrymercy in #1045
- minor: some potential bugs by @hnyls2002 in #1044
- Clean up the comments and names under python/sglang/srt/layers by @merrymercy in #1047
- fix...
Release v0.2.9
Highlights
- New feature: Chunked prefill (#800, #811)
- New models: Deepseek v2
- Performance improvement: vectorized logprob computation
- Accuracy fix: fix the double BOS problem in the chat template; move logits to float32; update flashinfer sampling kernels
- Feature fix: fixed many missing logprob-related features in the OpenAI API server
- CI/CD infra is now fully ready. The tests cover frontend, backend, accuracy, and performance tests.
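As a rough illustration of the idea behind the chunked prefill feature listed above, the sketch below splits a long prompt into bounded pieces so that each prefill step handles at most a fixed number of tokens. This is a conceptual example only, not SGLang's scheduler code: the function name `chunk_prompt` and the chunk size of 4 are hypothetical and chosen purely for illustration.

```python
from typing import Iterator, List


def chunk_prompt(token_ids: List[int], chunk_size: int) -> Iterator[List[int]]:
    """Yield consecutive slices of a prompt, each at most chunk_size tokens long."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(token_ids), chunk_size):
        yield token_ids[start:start + chunk_size]


# Stand-in for the token ids of a 10-token prompt (hypothetical data).
prompt = list(range(10))
chunks = list(chunk_prompt(prompt, chunk_size=4))
print([len(c) for c in chunks])  # [4, 4, 2]
```

In a real serving engine each chunk would be run through the model in turn while the KV cache accumulates; the sketch covers only the splitting step.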
What's Changed
- Deepseek v2 support by @hnyls2002 in #693
- Fix context length by @hnyls2002 in #757
- docs: update model support by @zhyncs in #760
- fix: not run workflows on fork repo by @zhyncs in #762
- Update supported models by @hnyls2002 in #763
- Fix TransformerTokenizer init for chatglm2 & 3 by @ispobock in #761
- [Minor] Improve the code style in TokenizerManager by @merrymercy in #767
- Update readme by @Ying1123 in #769
- feat: add fake tag by @zhyncs in #770
- Fix max_tokens for OpenAI chat completion API by @merrymercy in #766
- Fix max new tokens by @merrymercy in #772
- Move sampling logits to float32 by @merrymercy in #773
- minor refactor: move check server args to server_args.py by @wisclmy0611 in #774
- Fix return_log_probs with cuda graph by @merrymercy in #775
- Rename prefill_token_logprobs -> input_token_logprobs; decode_token_logprobs -> output_token_logprobs by @merrymercy in #776
- Allow disabling flashinfer sampling kernel by @merrymercy in #778
- Bump version to 0.2.6 by @merrymercy in #779
- fix: replace pillow with PIL in PACKAGE_LIST by @zhyncs in #781
- docs: init readthedocs support by @zhyncs in #783
- fix: init readthedocs support by @zhyncs in #784
- fix: exclude logo png in gitignore by @zhyncs in #785
- docs: update index by @zhyncs in #786
- Vectorize logprobs computation by @Ying1123 in #787
- docs: update README by @zhyncs in #788
- docs: make badges center by @zhyncs in #789
- chore: add copyright for srt by @zhyncs in #790
- Fix echo + lobprob for OpenAI API when the prompt is a list by @Ying1123 in #791
- Update README.md by @Ying1123 in #792
- Lazy-import third-party backends by @bgyoon in #794
- Fix lazy import location by @Ying1123 in #795
- Fix logging by @Ying1123 in #796
- Add role documentation, add system begin & end tokens by @objnf-dev in #793
- Chunked prefill support by @hnyls2002 in #797
- Revert "Chunked prefill support" by @Ying1123 in #799
- Chunked prefill by @hnyls2002 in #800
- fix: update flashinfer to 0.1.2 to fix sampling for cu118 by @zhyncs in #803
- Revert "fix: update flashinfer to 0.1.2 to fix sampling for cu118" by @Ying1123 in #805
- feat: add chat template for internlm2-chat by @zhyncs in #802
- Revert "Revert "fix: update flashinfer to 0.1.2 to fix sampling for cu118"" by @Ying1123 in #806
- Add support for OpenAI API : offline batch(file) processing by @yichuan520030910320 in #699
- Organize public APIs by @hnyls2002 in #809
- Remove inf value for chunked prefill size by @hnyls2002 in #812
- Revert "Organize public APIs" by @Ying1123 in #815
- fix: use v0.2.5 for benchmark by @zhyncs in #814
- Fix LiteLLM kwargs by @qeternity in #817
- Code structure refactor by @hnyls2002 in #807
- docs: update README by @zhyncs in #819
- Fix streaming bug by @objnf-dev in #820
- feat: add runner by @zhyncs in #821
- feat: add pr e2e test by @zhyncs in #822
- Support disable_ignore_eos in bench_serving.py by @Ying1123 in #824
- Adjust default mem fraction to avoid OOM by @Ying1123 in #823
- Add awq_marlin by @Ying1123 in #826
- misc: update e2e test benchmark config by @zhyncs in #825
- misc: enable e2e test when push by @zhyncs in #828
- docs: add set up runner by @zhyncs in #829
- chore: bump v0.2.7 by @zhyncs in #830
- Add `--max-total-tokens` by @hnyls2002 in #840
- Fix List input bug by @yichuan520030910320 in #838
- Add req slots leaking check by @hnyls2002 in #842
- docs: update README.md by @eltociear in #843
- misc: update e2e test paths config by @zhyncs in #848
- chore: update flashinfer to v0.1.3 by @zhyncs in #850
- Fix llama for classification by @Ying1123 in #855
- Add troubleshooting doc by @Ying1123 in #856
- Fix #857 by @kaifronsdal in #858
- Add support for logprobs in OpenAI chat API by @yichuan520030910320 in #852
- Support chunked prefill when radix cache is disabled by @hnyls2002 in #811
- misc: update e2e test paths config by @zhyncs in #860
- Rename github workflows by @Ying1123 in #861
- misc: disable auto release by @zhyncs in #862
- misc: add cancel previous at e2e by @zhyncs in #864
- Add OpenAI backend to the CI test by @Ying1123 in #869
- Fix openai CI tests by @Ying1123 in #870
- misc: use pip cache purge and add unit test ci by @zhyncs in #871
- misc: update unit test config by @zhyncs in #873
- Fix unit tests for the frontend language part by @Ying1123 in #872
- bump to 0.2.8 by @Ying1123 in #877
- Make scripts under `/test/srt` as unit tests by @Ying1123 in #875
- Update runner docs by @hnyls2002 in #876
- Improve the coverage of the openai api server test by @Ying1123 in #878
- Implement served_model_name to customize model id when use local mode… by @dionren in #749
- Update runner docs by @hnyls2002 in #879
- Add more unit tests to CI by @Ying1123 in #880
- Add accuracy test to CI: MMLU by @Ying1123 in #882
- Update workflow name by @Ying1123 in #883
- Fix the double BOS problem in the HF chat template by @Ying1123 in #888
- Add benchmark: HumanEval by @Ying1123 in #889
- Increase openai client limit by @Ying1123 in #886
- Bump version to v0.2.9 by @Ying1123 in #890
New Contributors
- @bgyoon made their first contribution in #794
- @objnf-dev made their first contribution in #793
- @kaifronsdal made their first contribution in #858
- @dionren made their first contribution in #749
Full Changelog: v0.2.5...v0.2.9
Release v0.2.5
Highlights
- We recently released a blog post. Compared to TensorRT-LLM and vLLM, SGLang Runtime consistently delivers superior or competitive performance in both online and offline scenarios, handling models from Llama-8B to Llama-405B, and on A100 and H100 GPUs, using FP8 and FP16. SGLang consistently outperforms vLLM, achieving up to 3.1x higher throughput on Llama-70B. It also often matches or sometimes outperforms TensorRT-LLM.
- We have now automated the release processes for PyPI, Docker, and Release using GitHub workflows. Previously, because Release was not automated, GitHub Tags were not updated in time, leading to a jump from v0.2.0 directly to v0.2.5.
- Everyone is welcome to try https://github.com/sgl-project/sglang and to participate actively in the community, including but not limited to issues, PRs, and discussions. Cheers!
Release v0.2.0
Highlights
- We performed extensive engineering to improve the base performance. Compared to TensorRT-LLM and vLLM, SGLang now consistently delivers superior or competitive performance in both online and offline scenarios, handling models from Llama-8B to Llama-405B, on A100 and H100 GPUs, using FP8 and FP16. See the latest blog.
- New models: Llama3 405B, Deepseek MoE, InternLM, GPTBigCode, Mistral-Nemo
What's Changed
- Optimize mem indices mangement by @hnyls2002 in #619
- Unify index operations by @hnyls2002 in #620
- Simplify mem state by @wisclmy0611 in #623
- Improve tensor parallel performance by @Ying1123 in #625
- Bump version to 0.1.21 by @Ying1123 in #626
- Fix model forward grad by @hnyls2002 in #628
- Update docker file by @Ying1123 in #629
- Disable NCCL_NVLS by default by @Ying1123 in #631
- Add qwen2 tie word embedding by @yileld in #630
- Add support for VertexAI safety settings by @AidanCooper in #624
- Fix vertexai by @hnyls2002 in #633
- Reduce docker size by @hnyls2002 in #632
- clean up step function by @Ying1123 in #635
- feat: support internlm2 by @zhyncs in #636
- misc: add pre-commit config by @zhyncs in #637
- misc: add issue and pr template by @zhyncs in #638
- Flashinfer sample kernel by @hnyls2002 in #617
- Move `global_server_args_dict` by @hnyls2002 in #642
- Increase the capacity of the memory pool by @Ying1123 in #643
- feat: add check_env by @zhyncs in #645
- Remove the dependency of rpyc by @wisclmy0611 in #646
- misc: rm rpyc from PACKAGE_LIST by @zhyncs in #649
- fix: set ulimit -n 65535 by @zhyncs in #647
- feat: add lint workflow by @zhyncs in #648
- fix: resolve lint error by @zhyncs in #650
- Remove useless variables in infer_batch.py by @Ying1123 in #651
- Detokenize incrementally when streaming by @hnyls2002 in #653
- `TokenizerManager.context_len` should inherit from `server_args.conte… by @shrirajh in #654
- Remove cached triton launcher by @merrymercy in #656
- perf: reduce ttft and itl with stream_interval 1 by @zhyncs in #658
- feat: add benchmark serving by @zhyncs in #657
- refactor model loader [unreachable code]: initial refactor by @Ying1123 in #655
- misc: update SGLang package description by @zhyncs in #659
- Update Readme by @Ying1123 in #660
- feat: update check env by @zhyncs in #661
- Improve docs by @Ying1123 in #662
- Add benchmark instructions by @Ying1123 in #663
- Fix jump forward when streaming by @hnyls2002 in #665
- Fix kill process util by @ispobock in #666
- Add support for OpenAI API parallel sampling by @yichuan520030910320 in #640
- Update OpenAI API by @wisclmy0611 in #667
- Temporary fix invalid sample results by @hnyls2002 in #668
- Support random dataset in bench_serving.py by @merrymercy in #669
- Revert "Temporary fix invalid sample results" by @hnyls2002 in #673
- refactor model loader: initial refactor by @Ying1123 in #664
- Fix cuda graph with flashinfer by @merrymercy in #675
- Tmp fix illegal sample by @hnyls2002 in #676
- Update version to 0.1.22 by @Ying1123 in #677
- Fallback when sampling failed by @ispobock in #678
- feat: support TRT LLM benchmark and multiple benchmarks by @zhyncs in #670
- Decouple kv by @hnyls2002 in #679
- Support gpt-bigcode model class by @hnyls2002 in #681
- support non-streaming benchmark by @merrymercy in #682
- Fix StreamExecutor.fork() losing the current role start index. by @max99x in #684
- feat: update bench serving by @zhyncs in #685
- misc: update output file logic by @zhyncs in #686
- Allow disabling streaming in bench by @merrymercy in #687
- docs: update README by @zhyncs in #688
- Support Deepseek MoE Model by @hnyls2002 in #689
- misc: recommend to use chat model for benchmark by @zhyncs in #690
- Support Mistral-Nemo by @ispobock in #691
- docs: update README by @zhyncs in #692
- fix: update bench serving by @zhyncs in #694
- misc: update output token logic by @zhyncs in #695
- Tune params by @Ying1123 in #696
- Fix trt benchmark by @Ying1123 in #697
- misc: fix typo by @zhyncs in #698
- Fix flashinfer by @Ying1123 in #700
- Fix hf config loading by @ispobock in #702
- Use min new token ratio at start by @hnyls2002 in #701
- feat: add e2e latency by @zhyncs in #704
- Update vllm version to support llama3.1 by @Ying1123 in #705
- bump version to 0.1.23 by @Ying1123 in #706
- Reduce hardcoded logic of kernel usage by @wisclmy0611 in #707
- Fix multi-node deadlock by @merrymercy in #709
- Auto adjust new ratio by @hnyls2002 in #708
- Fix prefill size by @Ying1123 in #711
- docs: update README by @zhyncs in #712
- docs: update doc by @zhyncs in #713
- fix: llama 3.1 405b fp8 by @zhyncs in #714
- misc: update doc by @zhyncs in #715
- Improve benchmark scripts by @Ying1123 in #717
- Bump version to 0.1.24 by @Ying1123 in #718
- docs: update supported models by @zhyncs in #719
- docs: update comment by @zhyncs in #721
- chore: add close inactive issues workflow by @zhyncs in #722
- misc: update bulid instruction by @zhyncs in #724
- fix: fp8 config by @Ying1123 in #723
- Fix dockerfile and triton cache manager by @hnyls2002 in #720
- chore: bump v0.1.25 by @zhyncs in #725
- fix: resolve the logo display issue on the PyPI page by @zhyncs in #726
- misc: update bug issue template by @zhyncs in #727
- Revert "fix: fp8 config" by @Ying1123 in #728
- Fix bugs (fp8 checkpoints, triton cache manager) by @Ying1123 in #729
- Bump version to 0.2.0 by @Ying1123 in #730
New Contributors
- @yileld made their first contribution in #630
- @AidanCooper made their first contribution in #624
- @zhyncs made their first contribution in #636
- @shrirajh made their first contribution in #654
- @yichuan520030910320 made their first contribution in https://github.com/...