[Bug] OOM when benchmarking #810
Comments
Hi @lxww302, could you try using a smaller value for mem-frac?
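For concreteness, a sketch of what that suggestion could look like. This assumes "mem-frac" refers to sglang's `--mem-fraction-static` launch flag, and the value `0.8` is a hypothetical example, not a recommendation from this thread; check `python3 -m sglang.launch_server --help` for the exact flag name and default:

```shell
# Relaunch the server with a smaller static memory fraction, leaving more
# headroom for transient allocations (e.g. the logits cast in the traceback).
python3 -m sglang.launch_server \
  --model-path neuralmagic/Meta-Llama-3-70B-Instruct-FP8 \
  --disable-radix-cache \
  --tp 8 \
  --mem-fraction-static 0.8
```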
@lxww302 You can expect using
Hi @lxww302 Sorry for the inconvenience, could you try using version v0.2.5? Thanks.

```shell
git clone https://github.com/sgl-project/sglang.git
cd sglang
git checkout v0.2.5
pip install --upgrade pip
pip install -e "python[all]"
pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.3/
```
I tried v0.2.5 on 8x H100, and it worked for me.

```shell
# server
python3 -m sglang.launch_server --model-path neuralmagic/Meta-Llama-3-70B-Instruct-FP8 --disable-radix-cache --tp 8
# client
python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompts 6000 --random-input 256 --random-output 512 --output-file offline.jsonl
```
Fixed by #823
We will release v0.2.7 soon, and the performance gets better: 8007.95 -> 9280.91
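For scale, a quick sanity check on the quoted speedup. This assumes both numbers are the same throughput metric (e.g. output tokens/s as reported by `sglang.bench_serving`); the thread does not state the unit:

```python
# Relative improvement between the two throughput figures quoted above.
before, after = 8007.95, 9280.91
gain_pct = (after - before) / before * 100
print(f"{gain_pct:.1f}% faster")  # roughly a 15.9% improvement
```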
Checklist
Describe the bug
Out-of-memory error encountered while benchmarking:
```
Exception in ModelTpServer: Traceback (most recent call last):
  File "/opt/tiger/sglang/python/sglang/srt/managers/controller/tp_worker.py", line 209, in exposed_step
    self.forward_step()
  File "/home/tiger/.local/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/opt/tiger/sglang/python/sglang/srt/managers/controller/tp_worker.py", line 240, in forward_step
    self.forward_decode_batch(self.running_batch)
  File "/opt/tiger/sglang/python/sglang/srt/managers/controller/tp_worker.py", line 650, in forward_decode_batch
    output = self.model_runner.forward(batch, ForwardMode.DECODE)
  File "/opt/tiger/sglang/python/sglang/srt/managers/controller/model_runner.py", line 364, in forward
    return self.forward_decode(batch)
  File "/home/tiger/.local/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/opt/tiger/sglang/python/sglang/srt/managers/controller/model_runner.py", line 317, in forward_decode
    return self.model.forward(
  File "/home/tiger/.local/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/opt/tiger/sglang/python/sglang/srt/models/llama2.py", line 331, in forward
    return self.logits_processor(
  File "/home/tiger/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/tiger/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/tiger/sglang/python/sglang/srt/layers/logits_processor.py", line 164, in forward
    last_logits = last_logits[:, : self.config.vocab_size].float()
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.25 GiB. GPU
```
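For intuition about why this allocation is so large: the failing line casts the last-token logits of the whole decode batch to float32 over the full vocabulary. A back-of-the-envelope sketch, assuming Llama-3's vocab_size of 128256; the batch size of 4700 is a hypothetical value chosen to match the 2.25 GiB figure, not taken from the thread:

```python
# Rough size of the [batch_size, vocab_size] float32 tensor created by the
# .float() cast at logits_processor.py line 164.
# Assumptions: Llama-3 vocab_size = 128256; batch_size 4700 is hypothetical.
vocab_size = 128256
bytes_per_fp32 = 4

def logits_bytes(batch_size: int) -> int:
    """Bytes needed for a float32 logits tensor of shape [batch_size, vocab_size]."""
    return batch_size * vocab_size * bytes_per_fp32

print(f"{logits_bytes(4700) / 2**30:.2f} GiB")  # about 2.25 GiB
```

So with thousands of concurrent requests, the transient float32 logits tensor alone can reach multiple GiB, which is why a smaller memory fraction (leaving more free headroom) or a smaller batch avoids the OOM.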
Reproduction

Serving:

```shell
python3 -m sglang.launch_server --model-path neuralmagic/Meta-Llama-3-70B-Instruct-FP8 --disable-radix-cache --tp 8
```

Benchmarking:

```shell
python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompts 6000 --random-input 256 --random-output 512 --output-file offline.jsonl
```
Environment