[Bug] OOM when benchmarking #810

Closed
lxww302 opened this issue Jul 29, 2024 · 6 comments

Comments

lxww302 (Contributor) commented Jul 29, 2024

Checklist

  1. I have searched related issues but cannot get the expected help.
  2. The bug has not been fixed in the latest version.
  3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.

Describe the bug

An out-of-memory error is encountered while running the benchmark below.

Exception in ModelTpServer: Traceback (most recent call last):
  File "/opt/tiger/sglang/python/sglang/srt/managers/controller/tp_worker.py", line 209, in exposed_step
    self.forward_step()
  File "/home/tiger/.local/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/opt/tiger/sglang/python/sglang/srt/managers/controller/tp_worker.py", line 240, in forward_step
    self.forward_decode_batch(self.running_batch)
  File "/opt/tiger/sglang/python/sglang/srt/managers/controller/tp_worker.py", line 650, in forward_decode_batch
    output = self.model_runner.forward(batch, ForwardMode.DECODE)
  File "/opt/tiger/sglang/python/sglang/srt/managers/controller/model_runner.py", line 364, in forward
    return self.forward_decode(batch)
  File "/home/tiger/.local/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/opt/tiger/sglang/python/sglang/srt/managers/controller/model_runner.py", line 317, in forward_decode
    return self.model.forward(
  File "/home/tiger/.local/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/opt/tiger/sglang/python/sglang/srt/models/llama2.py", line 331, in forward
    return self.logits_processor(
  File "/home/tiger/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/tiger/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/tiger/sglang/python/sglang/srt/layers/logits_processor.py", line 164, in forward
    last_logits = last_logits[:, : self.config.vocab_size].float()
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.25 GiB. GPU

Reproduction

serving:
python3 -m sglang.launch_server --model-path neuralmagic/Meta-Llama-3-70B-Instruct-FP8 --disable-radix-cache --tp 8
benchmarking:
python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompts 6000 --random-input 256 --random-output 512 --output-file offline.jsonl

Environment

Python: 3.9.2 (default, Feb 28 2021, 17:03:44) [GCC 10.2.1 20210110]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA H100 80GB HBM3
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.2, V12.2.140
CUDA Driver Version: 535.129.03
PyTorch: 2.3.1+cu121
flashinfer: 0.1.2+cu121torch2.3
requests: 2.32.3
tqdm: 4.66.4
numpy: 1.26.4
aiohttp: 3.9.5
fastapi: 0.111.1
hf_transfer: 0.1.8
huggingface_hub: 0.24.2
interegular: 0.3.3
packaging: 24.0
PIL: 10.3.0
psutil: 6.0.0
pydantic: 2.8.2
uvicorn: 0.30.3
uvloop: 0.19.0
zmq: 26.0.3
vllm: 0.5.3.post1
openai: 1.37.1
anthropic: 0.31.2
NVIDIA Topology: 
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV18    NV18    NV18    NV18    NV18    NV18    NV18    0-51,104-155    0               N/A
GPU1    NV18     X      NV18    NV18    NV18    NV18    NV18    NV18    0-51,104-155    0               N/A
GPU2    NV18    NV18     X      NV18    NV18    NV18    NV18    NV18    0-51,104-155    0               N/A
GPU3    NV18    NV18    NV18     X      NV18    NV18    NV18    NV18    0-51,104-155    0               N/A
GPU4    NV18    NV18    NV18    NV18     X      NV18    NV18    NV18    52-103,156-207  1               N/A
GPU5    NV18    NV18    NV18    NV18    NV18     X      NV18    NV18    52-103,156-207  1               N/A
GPU6    NV18    NV18    NV18    NV18    NV18    NV18     X      NV18    52-103,156-207  1               N/A
GPU7    NV18    NV18    NV18    NV18    NV18    NV18    NV18     X      52-103,156-207  1               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

ulimit soft: 1024768

Ying1123 (Member) commented Jul 29, 2024

Hi @lxww302, could you try a smaller value for mem-frac, e.g. --mem-frac 0.8, when launching the server? A higher value is more aggressive: it can give you higher throughput, but also a higher risk of OOM.
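
A sketch of the adjusted launch command, reusing the reproduction setup and the flag name as written in the suggestion above (the exact flag name may differ across versions):

# launch with a less aggressive memory fraction to leave headroom for logits/activations
python3 -m sglang.launch_server --model-path neuralmagic/Meta-Llama-3-70B-Instruct-FP8 --disable-radix-cache --tp 8 --mem-frac 0.8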

hnyls2002 (Collaborator)

@lxww302 You can expect to be able to use --chunked-prefill-size 8192 with the radix cache disabled soon. I am working on this.
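
Once that lands, enabling it would look roughly like the sketch below (flag name and value taken from the comment above; hypothetical until the feature is merged):

# chunked prefill with the radix cache disabled
python3 -m sglang.launch_server --model-path neuralmagic/Meta-Llama-3-70B-Instruct-FP8 --disable-radix-cache --chunked-prefill-size 8192 --tp 8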

zhyncs (Member) commented Jul 30, 2024

Hi @lxww302, sorry for the inconvenience. Could you try version v0.2.5? Thanks.
https://github.com/sgl-project/sglang/pull/814/files

git clone https://github.com/sgl-project/sglang.git
cd sglang
git checkout v0.2.5

pip install --upgrade pip
pip install -e "python[all]"

pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.3/

zhyncs (Member) commented Jul 30, 2024

I tried v0.2.5 on 8x H100, and it worked for me.

# server
python3 -m sglang.launch_server --model-path neuralmagic/Meta-Llama-3-70B-Instruct-FP8 --disable-radix-cache --tp 8

# client
python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompts 6000 --random-input 256 --random-output 512 --output-file offline.jsonl
Python: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA H100 80GB HBM3
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.1, V12.1.105
CUDA Driver Version: 545.23.08
PyTorch: 2.3.1+cu121
sglang: 0.2.5
flashinfer: 0.1.2+cu121torch2.3
requests: 2.32.3
tqdm: 4.66.4
numpy: 1.26.3
aiohttp: 3.9.5
fastapi: 0.111.1
hf_transfer: 0.1.8
huggingface_hub: 0.24.3
interegular: 0.3.3
packaging: 23.2
pillow: Module Not Found
psutil: 5.9.8
pydantic: 2.8.2
uvicorn: 0.30.3
uvloop: 0.19.0
zmq: 24.0.1
vllm: 0.5.3.post1
openai: 1.37.1
anthropic: 0.32.0
NVIDIA Topology:
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    NIC0    NIC1    NIC2    NIC3    NIC4    NIC5    NIC6    NIC7    NIC8    NIC9    CPU Affinity       NUMA Affinity   GPU NUMA ID
GPU0     X      NV18    NV18    NV18    NV18    NV18    NV18    NV18    PIX     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     0-47,96-143        0               N/A
GPU1    NV18     X      NV18    NV18    NV18    NV18    NV18    NV18    SYS     PIX     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     0-47,96-143        0               N/A
GPU2    NV18    NV18     X      NV18    NV18    NV18    NV18    NV18    SYS     SYS     PIX     SYS     SYS     SYS     SYS     SYS     SYS     SYS     0-47,96-143        0               N/A
GPU3    NV18    NV18    NV18     X      NV18    NV18    NV18    NV18    SYS     SYS     SYS     SYS     SYS     PIX     SYS     SYS     SYS     SYS     0-47,96-143        0               N/A
GPU4    NV18    NV18    NV18    NV18     X      NV18    NV18    NV18    SYS     SYS     SYS     SYS     SYS     SYS     PIX     SYS     SYS     SYS     48-95,144-191      1               N/A
GPU5    NV18    NV18    NV18    NV18    NV18     X      NV18    NV18    SYS     SYS     SYS     SYS     SYS     SYS     SYS     PIX     SYS     SYS     48-95,144-191      1               N/A
GPU6    NV18    NV18    NV18    NV18    NV18    NV18     X      NV18    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PIX     SYS     48-95,144-191      1               N/A
GPU7    NV18    NV18    NV18    NV18    NV18    NV18    NV18     X      SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PIX     48-95,144-191      1               N/A
NIC0    PIX     SYS     SYS     SYS     SYS     SYS     SYS     SYS      X      SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS
NIC1    SYS     PIX     SYS     SYS     SYS     SYS     SYS     SYS     SYS      X      SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS
NIC2    SYS     SYS     PIX     SYS     SYS     SYS     SYS     SYS     SYS     SYS      X      SYS     SYS     SYS     SYS     SYS     SYS     SYS
NIC3    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS      X      PIX     SYS     SYS     SYS     SYS     SYS
NIC4    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PIX      X      SYS     SYS     SYS     SYS     SYS
NIC5    SYS     SYS     SYS     PIX     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS      X      SYS     SYS     SYS     SYS
NIC6    SYS     SYS     SYS     SYS     PIX     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS      X      SYS     SYS     SYS
NIC7    SYS     SYS     SYS     SYS     SYS     PIX     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS      X      SYS     SYS
NIC8    SYS     SYS     SYS     SYS     SYS     SYS     PIX     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS      X      SYS
NIC9    SYS     SYS     SYS     SYS     SYS     SYS     SYS     PIX     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS      X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
  NIC3: mlx5_3
  NIC4: mlx5_4
  NIC5: mlx5_5
  NIC6: mlx5_6
  NIC7: mlx5_7
  NIC8: mlx5_8
  NIC9: mlx5_9


ulimit soft: 1048576
============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Successful requests:                     6000
Benchmark duration (s):                  193.23
Total input tokens:                      767114
Total generated tokens:                  1547384
Total generated tokens (retokenized):    1537961
Request throughput (req/s):              31.05
Input token throughput (tok/s):          3969.93
Output token throughput (tok/s):         8007.95
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   131745.67
Median E2E Latency (ms):                 144451.28
---------------Time to First Token----------------
Mean TTFT (ms):                          33387.84
Median TTFT (ms):                        20707.56
P99 TTFT (ms):                           125340.21
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          498.73
Median TPOT (ms):                        417.23
P99 TPOT (ms):                           2383.27
---------------Inter-token Latency----------------
Mean ITL (ms):                           567.61
Median ITL (ms):                         252.21
P99 ITL (ms):                            1319.41
==================================================

Ying1123 (Member)

fixed by #823

zhyncs (Member) commented Jul 30, 2024

We will release v0.2.7 soon, and the performance gets even better!

Output token throughput (tok/s): 8007.95 -> 9280.91 (roughly a 16% improvement)

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Successful requests:                     6000
Benchmark duration (s):                  166.73
Total input tokens:                      767114
Total generated tokens:                  1547384
Total generated tokens (retokenized):    1538674
Request throughput (req/s):              35.99
Input token throughput (tok/s):          4601.00
Output token throughput (tok/s):         9280.91
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   116566.98
Median E2E Latency (ms):                 126900.46
---------------Time to First Token----------------
Mean TTFT (ms):                          33167.51
Median TTFT (ms):                        21921.46
P99 TTFT (ms):                           111985.91
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          433.31
Median TPOT (ms):                        351.04
P99 TPOT (ms):                           2233.24
---------------Inter-token Latency----------------
Mean ITL (ms):                           489.94
Median ITL (ms):                         204.58
P99 ITL (ms):                            1165.85
==================================================
