[Bug] generate wrong sequences with higher temperature #771

Closed

StevenZHB opened this issue Jul 27, 2024 · 14 comments · Fixed by #803 or #850
Labels: bug Something isn't working

@StevenZHB

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.

Describe the bug

I started a vllm server and a sglang server with the same model. I found that the sglang server outputs unreadable tokens at high temperature, while the vllm server does not.
Example:
sampling_params = {"temperature":1,"n":1}
sglang server: 'Expossible! description! description!爽 ipairs soc!!爽爽爽爽'
vllm server: "Let's start by using the given information to set up three equations:\n1."

sampling_params = {"temperature":0.2,"n":1}
sglang server: "Let's start by using the given information to set up three equations:\n1."
vllm server: "Let's start by using the given information to set up three equations:\n1."

Maybe it's related to #523; I don't know how to fix it.

Reproduction

CUDA_VISIBLE_DEVICES=0 nohup python3 -m sglang.launch_server --model-path llama3-8B-instruct --port 9554 --disable-cuda-graph --mem-fraction-static 0.75 --max-prefill-tokens 12800
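
For reference, a minimal client sketch (mine, not from the original report) that exercises the server launched above, assuming the OpenAI-compatible /v1/completions endpoint on port 9554 from that launch command; the model name and prompt are placeholders:

```python
# Hypothetical reproduction client (not from the issue): query the sglang
# server launched above and print the completion. With temperature=1 the
# output came back garbled, while temperature=0.2 looked normal.
import requests

resp = requests.post(
    "http://127.0.0.1:9554/v1/completions",
    json={
        "model": "llama3-8B-instruct",   # placeholder model name
        "prompt": "Set up equations for the word problem and solve it.",
        "max_tokens": 100,
        "temperature": 1,                # try 0.2 to compare
        "n": 1,
    },
    timeout=60,
)
print(resp.json()["choices"][0]["text"])
```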

Environment

python3 -m sglang.check_env
Python: 3.9.17 (main, Jul  5 2023, 20:41:20) [GCC 11.2.0]
CUDA available: True
GPU 0,1: NVIDIA A800-SXM4-80GB
CUDA_HOME: /zhanghongbo/CUDA/cuda-11_8
NVCC: Cuda compilation tools, release 11.8, V11.8.89
CUDA Driver Version: 525.105.17
PyTorch: 2.3.0+cu118
flashinfer: 0.1.1+cu118torch2.3
requests: 2.31.0
tqdm: 4.66.1
numpy: 1.25.0
aiohttp: 3.8.5
fastapi: 0.110.0
hf_transfer: Module Not Found
huggingface_hub: 0.23.2
interegular: 0.3.3
packaging: 24.0
pillow: Module Not Found
psutil: 5.9.8
pydantic: 2.5.0
uvicorn: 0.23.2
uvloop: 0.19.0
zmq: 25.1.2
vllm: 0.5.3.post1
openai: 1.30.0
anthropic: Module Not Found
litellm: Module Not Found
NVIDIA Topology: 
        GPU0    GPU1    NIC0    CPU Affinity    NUMA Affinity
GPU0     X      NV8     SYS     48-63   3
GPU1    NV8      X      SYS     48-63   3
NIC0    SYS     SYS      X 

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0


ulimit soft: 1048576
@merrymercy added the bug (Something isn't working) label on Jul 27, 2024
@merrymercy
Contributor

Thanks for reporting this. I can reproduce the error with llama-3-8b.

One possible reason is that we previously stored the logits in bfloat16. This can cause some problems, but it has been fixed by PR #773. Could you try the main branch again?

After PR #773, temp=1 works great for me. However, I still get random output with temp=3. I guess this is expected: llama-3 has a vocab size of 128K, and temp=3 flattens the logits so much that softmax/sampling becomes unstable.
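
For intuition, here is a toy sketch (mine, not from this thread) of how temperature scaling flattens the sampling distribution over a Llama-3-sized vocabulary; the logits are random placeholders rather than real model outputs:

```python
# Toy illustration: higher temperature flattens softmax(logits / T) over a
# 128K-token vocabulary, so the top token's probability collapses and the
# entropy grows, which makes sampled tokens look increasingly random.
import torch

vocab_size = 128_256                    # Llama-3 vocabulary size
torch.manual_seed(0)
logits = torch.randn(vocab_size) * 5.0  # synthetic logits, not model outputs

for temp in (0.2, 1.0, 3.0):
    probs = torch.softmax(logits / temp, dim=-1)
    entropy = torch.special.entr(probs).sum().item()  # treats 0*log(0) as 0
    print(f"temperature={temp}: top-1 prob={probs.max().item():.4f}, "
          f"entropy={entropy:.2f} nats")
```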

@ispobock
Collaborator

Hi @merrymercy, the issue still seems to be there when I use the flashinfer sampling kernel, but it's fine when I switch to the fallback torch implementation.

python3 -m sglang.launch_server --model-path /workdir/llm_models/Meta-Llama-3-8B --port 30000 --trust-remote-code

curl -X POST http://0.0.0.0:30000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "/workdir/llm_models/Meta-Llama-3-8B",
"prompt": "Please introduce yourself",
"max_tokens": 100,
"stream": false,
"temperature": 1
}'

env:

cuda 11.8
flashinfer: 0.1.1+cu118torch2.3 & main branch
torch: 2.3.1+cu118
sglang: main branch

cc: @yzh119

@yzh119
Collaborator

yzh119 commented Jul 28, 2024

This should be a bug in the flashinfer sampling kernel, thanks for reporting.

@merrymercy
Contributor

merrymercy commented Jul 28, 2024

@StevenZHB Please try the main branch first to see if casting the logits to float32 resolves your issue (#773).
If you still see wrong outputs, please try adding --disable-flashinfer-sampling when you launch the server (#778).

@StevenZHB
Author

StevenZHB commented Jul 28, 2024

@merrymercy Thanks, I have tried the main branch. It still generates wrong outputs, but the problem is resolved by setting --disable-flashinfer-sampling, so it's probably a bug in flashinfer.

@merrymercy
Contributor

merrymercy commented Jul 28, 2024

@yzh119 This seems like a critical bug. Could you prioritize a fix?

@yzh119
Collaborator

yzh119 commented Jul 28, 2024

Yes, I'll release v0.1.2 tonight.

@yzh119
Collaborator

yzh119 commented Jul 29, 2024

I just noticed that the user is using cu118; this bug was already fixed in flashinfer-ai/flashinfer#386.

@ispobock
Collaborator

@yzh119 I tried flashinfer 0.1.2 but still hit the issue. It seems different from flashinfer-ai/flashinfer#384; could you help check?
@StevenZHB could you verify flashinfer 0.1.2 in your environment?

@yzh119
Collaborator

yzh119 commented Jul 29, 2024

Sure, I can check. Could you dump the logits so that I can debug on them directly?

@yzh119
Collaborator

yzh119 commented Jul 30, 2024

Bug reproduced. I think I have figured it out; v0.1.3 will fix the issue.

yzh119 added a commit to flashinfer-ai/flashinfer that referenced this issue Jul 30, 2024
Related issue: sgl-project/sglang#771

This PR fixes the usage of the `FlagHeads` cub API in the sampling kernels.
As [documented](https://nvidia.github.io/cccl/cub/api/classcub_1_1BlockDiscontinuity.html),
the default FlagHeads API always flags the first element, which is not the
expected behavior when the first element should not be flagged:
> For thread0, item input[0] is always flagged.

This PR sets the `tile_predecessor_item` argument (to 0), which will be
compared against input[0].

CUDA 12+ does not have this issue because we use the new
`SubtractLeft` API instead of `FlagHeads`.
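
For readers unfamiliar with the cub API, here is a rough Python analogue (mine, not the kernel code) of the head-flagging semantics described above; `flag_heads` is a hypothetical helper used only for illustration:

```python
# Rough analogue of cub::BlockDiscontinuity head flagging: flag[i] marks
# positions where items[i] differs from its predecessor. Without a
# tile_predecessor_item, the first element is always flagged; with one,
# the first flag comes from an actual comparison, as in the fix above.
def flag_heads(items, tile_predecessor_item=None):
    flags = []
    for i, x in enumerate(items):
        if i == 0:
            if tile_predecessor_item is None:
                flags.append(True)  # default behavior: always flag input[0]
            else:
                flags.append(x != tile_predecessor_item)
        else:
            flags.append(x != items[i - 1])
    return flags

items = [0, 0, 1, 1]
print(flag_heads(items))                           # [True, False, True, False]
print(flag_heads(items, tile_predecessor_item=0))  # [False, False, True, False]
```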
@yzh119
Collaborator

yzh119 commented Jul 30, 2024

Fixed in flashinfer-ai/flashinfer#410

@zhyncs
Member

zhyncs commented Aug 1, 2024

Hi @StevenZHB, could you try the latest main branch? It should be fine now. Thanks.

git clone https://github.com/sgl-project/sglang.git
cd sglang

pip install --upgrade pip
pip install -e "python[all]"

pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.3/

@StevenZHB
Author

Thanks a lot! I tested the latest version, and everything is working perfectly now.
