[Bug] generate wrong sequences with higher temperature #771

Closed

StevenZHB opened this issue Jul 27, 2024 · 14 comments · Fixed by #803 or #850
Labels: bug Something isn't working

@StevenZHB

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.

Describe the bug

I started a vllm server and a sglang server with the same model. I found that the sglang server outputs unreadable tokens at high temperature, while the vllm server does not.
Example:
sampling_params = {"temperature":1,"n":1}
sglang server: 'Expossible! description! description!爽 ipairs soc!!爽爽爽爽'
vllm server: "Let's start by using the given information to set up three equations:\n1."

sampling_params = {"temperature":0.2,"n":1}
sglang server: "Let's start by using the given information to set up three equations:\n1."
vllm server: "Let's start by using the given information to set up three equations:\n1."

Maybe it's related to #523; I don't know how to fix it.

Reproduction

CUDA_VISIBLE_DEVICES=0 nohup python3 -m sglang.launch_server --model-path llama3-8B-instruct --port 9554 --disable-cuda-graph --mem-fraction-static 0.75 --max-prefill-tokens 12800
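
For reference, a minimal client sketch (mine, not from the original report) that exercises the server launched above, assuming the OpenAI-compatible /v1/completions endpoint on port 9554 from that launch command; the model name and prompt are placeholders:

```python
# Hypothetical reproduction client (not from the issue): query the sglang
# server launched above and print the completion. With temperature=1 the
# output came back garbled, while temperature=0.2 looked normal.
import requests

resp = requests.post(
    "http://127.0.0.1:9554/v1/completions",
    json={
        "model": "llama3-8B-instruct",   # placeholder model name
        "prompt": "Set up equations for the word problem and solve it.",
        "max_tokens": 100,
        "temperature": 1,                # try 0.2 to compare
        "n": 1,
    },
    timeout=60,
)
print(resp.json()["choices"][0]["text"])
```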

Environment

python3 -m sglang.check_env
Python: 3.9.17 (main, Jul  5 2023, 20:41:20) [GCC 11.2.0]
CUDA available: True
GPU 0,1: NVIDIA A800-SXM4-80GB
CUDA_HOME: /zhanghongbo/CUDA/cuda-11_8
NVCC: Cuda compilation tools, release 11.8, V11.8.89
CUDA Driver Version: 525.105.17
PyTorch: 2.3.0+cu118
flashinfer: 0.1.1+cu118torch2.3
requests: 2.31.0
tqdm: 4.66.1
numpy: 1.25.0
aiohttp: 3.8.5
fastapi: 0.110.0
hf_transfer: Module Not Found
huggingface_hub: 0.23.2
interegular: 0.3.3
packaging: 24.0
pillow: Module Not Found
psutil: 5.9.8
pydantic: 2.5.0
uvicorn: 0.23.2
uvloop: 0.19.0
zmq: 25.1.2
vllm: 0.5.3.post1
openai: 1.30.0
anthropic: Module Not Found
litellm: Module Not Found
NVIDIA Topology: 
        GPU0    GPU1    NIC0    CPU Affinity    NUMA Affinity
GPU0     X      NV8     SYS     48-63   3
GPU1    NV8      X      SYS     48-63   3
NIC0    SYS     SYS      X 

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0


ulimit soft: 1048576
@merrymercy added the bug (Something isn't working) label on Jul 27, 2024
@merrymercy
Contributor

Thanks for reporting this. I can reproduce the error with llama-3-8b.

One possible reason is that we previously stored the logits in bfloat16. This can cause some problems, but it has been fixed by PR #773. Could you try the main branch again?

After PR #773, temp=1 works great for me. However, I still get random output with temp=3. I guess this is expected: llama-3 has a vocab size of 128K, and temp=3 flattens the logits so much that softmax/sampling becomes unstable.
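
For intuition, here is a toy sketch (mine, not from this thread) of how temperature scaling flattens the sampling distribution over a Llama-3-sized vocabulary; the logits are random placeholders rather than real model outputs:

```python
# Toy illustration: higher temperature flattens softmax(logits / T) over a
# 128K-token vocabulary, so the top token's probability collapses and the
# entropy grows, which makes sampled tokens look increasingly random.
import torch

vocab_size = 128_256                    # Llama-3 vocabulary size
torch.manual_seed(0)
logits = torch.randn(vocab_size) * 5.0  # synthetic logits, not model outputs

for temp in (0.2, 1.0, 3.0):
    probs = torch.softmax(logits / temp, dim=-1)
    entropy = torch.special.entr(probs).sum().item()  # treats 0*log(0) as 0
    print(f"temperature={temp}: top-1 prob={probs.max().item():.4f}, "
          f"entropy={entropy:.2f} nats")
```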

@ispobock
Collaborator

Hi @merrymercy, the issue still seems to be there when I use the flashinfer sampling kernel, but it's fine when I switch to the fallback torch implementation.

python3 -m sglang.launch_server --model-path /workdir/llm_models/Meta-Llama-3-8B --port 30000 --trust-remote-code

curl -X POST http://0.0.0.0:30000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "/workdir/llm_models/Meta-Llama-3-8B",
"prompt": "Please introduce yourself",
"max_tokens": 100,
"stream": false,
"temperature": 1
}'

env:

cuda 11.8
flashinfer: 0.1.1+cu118torch2.3 & main branch
torch: 2.3.1+cu118
sglang: main branch

cc: @yzh119

@yzh119
Collaborator

yzh119 commented Jul 28, 2024

This should be a bug in the flashinfer sampling kernel, thanks for reporting.

@merrymercy
Contributor

merrymercy commented Jul 28, 2024

@StevenZHB Please try the main branch first to see if casting the logits to float32 resolves your issue (#773).
If you still see wrong outputs, please try adding --disable-flashinfer-sampling when you launch the server (#778).

@StevenZHB
Author

StevenZHB commented Jul 28, 2024

@merrymercy Thanks, I have tried the main branch. It still generates wrong outputs, but the problem is resolved by setting --disable-flashinfer-sampling, so it's probably a bug in flashinfer.

@merrymercy
Contributor

merrymercy commented Jul 28, 2024

@yzh119 This seems like a critical bug. Could you prioritize a fix?

@yzh119
Collaborator

yzh119 commented Jul 28, 2024

Yes, I'll release v0.1.2 tonight.

@yzh119
Collaborator

yzh119 commented Jul 29, 2024

I just noticed that the user is using cu118; this bug was already fixed in flashinfer-ai/flashinfer#386.

@ispobock
Collaborator

@yzh119 I tried flashinfer 0.1.2 but still hit the issue. It seems different from flashinfer-ai/flashinfer#384; could you help check?
@StevenZHB could you verify flashinfer 0.1.2 in your environment?

@yzh119
Collaborator

yzh119 commented Jul 29, 2024

Sure, I can check. Could you dump the logits so that I can debug on them directly?

@yzh119
Collaborator

yzh119 commented Jul 30, 2024

Bug reproduced. I think I have figured it out; v0.1.3 will fix the issue.

yzh119 added a commit to flashinfer-ai/flashinfer that referenced this issue Jul 30, 2024
Related issue: sgl-project/sglang#771

This PR fixes the usage of the `FlagHeads` cub API in the sampling kernels.
As [documented](https://nvidia.github.io/cccl/cub/api/classcub_1_1BlockDiscontinuity.html),
the default FlagHeads API always flags the first element, which is not the
expected behavior when the first element should not be flagged:
> For thread0, item input[0] is always flagged.

This PR sets the `tile_predecessor_item` argument (to 0), which will be
compared against input[0].

CUDA 12+ does not have this issue because we use the new
`SubtractLeft` API instead of `FlagHeads`.
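
For readers unfamiliar with the cub API, here is a rough Python analogue (mine, not the kernel code) of the head-flagging semantics described above; `flag_heads` is a hypothetical helper used only for illustration:

```python
# Rough analogue of cub::BlockDiscontinuity head flagging: flag[i] marks
# positions where items[i] differs from its predecessor. Without a
# tile_predecessor_item, the first element is always flagged; with one,
# the first flag comes from an actual comparison, as in the fix above.
def flag_heads(items, tile_predecessor_item=None):
    flags = []
    for i, x in enumerate(items):
        if i == 0:
            if tile_predecessor_item is None:
                flags.append(True)  # default behavior: always flag input[0]
            else:
                flags.append(x != tile_predecessor_item)
        else:
            flags.append(x != items[i - 1])
    return flags

items = [0, 0, 1, 1]
print(flag_heads(items))                           # [True, False, True, False]
print(flag_heads(items, tile_predecessor_item=0))  # [False, False, True, False]
```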
@yzh119
Collaborator

yzh119 commented Jul 30, 2024

Fixed in flashinfer-ai/flashinfer#410

@zhyncs
Member

zhyncs commented Aug 1, 2024

Hi @StevenZHB, could you try the latest main branch? It should be fine now. Thanks.

git clone https://github.com/sgl-project/sglang.git
cd sglang

pip install --upgrade pip
pip install -e "python[all]"

pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.3/

@StevenZHB
Author

Thanks a lot! I tested the latest version, and everything is working perfectly now.
