[Bug] generate wrong sequences with higher temperature #771
Comments
Thanks for reporting this. I can reproduce the error with llama-3-8b. One possible reason is that we previously stored the logits in bfloat16, which can cause precision problems; this has been fixed by PR #773. Could you try the main branch again? After PR #773, temp=1 works great for me. However, I still get random output with temp=3. I guess this is expected because llama-3 has a vocab size of 128K, and temp=3 flattens the logits too much, making the softmax/sampling unstable.
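To illustrate the flattening effect, here is a rough sketch (not sglang's actual sampling code; the vocab size, logit scale, and boosted token are made up purely for illustration):

```python
import torch

torch.manual_seed(0)

# Toy logits over a llama-3-sized vocabulary; token 42 stands in for the
# "obviously correct" next token.
vocab_size = 128_256
logits = torch.randn(vocab_size)
logits[42] += 12.0

for temp in (0.2, 1.0, 3.0):
    probs = torch.softmax(logits / temp, dim=-1)
    print(f"temp={temp}: p(correct token) = {probs[42].item():.4f}")

# Precision also matters: bfloat16 keeps only a 7-bit mantissa, so the tiny
# per-token probabilities over a 128K vocab lose resolution, which is why the
# logits were moved to float32 (PR #773).
p_bf16 = torch.softmax((logits / 3.0).to(torch.bfloat16).float(), dim=-1)
p_fp32 = torch.softmax(logits / 3.0, dim=-1)
print("max prob diff (bf16 vs fp32 logits):", (p_bf16 - p_fp32).abs().max().item())
```

At temp=3 the probability mass spreads across the whole vocabulary, so sampling garbage tokens becomes likely even when the logits themselves are correct.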
Hi @merrymercy, the issue still seems to be there when I use the flashinfer sampling kernel, but it's fine when I switch to the fallback torch implementation.
env:
cc: @yzh119
This should be a bug in the flashinfer sampling kernel; thanks for reporting.
@StevenZHB Please try the main branch first to see if putting the logits in float32 resolves your issue (#773).
@merrymercy Thanks, I have tried the main branch. It still generates wrong outputs, but it can be resolved by setting `--disable-flashinfer-sampling`, so maybe it's a bug in flashinfer.
@yzh119 This seems like a critical bug. Could you prioritize a fix?
Yes, I'll release v0.1.2 tonight.
I just noticed that the user is using cu118; the bug was already fixed in flashinfer-ai/flashinfer#386.
@yzh119 I tried flashinfer 0.1.2 but still got the issue. This issue is different from flashinfer-ai/flashinfer#384; could you help check?
Sure, I can check. Could you help dump the logit scores so that I can debug on them directly?
Bug reproduced. I think I have figured out the cause; v0.1.3 will fix the issue.
Related issue: sgl-project/sglang#771. This PR fixes the usage of the `FlagHeads` cub API in the sampling kernels. As [documented](https://nvidia.github.io/cccl/cub/api/classcub_1_1BlockDiscontinuity.html), the default `FlagHeads` API always flags the first element, which is not expected when the first element is not `true`:

> For thread0, item input[0] is always flagged.

This PR sets the `tile_predecessor_item` argument (to 0), which is compared against input[0]. CUDA 12+ does not have this issue because we use the new `SubtractLeft` API instead of `FlagHeads`.
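To make the off-by-one concrete, here is a small pure-Python emulation of the documented `FlagHeads` semantics (an illustration only, not flashinfer's actual kernel code; the `greater_than_u` predicate is an assumed example of how a sampling kernel might use head flags):

```python
def flag_heads(items, flag_op, tile_predecessor_item=None):
    """Emulate cub::BlockDiscontinuity::FlagHeads for a single tile.

    flag_op(prev, cur) returns True when `cur` should be flagged as a head.
    Without a tile_predecessor_item, item 0 is unconditionally flagged,
    matching the documented default behavior.
    """
    flags = []
    for i, cur in enumerate(items):
        if i == 0:
            if tile_predecessor_item is None:
                flags.append(True)  # "For thread0, item input[0] is always flagged."
            else:
                flags.append(flag_op(tile_predecessor_item, cur))
        else:
            flags.append(flag_op(items[i - 1], cur))
    return flags


# Flag positions where the predicate flips from False to True, e.g. where a
# running prefix sum first exceeds the sampled threshold.
greater_than_u = [False, False, True, True, True]
rising_edge = lambda prev, cur: cur and not prev

print(flag_heads(greater_than_u, rising_edge))         # [True, False, True, False, False] - index 0 wrongly flagged
print(flag_heads(greater_than_u, rising_edge, False))  # [False, False, True, False, False] - fixed by the predecessor
```

If the kernel takes the first flagged position as the sampled index, a spurious flag at index 0 would return a wrong token, consistent with the garbled output reported above.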
Fixed in flashinfer-ai/flashinfer#410
Hi @StevenZHB, could you try the latest main branch? It should be OK now. Thanks.
Thanks a lot! I tested the latest version, and everything is working perfectly now. |
Checklist
Describe the bug
I started a vllm server and an sglang server with the same model. I found that the sglang server outputs unreadable tokens at high temperature, while the vllm server does not.
Example:
sampling_params = {"temperature":1,"n":1}
sglang server: 'Expossible! description! description!爽 ipairs soc!!爽爽爽爽'
vllm server: "Let's start by using the given information to set up three equations:\n1."
sampling_params = {"temperature":0.2,"n":1}
sglang server: "Let's start by using the given information to set up three equations:\n1."
vllm server: "Let's start by using the given information to set up three equations:\n1."
Maybe it's related to #523; I don't know how to fix it.
Reproduction
CUDA_VISIBLE_DEVICES=0 nohup python3 -m sglang.launch_server --model-path llama3-8B-instruct --port 9554 --disable-cuda-graph --mem-fraction-static 0.75 --max-prefill-tokens 12800
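A minimal client call against that server might look like the following (a sketch only: it assumes sglang's native `/generate` endpoint and payload shape, and the prompt is a placeholder since the original one is not included in the report):

```python
import requests

# Placeholder prompt; the original math prompt from the report is not shown.
prompt = "Please solve the problem step by step."

for temperature in (0.2, 1.0):
    resp = requests.post(
        "http://localhost:9554/generate",
        json={
            "text": prompt,
            "sampling_params": {"temperature": temperature, "max_new_tokens": 64},
        },
    )
    print(f"temperature={temperature}: {resp.json()['text']!r}")
```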
Environment