From 483a9679dd90e3272598aa0e2e35cec5889cf6f3 Mon Sep 17 00:00:00 2001
From: Lianmin Zheng
Date: Fri, 1 Nov 2024 18:08:50 -0700
Subject: [PATCH 1/3] Add FAQ

---
 docs/index.rst         |  1 +
 docs/references/faq.md | 18 ++++++++++++++++++
 2 files changed, 19 insertions(+)
 create mode 100644 docs/references/faq.md

diff --git a/docs/index.rst b/docs/index.rst
index b365f5701fb..c40cd169f67 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -47,4 +47,5 @@ The core features include:
    references/benchmark_and_profiling.md
    references/troubleshooting.md
    references/custom_chat_template.md
+   references/faq.md
    references/learn_more.md
diff --git a/docs/references/faq.md b/docs/references/faq.md
new file mode 100644
index 00000000000..dc82b5f71a7
--- /dev/null
+++ b/docs/references/faq.md
@@ -0,0 +1,18 @@
+Here’s the corrected version of your text in US English:
+# Frequently Asked Questions
+
+## The results are not deterministic even with temperature 0
+
+When you run decoding with a temperature of 0, obtaining the logprob of input tokens or output tokens, you might notice that the results returned by the engine are not deterministic.
+You may observe that when you send the same request twice, the results will be slightly different.
+
+From our early investigation, this indeterminism arises from two factors: dynamic batching and prefix caching.
+Roughly speaking, dynamic batching can account for 95% of the indeterminism, while prefix caching accounts for the remaining portion. The server runs dynamic batching under the hood. Different batch sizes can cause PyTorch/CuBLAS to dispatch to different CUDA kernels, which can lead to slight numerical differences. This difference accumulates throughout many layers and results in nondeterministic output when the batch size changes. Similarly, when prefix caching is turned on, it will also dispatch to different kernels.
+
+We are still investigating the root cause and possible solutions. In the short term, we might introduce a "deterministic mode" that uses more padding to address the variance from dynamic batching. This mode will be more deterministic but slower.
+
+On the other hand, if you add `--disable-radix-cache` and only send one request at a time, the results will be mostly deterministic.
+
+We have two issues to track our progress:
+- The deterministic mode is tracked at [https://github.com/sgl-project/sglang/issues/1729](https://github.com/sgl-project/sglang/issues/1729)
+- The per-request random seed is tracked at [https://github.com/sgl-project/sglang/issues/1335](https://github.com/sgl-project/sglang/issues/1335)

From b4a5280f9b351740e0c106418c834c601d5e3a2c Mon Sep 17 00:00:00 2001
From: Lianmin Zheng
Date: Fri, 1 Nov 2024 18:09:15 -0700
Subject: [PATCH 2/3] Fix

---
 docs/references/faq.md | 1 -
 1 file changed, 1 deletion(-)

diff --git a/docs/references/faq.md b/docs/references/faq.md
index dc82b5f71a7..2bdf21551f2 100644
--- a/docs/references/faq.md
+++ b/docs/references/faq.md
@@ -1,4 +1,3 @@
-Here’s the corrected version of your text in US English:
 # Frequently Asked Questions
 
 ## The results are not deterministic even with temperature 0

From 1d54121489fa6e6206f6273b93ecdf18312ced5b Mon Sep 17 00:00:00 2001
From: Lianmin Zheng
Date: Fri, 1 Nov 2024 18:15:30 -0700
Subject: [PATCH 3/3] Fix

---
 docs/references/faq.md | 16 +++++++---------
 1 file changed, 7 insertions(+), 9 deletions(-)

diff --git a/docs/references/faq.md b/docs/references/faq.md
index 2bdf21551f2..5a87ba3d87f 100644
--- a/docs/references/faq.md
+++ b/docs/references/faq.md
@@ -1,17 +1,15 @@
 # Frequently Asked Questions
 
-## The results are not deterministic even with temperature 0
+## The results are not deterministic, even with a temperature of 0
 
-When you run decoding with a temperature of 0, obtaining the logprob of input tokens or output tokens, you might notice that the results returned by the engine are not deterministic.
-You may observe that when you send the same request twice, the results will be slightly different.
+You may notice that when you send the same request twice, the results from the engine will be slightly different, even when the temperature is set to 0.
 
-From our early investigation, this indeterminism arises from two factors: dynamic batching and prefix caching.
-Roughly speaking, dynamic batching can account for 95% of the indeterminism, while prefix caching accounts for the remaining portion. The server runs dynamic batching under the hood. Different batch sizes can cause PyTorch/CuBLAS to dispatch to different CUDA kernels, which can lead to slight numerical differences. This difference accumulates throughout many layers and results in nondeterministic output when the batch size changes. Similarly, when prefix caching is turned on, it will also dispatch to different kernels.
+From our initial investigation, this indeterminism arises from two factors: dynamic batching and prefix caching. Roughly speaking, dynamic batching accounts for about 95% of the indeterminism, while prefix caching accounts for the remaining portion. The server runs dynamic batching under the hood. Different batch sizes can cause PyTorch/CuBLAS to dispatch to different CUDA kernels, which can lead to slight numerical differences. This difference accumulates across many layers, resulting in nondeterministic output when the batch size changes. Similarly, when prefix caching is enabled, it can also dispatch to different kernels. Even when the computations are mathematically equivalent, small numerical differences from different kernel implementations lead to the final nondeterministic outputs.
 
-We are still investigating the root cause and possible solutions. In the short term, we might introduce a "deterministic mode" that uses more padding to address the variance from dynamic batching. This mode will be more deterministic but slower.
+To achieve more deterministic outputs in the current code, you can add `--disable-radix-cache` and send only one request at a time. The results will be mostly deterministic under this setting.
 
-On the other hand, if you add `--disable-radix-cache` and only send one request at a time, the results will be mostly deterministic.
+We are still investigating the root causes and potential solutions. In the short term, we may introduce a "deterministic mode" that uses more padding to address the variance caused by dynamic batching. This mode will be more deterministic but slower.
 
 We have two issues to track our progress:
-- The deterministic mode is tracked at [https://github.com/sgl-project/sglang/issues/1729](https://github.com/sgl-project/sglang/issues/1729)
-- The per-request random seed is tracked at [https://github.com/sgl-project/sglang/issues/1335](https://github.com/sgl-project/sglang/issues/1335)
+- The deterministic mode is tracked at [https://github.com/sgl-project/sglang/issues/1729](https://github.com/sgl-project/sglang/issues/1729).
+- The per-request random seed is tracked at [https://github.com/sgl-project/sglang/issues/1335](https://github.com/sgl-project/sglang/issues/1335).
\ No newline at end of file
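A note on the mechanism the FAQ describes: the "slight numerical differences" between CUDA kernels ultimately come from floating-point arithmetic not being associative, so reducing the same values in a different order can change the last bits of the result. A minimal, CPU-only Python sketch of the effect:

```python
# Floating-point addition is not associative: the same mathematical sum
# can differ in its last bits depending on evaluation order. Different
# CUDA kernels (selected per batch size) reduce in different orders,
# which is the source of the tiny differences the FAQ describes.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

print(left_to_right == right_to_left)      # False
print(abs(left_to_right - right_to_left))  # on the order of 1e-16
```

In a transformer forward pass, such last-bit differences are amplified across many layers and can eventually flip an argmax at temperature 0, changing the decoded token.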
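The workaround in the FAQ (`--disable-radix-cache` plus one request at a time) can be checked with a short client sketch that sends an identical greedy request twice and compares the outputs. This is a sketch, not part of the patch: it assumes a locally running SGLang server on the default port 30000 exposing the native `/generate` endpoint, and `check_determinism` is a hypothetical helper name; adjust the URL and payload to your deployment.

```python
import json
import urllib.request

# Assumed local SGLang server; the /generate endpoint and port 30000
# are assumptions based on SGLang's defaults -- adjust as needed.
SERVER_URL = "http://localhost:30000/generate"


def build_payload(prompt: str, max_new_tokens: int = 64) -> dict:
    """Greedy-decoding request: with temperature 0, any variation between
    runs comes from the engine (batching / caching), not from sampling."""
    return {
        "text": prompt,
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": max_new_tokens,
        },
    }


def generate(prompt: str) -> str:
    """POST a single request and return the generated text."""
    data = json.dumps(build_payload(prompt)).encode()
    req = urllib.request.Request(
        SERVER_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["text"]


def check_determinism(prompt: str, runs: int = 2) -> bool:
    """Send the identical request `runs` times, one at a time."""
    outputs = [generate(prompt) for _ in range(runs)]
    return all(o == outputs[0] for o in outputs)


# Example (requires a running server):
#   python -m sglang.launch_server --model-path <model> --disable-radix-cache
#   check_determinism("The capital of France is")
```

With the radix cache disabled and requests sent serially, `check_determinism` should usually return `True`; with default settings under concurrent load, it often will not.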