diff --git a/README_GAUDI.md b/README_GAUDI.md
index 5109f7ddf9927..9e289658fd5c2 100644
--- a/README_GAUDI.md
+++ b/README_GAUDI.md
@@ -315,9 +315,9 @@ mark 90% of free device memory at that point as usable. Next, KV cache
 gets allocated, model is warmed up, and HPU Graphs are captured.
 Environment variable `VLLM_GRAPH_RESERVED_MEM` defines the ratio of
 memory reserved for HPU Graphs capture. With its default value
-(`VLLM_GRAPH_RESERVED_MEM=0.4`), 40% of usable memory will be reserved
+(`VLLM_GRAPH_RESERVED_MEM=0.1`), 10% of usable memory will be reserved
 for graph capture (later referred to as \"usable graph memory\"), and
-the remaining 60% will be utilized for KV cache. Environment variable
+the remaining 90% will be utilized for KV cache. Environment variable
 `VLLM_GRAPH_PROMPT_RATIO` determines the ratio of usable graph memory
 reserved for prefill and decode graphs. By default
 (`VLLM_GRAPH_PROMPT_RATIO=0.5`), both stages have equal memory
@@ -445,7 +445,7 @@ Environment variables
 - `VLLM_SKIP_WARMUP`: if `true`, warmup will be skipped, `false` by
   default
 - `VLLM_GRAPH_RESERVED_MEM`: percentage of memory dedicated for
-  HPUGraph capture, `0.4` by default
+  HPUGraph capture, `0.1` by default
 - `VLLM_GRAPH_PROMPT_RATIO`: percentage of reserved graph memory
   dedicated for prompt graphs, `0.5` by default
 - `VLLM_GRAPH_PROMPT_STRATEGY`: strategy determining order of prompt
diff --git a/docs/source/getting_started/gaudi-installation.rst b/docs/source/getting_started/gaudi-installation.rst
index 27bc0186675f1..5af81210c4159 100644
--- a/docs/source/getting_started/gaudi-installation.rst
+++ b/docs/source/getting_started/gaudi-installation.rst
@@ -243,7 +243,7 @@ Before KV cache gets allocated, model weights are loaded onto the device, and a
 Only after that, ``gpu_memory_utilization`` flag is utilized - at its default value, will mark 90% of free device memory at that point as usable.
 Next, KV cache gets allocated, model is warmed up, and HPU Graphs are captured.
 Environment variable ``VLLM_GRAPH_RESERVED_MEM`` defines the ratio of memory reserved for HPU Graphs capture.
-With its default value (``VLLM_GRAPH_RESERVED_MEM=0.4``), 40% of usable memory will be reserved for graph capture (later referred to as "usable graph memory"), and the remaining 60% will be utilized for KV cache.
+With its default value (``VLLM_GRAPH_RESERVED_MEM=0.1``), 10% of usable memory will be reserved for graph capture (later referred to as "usable graph memory"), and the remaining 90% will be utilized for KV cache.
 Environment variable ``VLLM_GRAPH_PROMPT_RATIO`` determines the ratio of usable graph memory reserved for prefill and decode graphs.
 By default (``VLLM_GRAPH_PROMPT_RATIO=0.5``), both stages have equal memory constraints.
 Lower value corresponds to less usable graph memory reserved for prefill stage, e.g. ``VLLM_GRAPH_PROMPT_RATIO=0.2`` will reserve 20% of usable graph memory for prefill graphs, and 80% of usable graph memory for decode graphs.
@@ -322,7 +322,7 @@ Environment variables
 **Performance tuning knobs:**

 - ``VLLM_SKIP_WARMUP``: if ``true``, warmup will be skipped, ``false`` by default
-- ``VLLM_GRAPH_RESERVED_MEM``: percentage of memory dedicated for HPUGraph capture, ``0.4`` by default
+- ``VLLM_GRAPH_RESERVED_MEM``: percentage of memory dedicated for HPUGraph capture, ``0.1`` by default
 - ``VLLM_GRAPH_PROMPT_RATIO``: percentage of reserved graph memory dedicated for prompt graphs, ``0.5`` by default
 - ``VLLM_GRAPH_PROMPT_STRATEGY``: strategy determining order of prompt graph capture, ``min_tokens`` or ``max_bs``, ``min_tokens`` by default
 - ``VLLM_GRAPH_DECODE_STRATEGY``: strategy determining order of decode graph capture, ``min_tokens`` or ``max_bs``, ``max_bs`` by default
diff --git a/vllm/worker/habana_worker.py b/vllm/worker/habana_worker.py
index 9d083915041fe..291a7fc0d4489 100644
--- a/vllm/worker/habana_worker.py
+++ b/vllm/worker/habana_worker.py
@@ -150,7 +150,7 @@ def determine_num_available_blocks(self) -> Tuple[int, int]:

         cache_block_size = self.get_cache_block_size_bytes()
         graph_reserved_mem = (float(
-            os.environ.get('VLLM_GRAPH_RESERVED_MEM', '0.4'))
+            os.environ.get('VLLM_GRAPH_RESERVED_MEM', '0.1'))
                               if not self.model_config.enforce_eager else 0)
         graph_headroom = 1 - graph_reserved_mem
         available_hpu_memory = free_hpu_memory * \
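For context, the arithmetic behind the new default can be sketched as follows. This is a minimal illustration of the split performed in `determine_num_available_blocks`; the `memory_split` helper and the example numbers are hypothetical and not part of this patch:

```python
# Hypothetical helper illustrating the memory split changed by this patch;
# it mirrors the logic in determine_num_available_blocks() for the
# non-enforce_eager case, but is not code from the diff.
import os


def memory_split(free_hpu_memory: float,
                 gpu_memory_utilization: float = 0.9) -> dict:
    # Ratio of usable memory set aside for HPU Graph capture
    # (0.1 is the new default introduced by this patch).
    graph_reserved_mem = float(
        os.environ.get('VLLM_GRAPH_RESERVED_MEM', '0.1'))
    graph_headroom = 1 - graph_reserved_mem
    # "Usable" memory after applying gpu_memory_utilization.
    usable = free_hpu_memory * gpu_memory_utilization
    # The headroom goes to the KV cache, the reservation to graph capture.
    kv_cache = usable * graph_headroom
    graph_mem = usable * graph_reserved_mem
    # VLLM_GRAPH_PROMPT_RATIO then splits usable graph memory between
    # prefill (prompt) and decode graphs; 0.5 means an even split.
    prompt_ratio = float(os.environ.get('VLLM_GRAPH_PROMPT_RATIO', '0.5'))
    return {
        'kv_cache': kv_cache,
        'prompt_graphs': graph_mem * prompt_ratio,
        'decode_graphs': graph_mem * (1 - prompt_ratio),
    }


if __name__ == '__main__':
    # Example: 100 GiB free -> 90 GiB usable; the new 0.1 default leaves
    # 81 GiB for KV cache and 9 GiB for graphs (4.5 GiB prefill / 4.5 GiB
    # decode), whereas the old 0.4 default left only 54 GiB for KV cache.
    print(memory_split(free_hpu_memory=100.0))
```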