From d6410f9051b23359930b548de2067543a30e808d Mon Sep 17 00:00:00 2001
From: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Date: Mon, 25 Nov 2024 13:19:27 -0500
Subject: [PATCH] Fix Doc Error: ZeRO Stage 2 gradient partitioning (#6775)

Fix the issue described in https://github.com/microsoft/DeepSpeed/issues/6707
---
 docs/_tutorials/zero.md         | 2 +-
 docs/code-docs/source/zero3.rst | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/_tutorials/zero.md b/docs/_tutorials/zero.md
index c84339ece9e5..0bb95cbcddd8 100644
--- a/docs/_tutorials/zero.md
+++ b/docs/_tutorials/zero.md
@@ -13,7 +13,7 @@ ZeRO leverages the aggregate computation and memory resources of data parallelis

 * **Stage 1**: The optimizer states (e.g., for [Adam optimizer](https://arxiv.org/abs/1412.6980), 32-bit weights, and the first, and second moment estimates) are partitioned across the processes, so that each process updates only its partition.

-* **Stage 2**: The reduced 32-bit gradients for updating the model weights are also partitioned such that each process retains only the gradients corresponding to its portion of the optimizer states.
+* **Stage 2**: The reduced 16-bit gradients for updating the model weights are also partitioned such that each process retains only the gradients corresponding to its portion of the optimizer states.

 * **Stage 3**: The 16-bit model parameters are partitioned across the processes. ZeRO-3 will automatically collect and partition them during the forward and backward passes.

diff --git a/docs/code-docs/source/zero3.rst b/docs/code-docs/source/zero3.rst
index aa8139a654a1..a24313cadb7a 100644
--- a/docs/code-docs/source/zero3.rst
+++ b/docs/code-docs/source/zero3.rst
@@ -10,7 +10,7 @@ communication efficiency.

 #. **ZeRO Stage 1**: The optimizer states (e.g., for `Adam optimizer `_, 32-bit weights, and the first, and second moment estimates) are partitioned across the processes, so that each process updates only its partition.

-#. **ZeRO Stage 2**: The reduced 32-bit gradients for updating the model weights are also partitioned such that each process retains only the gradients corresponding to its portion of the optimizer states.
+#. **ZeRO Stage 2**: The reduced 16-bit gradients for updating the model weights are also partitioned such that each process retains only the gradients corresponding to its portion of the optimizer states.

 #. **ZeRO Stage 3**: The 16-bit model parameters are partitioned across the processes. ZeRO-3 will automatically collect and partition them during the forward and backward passes.
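
Reviewer note: the stage descriptions touched by this patch can be sanity-checked with a rough per-process memory estimate. The sketch below is illustrative only. It assumes mixed-precision Adam training with 2 bytes per parameter for the 16-bit weights, 2 bytes for the 16-bit gradients, and 12 bytes for the 32-bit optimizer states (master weights plus first and second moments). The helper name `zero_model_state_bytes` and its parameters are hypothetical and are not part of DeepSpeed.

```python
def zero_model_state_bytes(num_params: int, world_size: int, stage: int) -> float:
    """Estimate per-process model-state memory in bytes for a given ZeRO stage.

    Hypothetical helper for illustration; not a DeepSpeed API.
    Assumes mixed-precision Adam: 16-bit parameters and gradients,
    32-bit optimizer states (weights copy + two moment estimates).
    Activations, buffers, and fragmentation are ignored.
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    if world_size < 1:
        raise ValueError("world_size must be at least 1")

    params = 2.0 * num_params   # 16-bit model parameters
    grads = 2.0 * num_params    # 16-bit gradients (the width this patch corrects)
    optim = 12.0 * num_params   # 32-bit weights + first and second moments

    if stage >= 1:              # Stage 1 partitions the optimizer states
        optim /= world_size
    if stage >= 2:              # Stage 2 also partitions the reduced gradients
        grads /= world_size
    if stage >= 3:              # Stage 3 also partitions the parameters
        params /= world_size

    return params + grads + optim


if __name__ == "__main__":
    # Example values chosen for illustration only.
    for stage in range(4):
        gib = zero_model_state_bytes(7_500_000_000, 64, stage) / 2**30
        print(f"stage {stage}: {gib:.2f} GiB per process")
```

Under these assumptions the gradient term contributes 2 bytes per parameter, which is consistent with the 16-bit wording introduced above. A 32-bit gradient term would double that contribution.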