I am running the 8B model described in Table 1 of the ZeRO paper on 8 GPUs.

I notice that the `contiguous_gradients` setting in the config seems to control whether `Reduce` or `AllReduce` is used for gradient reduction in the backward pass. That is, I see the following in the NCCL debug log only when `contiguous_gradients` is `true`:

I see this referenced in #264, but it wasn't clear why `contiguous_gradients` should control the communication pattern. As the answer in #264 noted, it should only "defragment the memory during backward propagation".
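For reference, the setting I'm toggling lives under `zero_optimization` in the DeepSpeed config JSON. A minimal sketch of the relevant fragment (the other values are placeholders from my run, not a recommendation):

```json
{
  "train_batch_size": 64,
  "zero_optimization": {
    "stage": 1,
    "contiguous_gradients": true
  }
}
```

Flipping `contiguous_gradients` between `true` and `false` here, with `NCCL_DEBUG=INFO` set, is how I observed the change in the collective used during the backward pass.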