Zero stage 2: contiguous_gradients and "Reduce" #622

Open
anijmlt opened this issue Dec 28, 2020 · 0 comments
anijmlt commented Dec 28, 2020

I am running the 8B model described in Table 1 of the ZeRO paper on 8 GPUs.
I notice that the "contiguous_gradients" setting in the config seems to control whether "Reduce" or "Allreduce" is used for gradient reduction in the backward pass. That is, I see the following in the NCCL debug log only when "contiguous_gradients" is "true":
[screenshot: NCCL debug log output]

I see this being referenced in #264, but it wasn't clear why "contiguous_gradients" should control the communication pattern, since the answer in #264 says it should only "defragment the memory during backward propagation".
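
For context, a minimal sketch of the kind of ZeRO stage 2 setup being described, assuming the newer `deepspeed.initialize` API that accepts a config dict; the model, batch size, and fp16 settings below are placeholders rather than values taken from this issue:

```python
# Minimal sketch of a ZeRO stage 2 setup where "contiguous_gradients" is the
# setting being toggled. Model, batch size, and fp16 options are placeholders,
# not values reported in this issue.
import torch
import deepspeed

ds_config = {
    "train_batch_size": 8,           # placeholder
    "fp16": {"enabled": True},       # placeholder
    "zero_optimization": {
        "stage": 2,
        # Flipping this flag is what appears to switch the gradient reduction
        # between Reduce and Allreduce in the NCCL debug log.
        "contiguous_gradients": True,
    },
}

model = torch.nn.Linear(1024, 1024)  # placeholder model

# deepspeed.initialize returns (engine, optimizer, dataloader, lr_scheduler).
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```

The NCCL calls mentioned above show up when the job is launched with `NCCL_DEBUG=INFO` set in the environment, for example `NCCL_DEBUG=INFO deepspeed --num_gpus=8 train.py` (where `train.py` is a hypothetical training script).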
