
GPT support for BF16 grad reductions #5920

Merged
timmoon10 merged 6 commits into NVIDIA:main from dist-adam-bf16-grad-sync on Mar 17, 2023

Conversation

@timmoon10 (Collaborator) commented on Feb 4, 2023

What does this PR do?

Adds GPT support for BF16/FP16 gradient reductions, with embedding grad reductions in FP32.
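
For intuition, a minimal, hypothetical sketch of the idea follows: reduce most gradients across data-parallel ranks in BF16 (or FP16) while keeping flagged parameters such as embeddings in FP32. This is not the NeMo/Apex implementation; the _with_fp32_optimizer flag is borrowed from this PR's diff purely for illustration.

    import torch
    import torch.distributed as dist

    def reduce_gradients(params, grad_sync_dtype=torch.bfloat16):
        # Illustrative only: average gradients across ranks, using reduced-precision
        # buffers for most parameters and FP32 for parameters flagged for an FP32 optimizer.
        world_size = dist.get_world_size()
        for p in params:
            if p.grad is None:
                continue
            # Parameters such as embeddings keep full-precision grad reductions.
            dtype = torch.float32 if getattr(p, "_with_fp32_optimizer", False) else grad_sync_dtype
            buf = p.grad.to(dtype)
            dist.all_reduce(buf)             # sum across data-parallel ranks
            p.grad.copy_(buf / world_size)   # average and cast back to the grad's dtype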

Collection: NLP

Changelog

  • Wrapper for the distributed Adam optimizer separately optimizes any parameters that require explicit FP32 gradients, e.g. embeddings (a conceptual sketch follows below).
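
A conceptual sketch of this wrapper behavior (not the actual NeMo class; only the _with_fp32_optimizer flag comes from this PR's diff, everything else is illustrative):

    import torch

    def split_optimizers(model, main_optimizer_cls, **main_optimizer_kwargs):
        # Route parameters that need explicit FP32 gradients (e.g. embeddings)
        # to a separate FP32 Adam; everything else goes to the main distributed optimizer.
        fp32_params, main_params = [], []
        for param in model.parameters():
            if getattr(param, "_with_fp32_optimizer", False):
                fp32_params.append(param)
            else:
                main_params.append(param)

        main_optim = main_optimizer_cls(main_params, **main_optimizer_kwargs)
        fp32_optim = torch.optim.Adam(fp32_params) if fp32_params else None
        return main_optim, fp32_optim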

Usage

Set the optimizer to distributed_fused_adam in the config file, and configure it with grad_sync_dtype: bf16.
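
A minimal example of what the config section might look like (the model.optim layout is assumed from typical NeMo GPT configs; only the keys mentioned in this PR are shown):

    model:
      optim:
        name: distributed_fused_adam   # use the distributed Adam optimizer
        grad_sync_dtype: bf16          # reduce gradients in BF16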

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (e.g., Numba, Pynini, Apex)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs in various areas.

Additional Information

The github-actions bot added the core (Changes to NeMo Core) and NLP labels on Feb 4, 2023
@timmoon10 (Collaborator, Author) commented on Feb 4, 2023

This initially ran into errors when megatron_amp_O2=True; that has since been addressed by NVIDIA/apex#1575.

timmoon10 force-pushed the dist-adam-bf16-grad-sync branch 2 times, most recently from bd9176d to ee8500b on February 4, 2023 00:46
The github-actions bot added the CI label on Feb 4, 2023
timmoon10 force-pushed the dist-adam-bf16-grad-sync branch from 24ef1f6 to 17cc4ae on February 4, 2023 01:59
timmoon10 force-pushed the dist-adam-bf16-grad-sync branch from 5476cf1 to 59429ce on February 9, 2023 02:05
Signed-off-by: Tim Moon <[email protected]>
nemo/core/optim/distributed_adam.py — Fixed
nemo/core/optim/distributed_adam.py — Fixed
Signed-off-by: Tim Moon <[email protected]>
@github-actions bot commented:

This PR is stale because it has been open for 14 days with no activity. Remove the stale label, comment on the PR, or update it, or it will be closed in 7 days.

The github-actions bot added the stale label on Mar 14, 2023
timmoon10 removed the stale label on Mar 15, 2023
@ericharper (Collaborator) left a comment:

LGTM. Thanks!

timmoon10 merged commit e5362d2 into NVIDIA:main on Mar 17, 2023

Review comment on a diff hunk:

    # Compute norm of local gradients for explicit FP32 optimizer
    if self._fp32_optim is not None:
        _fp32_optim_grad_sync()

A Collaborator commented:

self._fp32_optim_grad_sync()?

Comment on lines +75 to +76:

    if getattr(param, '_with_fp32_optimizer', False):
        main_param = param.detach().clone().float()

A Collaborator commented:

param -> model_param?

titu1994 pushed a commit to titu1994/NeMo that referenced this pull request Mar 24, 2023
* Add support for BF16 grad reductions with distopt

Signed-off-by: Tim Moon <[email protected]>

* Fix style issues

Signed-off-by: Tim Moon <[email protected]>

* Fix style issues

Signed-off-by: Tim Moon <[email protected]>

* Update Apex commit

Signed-off-by: Tim Moon <[email protected]>

---------

Signed-off-by: Tim Moon <[email protected]>
timmoon10 added a commit to timmoon10/NeMo that referenced this pull request Mar 31, 2023
* Add support for BF16 grad reductions with distopt

Signed-off-by: Tim Moon <[email protected]>

* Fix style issues

Signed-off-by: Tim Moon <[email protected]>

* Fix style issues

Signed-off-by: Tim Moon <[email protected]>

* Update Apex commit

Signed-off-by: Tim Moon <[email protected]>

---------

Signed-off-by: Tim Moon <[email protected]>
ericharper pushed a commit that referenced this pull request Apr 3, 2023
* GPT support for BF16 grad reductions (#5920)

* Add support for BF16 grad reductions with distopt

Signed-off-by: Tim Moon <[email protected]>

* Fix style issues

Signed-off-by: Tim Moon <[email protected]>

* Fix style issues

Signed-off-by: Tim Moon <[email protected]>

* Update Apex commit

Signed-off-by: Tim Moon <[email protected]>

---------

Signed-off-by: Tim Moon <[email protected]>

* Add custom functions to launch distopt communication in interleaved pipeline parallelism (#6183)

Signed-off-by: Tim Moon <[email protected]>

* Bugfix for BF16 grad reductions with distopt (#6340)

* Debug distopt support for BF16 grad reductions

Signed-off-by: Tim Moon <[email protected]>

* Dump and load FP32 main params

Signed-off-by: Tim Moon <[email protected]>

* Style tweaks

Signed-off-by: Tim Moon <[email protected]>

---------

Signed-off-by: Tim Moon <[email protected]>
Co-authored-by: Mikołaj Błaż <[email protected]>

---------

Signed-off-by: Tim Moon <[email protected]>
Co-authored-by: Mikołaj Błaż <[email protected]>
hsiehjackson pushed a commit to hsiehjackson/NeMo that referenced this pull request Jun 2, 2023
* Add support for BF16 grad reductions with distopt

Signed-off-by: Tim Moon <[email protected]>

* Fix style issues

Signed-off-by: Tim Moon <[email protected]>

* Fix style issues

Signed-off-by: Tim Moon <[email protected]>

* Update Apex commit

Signed-off-by: Tim Moon <[email protected]>

---------

Signed-off-by: Tim Moon <[email protected]>
Signed-off-by: hsiehjackson <[email protected]>
timmoon10 deleted the dist-adam-bf16-grad-sync branch on September 1, 2023 19:10