
GPT support for BF16 grad reductions #5920

Merged
timmoon10 merged 6 commits into NVIDIA:main from dist-adam-bf16-grad-sync on Mar 17, 2023

Conversation

@timmoon10 (Collaborator) commented on Feb 4, 2023

What does this PR do?

Adds GPT support for BF16/FP16 gradient reductions, with embedding grad reductions in FP32.
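
For intuition, a minimal, hypothetical sketch of the idea follows: reduce most gradients across data-parallel ranks in BF16 (or FP16) while keeping flagged parameters such as embeddings in FP32. This is not the NeMo/Apex implementation; the _with_fp32_optimizer flag is borrowed from this PR's diff purely for illustration.

    import torch
    import torch.distributed as dist

    def reduce_gradients(params, grad_sync_dtype=torch.bfloat16):
        # Illustrative only: average gradients across ranks, using reduced-precision
        # buffers for most parameters and FP32 for parameters flagged for an FP32 optimizer.
        world_size = dist.get_world_size()
        for p in params:
            if p.grad is None:
                continue
            # Parameters such as embeddings keep full-precision grad reductions.
            dtype = torch.float32 if getattr(p, "_with_fp32_optimizer", False) else grad_sync_dtype
            buf = p.grad.to(dtype)
            dist.all_reduce(buf)             # sum across data-parallel ranks
            p.grad.copy_(buf / world_size)   # average and cast back to the grad's dtype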

Collection: NLP

Changelog

  • Wrapper for the distributed Adam optimizer separately optimizes any parameters that require explicit FP32 gradients, e.g. embeddings (a conceptual sketch follows below).
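
A conceptual sketch of this wrapper behavior (not the actual NeMo class; only the _with_fp32_optimizer flag comes from this PR's diff, everything else is illustrative):

    import torch

    def split_optimizers(model, main_optimizer_cls, **main_optimizer_kwargs):
        # Route parameters that need explicit FP32 gradients (e.g. embeddings)
        # to a separate FP32 Adam; everything else goes to the main distributed optimizer.
        fp32_params, main_params = [], []
        for param in model.parameters():
            if getattr(param, "_with_fp32_optimizer", False):
                fp32_params.append(param)
            else:
                main_params.append(param)

        main_optim = main_optimizer_cls(main_params, **main_optimizer_kwargs)
        fp32_optim = torch.optim.Adam(fp32_params) if fp32_params else None
        return main_optim, fp32_optim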

Usage

Set the optimizer to distributed_fused_adam in the config file, and configure it with grad_sync_dtype: bf16.
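
A minimal example of what the config section might look like (the model.optim layout is assumed from typical NeMo GPT configs; only the keys mentioned in this PR are shown):

    model:
      optim:
        name: distributed_fused_adam   # use the distributed Adam optimizer
        grad_sync_dtype: bf16          # reduce gradients in BF16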

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (e.g., Numba, Pynini, Apex)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs in various areas.

Additional Information

The github-actions bot added the core (Changes to NeMo Core) and NLP labels on Feb 4, 2023
@timmoon10 (Collaborator, Author) commented on Feb 4, 2023

This initially ran into errors when megatron_amp_O2=True; that has since been addressed by NVIDIA/apex#1575.

timmoon10 force-pushed the dist-adam-bf16-grad-sync branch 2 times, most recently from bd9176d to ee8500b on February 4, 2023 00:46
The github-actions bot added the CI label on Feb 4, 2023
timmoon10 force-pushed the dist-adam-bf16-grad-sync branch from 24ef1f6 to 17cc4ae on February 4, 2023 01:59
timmoon10 force-pushed the dist-adam-bf16-grad-sync branch from 5476cf1 to 59429ce on February 9, 2023 02:05
Signed-off-by: Tim Moon <[email protected]>
nemo/core/optim/distributed_adam.py — Fixed
nemo/core/optim/distributed_adam.py — Fixed
Signed-off-by: Tim Moon <[email protected]>
@github-actions bot commented:

This PR is stale because it has been open for 14 days with no activity. Remove the stale label, comment on the PR, or update it, or it will be closed in 7 days.

The github-actions bot added the stale label on Mar 14, 2023
timmoon10 removed the stale label on Mar 15, 2023
@ericharper (Collaborator) left a comment:

LGTM. Thanks!

timmoon10 merged commit e5362d2 into NVIDIA:main on Mar 17, 2023

Review comment on a diff hunk:

    # Compute norm of local gradients for explicit FP32 optimizer
    if self._fp32_optim is not None:
        _fp32_optim_grad_sync()

A Collaborator commented:

self._fp32_optim_grad_sync()?

Comment on lines +75 to +76:

    if getattr(param, '_with_fp32_optimizer', False):
        main_param = param.detach().clone().float()

A Collaborator commented:

param -> model_param?

titu1994 pushed a commit to titu1994/NeMo that referenced this pull request Mar 24, 2023
* Add support for BF16 grad reductions with distopt

Signed-off-by: Tim Moon <[email protected]>

* Fix style issues

Signed-off-by: Tim Moon <[email protected]>

* Fix style issues

Signed-off-by: Tim Moon <[email protected]>

* Update Apex commit

Signed-off-by: Tim Moon <[email protected]>

---------

Signed-off-by: Tim Moon <[email protected]>
timmoon10 added a commit to timmoon10/NeMo that referenced this pull request Mar 31, 2023
* Add support for BF16 grad reductions with distopt

Signed-off-by: Tim Moon <[email protected]>

* Fix style issues

Signed-off-by: Tim Moon <[email protected]>

* Fix style issues

Signed-off-by: Tim Moon <[email protected]>

* Update Apex commit

Signed-off-by: Tim Moon <[email protected]>

---------

Signed-off-by: Tim Moon <[email protected]>
ericharper pushed a commit that referenced this pull request Apr 3, 2023
* GPT support for BF16 grad reductions (#5920)

* Add support for BF16 grad reductions with distopt

Signed-off-by: Tim Moon <[email protected]>

* Fix style issues

Signed-off-by: Tim Moon <[email protected]>

* Fix style issues

Signed-off-by: Tim Moon <[email protected]>

* Update Apex commit

Signed-off-by: Tim Moon <[email protected]>

---------

Signed-off-by: Tim Moon <[email protected]>

* Add custom functions to launch distopt communication in interleaved pipeline parallelism (#6183)

Signed-off-by: Tim Moon <[email protected]>

* Bugfix for BF16 grad reductions with distopt (#6340)

* Debug distopt support for BF16 grad reductions

Signed-off-by: Tim Moon <[email protected]>

* Dump and load FP32 main params

Signed-off-by: Tim Moon <[email protected]>

* Style tweaks

Signed-off-by: Tim Moon <[email protected]>

---------

Signed-off-by: Tim Moon <[email protected]>
Co-authored-by: Mikołaj Błaż <[email protected]>

---------

Signed-off-by: Tim Moon <[email protected]>
Co-authored-by: Mikołaj Błaż <[email protected]>
hsiehjackson pushed a commit to hsiehjackson/NeMo that referenced this pull request Jun 2, 2023
* Add support for BF16 grad reductions with distopt

Signed-off-by: Tim Moon <[email protected]>

* Fix style issues

Signed-off-by: Tim Moon <[email protected]>

* Fix style issues

Signed-off-by: Tim Moon <[email protected]>

* Update Apex commit

Signed-off-by: Tim Moon <[email protected]>

---------

Signed-off-by: Tim Moon <[email protected]>
Signed-off-by: hsiehjackson <[email protected]>
timmoon10 deleted the dist-adam-bf16-grad-sync branch on September 1, 2023 19:10