
Support scaled optimizer state in distributed Adam optimizer #1771

Merged (15 commits) on Feb 8, 2024

Conversation

timmoon10 (Contributor)

This PR adds basic support for scaled optimizer state, as discussed in the MS-AMP paper. The idea is that per-tensor scaling factors combined with FP16/FP8 optimizer state result in lower memory usage than FP32 optimizer state, with no degradation in convergence. This implementation is not quite the same as the MS-AMP FP8 optimizer: it only uses FP16 optimizer state, and it uses per-parameter-fragment scaling factors rather than per-parameter ones. It is a preliminary implementation, and its performance could be improved with custom kernels (e.g. a kernel to compute scaling factors, or a fused kernel combining the FP16-FP32 casts with the Adam step).
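As a rough illustration of the idea (not the PR's actual code), the sketch below stores each optimizer-state fragment in FP16 together with a per-fragment FP32 scaling factor, unscales to FP32 for the Adam update, then recomputes the scale and casts back. All function names are hypothetical, and bias correction is omitted for brevity:

```python
import torch

FP16_MAX = torch.finfo(torch.float16).max

def compute_scale(frag: torch.Tensor) -> torch.Tensor:
    """Per-fragment scale so the largest magnitude maps near the top of the FP16 range."""
    amax = frag.abs().max()
    return torch.where(amax > 0, amax / FP16_MAX, torch.ones_like(amax))

def scale_to_fp16(frag: torch.Tensor):
    """Store an FP32 state fragment as (FP16 tensor, FP32 scale)."""
    scale = compute_scale(frag)
    return (frag / scale).to(torch.float16), scale

def unscale_from_fp16(frag_fp16: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return frag_fp16.to(torch.float32) * scale

def adam_step_scaled(param, grad,
                     m_fp16, m_scale, v_fp16, v_scale,
                     lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One unfused Adam step on scaled FP16 state (no bias correction)."""
    # Unscale FP16 state to FP32 working buffers
    m = unscale_from_fp16(m_fp16, m_scale)
    v = unscale_from_fp16(v_fp16, v_scale)
    g = grad.float()
    # Standard Adam moment updates
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    param -= lr * m / (v.sqrt() + eps)
    # Recompute per-fragment scales and cast state back to FP16
    m_fp16, m_scale = scale_to_fp16(m)
    v_fp16, v_scale = scale_to_fp16(v)
    return param, m_fp16, m_scale, v_fp16, v_scale
```

A fused kernel, as the PR suggests, would combine the unscale, moment update, parameter update, and rescale into a single pass over the fragment instead of materializing the FP32 buffers.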

In the process of debugging, I've also made some other performance optimizations and bugfixes.
