
Enable sharded checkpoint save and load (support local, sharded, and full state dicts for FSDP) #1902

Merged: 52 commits into mosaicml:dev on Feb 17, 2023

Conversation

@eracah (Contributor) commented on Jan 23, 2023

What does this PR do?

This PR enables configuring which type of state dict to use for models and optimizers in State. The user sets fsdp_config['state_dict_type'] to one of three values:

  • 'full' (aka torch.distributed.fsdp.StateDictType.FULL_STATE_DICT) - the full unflattened, unsharded state dict, materialized only on rank 0 with CPU offloading
  • 'local' (aka torch.distributed.fsdp.StateDictType.LOCAL_STATE_DICT) - materializes just the local flattened shard of the state_dict on each rank
  • 'sharded' (aka torch.distributed.fsdp.StateDictType.SHARDED_STATE_DICT) - returns the unflattened, sharded state_dict

When a user specifies Trainer(model=my_model, optimizers=my_optimizer, fsdp_config={..., 'state_dict_type': 'local'}, ...), they enable local sharding for their model parameters and optimizer states.
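
For illustration, here is a hedged usage sketch of that configuration. It assumes an existing ComposerModel my_model, optimizer my_optimizer, and dataloader my_dataloader; the 'sharding_strategy' key is just one example of another fsdp_config entry, not a requirement.

```python
from composer import Trainer

# Illustrative only: my_model, my_optimizer, and my_dataloader are assumed to
# already exist in the surrounding script.
trainer = Trainer(
    model=my_model,
    optimizers=my_optimizer,
    train_dataloader=my_dataloader,
    max_duration='1ep',
    fsdp_config={
        'sharding_strategy': 'FULL_SHARD',
        'state_dict_type': 'local',  # or 'full' / 'sharded'
    },
)
trainer.fit()
```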

To enable this functionality, this PR:

  • Adds logic to State.state_dict() for model and optimizer parameters that supports the three FSDP state dict types (a minimal sketch follows this list)
  • Adds logic to State.load_state_dict() for the model and optimizer that supports the three state dict types
  • Modifies save_checkpoint to allow non-zero ranks to save checkpoints
  • Modifies load_checkpoint to allow non-zero ranks to load checkpoints
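
As a rough sketch (assumed, not the PR's exact implementation), the model-side mapping from fsdp_config['state_dict_type'] to PyTorch's FSDP state dict APIs could look like the following; the helper name get_fsdp_model_state_dict is hypothetical.

```python
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import FullStateDictConfig, StateDictType

def get_fsdp_model_state_dict(model, state_dict_type: str):
    """Return the model state dict under the requested FSDP state dict type."""
    if state_dict_type == 'full':
        # Unflattened, unsharded parameters; offload to CPU and materialize
        # only on rank 0 to keep memory usage manageable.
        config = FullStateDictConfig(offload_to_cpu=True, rank0_only=True)
        context = FSDP.state_dict_type(model, StateDictType.FULL_STATE_DICT, config)
    elif state_dict_type == 'local':
        # Each rank returns only its flattened local shard.
        context = FSDP.state_dict_type(model, StateDictType.LOCAL_STATE_DICT)
    elif state_dict_type == 'sharded':
        # Unflattened parameters, but still sharded across ranks.
        context = FSDP.state_dict_type(model, StateDictType.SHARDED_STATE_DICT)
    else:
        raise ValueError(f'Unknown state_dict_type: {state_dict_type}')
    with context:
        return model.state_dict()
```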

Todo:

  • Add documentation on how to use sharded checkpointing
  • Add unit tests

Tests

Ran a number of manual tests.

Also ran manual tests to make sure the proper error is raised if:

What issue(s) does this change relate to?

Fixes CO-1433
Fixes CO-1682

@eracah changed the title from "Enable local, sharded, and full state dicts for FSDP" to "Enable sharded checkpoint save and load (support local, sharded, and full state dicts for FSDP)" on Feb 4, 2023
@eracah marked this pull request as ready for review on February 4, 2023 01:52
Review thread on composer/core/state.py (resolved)
@dakinggg (Contributor) left a comment

Nice! Please add unit tests :) For the manual test, could you also test a resumption run? I.e., I'd like to see that training for 100 batches is the same as training for 50 batches, checkpointing, and then resuming and training for another 50. It would also be good to check with all three types.
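
For reference, a rough sketch of the kind of resumption check being requested is below. It is not the PR's actual test code: build_model() and build_dataloader() are assumed helpers, and the fsdp_config values, checkpoint filenames, and paths are illustrative.

```python
from composer import Trainer

FSDP_CONFIG = {'sharding_strategy': 'FULL_SHARD', 'state_dict_type': 'sharded'}

def run(duration, save_folder=None, load_path=None):
    # build_model() and build_dataloader() are hypothetical user helpers.
    trainer = Trainer(
        model=build_model(),
        train_dataloader=build_dataloader(),
        max_duration=duration,
        seed=17,
        fsdp_config=FSDP_CONFIG,
        save_folder=save_folder,
        save_filename='ba{batch}-rank{rank}.pt',
        save_interval='50ba',
        load_path=load_path,
    )
    trainer.fit()
    return trainer

# Baseline: 100 batches straight through.
baseline = run('100ba')

# Resumption: 50 batches with a checkpoint, then resume and finish the last 50.
run('50ba', save_folder='./ckpts')
resumed = run('100ba', load_path='./ckpts/ba50-rank{rank}.pt')

# Compare baseline and resumed model weights (e.g., after gathering a full
# state dict) to confirm the two runs end up identical.
```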

Review threads on composer/trainer/dist_strategy.py, composer/core/state.py, and composer/utils/checkpoint.py (resolved)
@eracah (Contributor, Author) commented on Feb 9, 2023

> For the manual test, could you also test a resumption run? I.e., I'd like to see that training for 100 batches is the same as training for 50 batches, checkpointing, and then resuming and training for another 50. It would also be good to check with all three types.

@dakinggg, already done; see this.

@eracah requested a review from a team as a code owner on February 14, 2023 00:02
@eracah requested reviews from dakinggg and bcui19 on February 14, 2023 02:56
@eracah (Contributor, Author) commented on Feb 14, 2023

Ok, @dakinggg, I added unit tests and docs. Give it another look when you get the chance.

@dakinggg (Contributor) left a comment

LGTM

Review threads on docs/source/notes/distributed_training.rst and tests/trainer/test_sharded_checkpoint.py (resolved)
eracah and others added 8 commits February 15, 2023 13:15
    * check that non-rank0 ranks only hold a shard for the "full" state dict
    * "local" state dicts have flattened shards
    * "sharded" state dicts have unflattened shards
@eracah (Contributor, Author) commented on Feb 16, 2023

@dakinggg, okay, I added the tests you asked for. Can you give it one more quick once-over, and then I'll merge it?

@dakinggg (Contributor) left a comment

LGTM, thanks for adding the extra tests!

Review threads on tests/trainer/test_sharded_checkpoint.py (resolved)
@eracah requested a review from dakinggg on February 17, 2023 01:57
@mvpatel2000 (Contributor) left a comment

LGTM

@eracah merged commit 6fd5b2d into mosaicml:dev on Feb 17, 2023