[Feat] DeepSpeed single file saving #6900
Conversation
Overall LGTM! Small nits.
Codecov Report
@@            Coverage Diff            @@
##           master    #6900     +/-   ##
==========================================
- Coverage      92%      87%       -5%
==========================================
  Files         194      194
  Lines       12346    12355        +9
==========================================
- Hits        11327    10688      -639
- Misses       1019     1667      +648
Hi @SeanNaren, this particular test is failing.
I'm also noticing the container is failing when DeepSpeed is added to the extras. I might move to having DeepSpeed installed within the GPU container instead of the extras file.
@SeanNaren totally separate from this PR, but have you seen checkpoint saving become a bottleneck at all with deepspeed? I think there's room to optimize the saving logic inside the lightning checkpoint callback, since we are potentially re-running the broadcast/state dict modification 2 times here: https://github.com/PyTorchLightning/pytorch-lightning/blob/fe0d08899eba94d275ff42253f495d9e70d86f89/pytorch_lightning/callbacks/model_checkpoint.py#L278-L286
If multiple of these modes are used, then it would definitely be an expensive operation. Currently I've only tested with
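For reference, the optimization suggested above could look roughly like the following sketch. This is not the actual ModelCheckpoint code; the callables and names here are hypothetical and only illustrate the "gather the state dict once, write it to every target path" pattern.

```python
# Hedged sketch only: perform the expensive gather/broadcast once and reuse the
# resulting checkpoint dict for every file that needs writing. These helpers are
# hypothetical, not pytorch_lightning's actual ModelCheckpoint internals.
from typing import Callable, Dict, Optional


def save_monitor_and_last(
    dump_checkpoint: Callable[[], Dict],
    write_checkpoint: Callable[[Dict, str], None],
    monitor_path: str,
    last_path: Optional[str] = None,
) -> None:
    checkpoint = dump_checkpoint()               # expensive gather/broadcast runs once
    write_checkpoint(checkpoint, monitor_path)   # monitored / top-k checkpoint
    if last_path is not None:
        write_checkpoint(checkpoint, last_path)  # "last" checkpoint reuses the same dict
```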
What does this PR do?
DeepSpeed has released a new version that supports gathering all partitioned weights to a single process and saving them from there. As discussed in #6691, it's probably best for now to prioritise single-file saving, since it is the easiest to deploy. As a result, I've made this the default behaviour when using ZeRO Stage 3.
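A minimal usage sketch of the new default, assuming the `DeepSpeedPlugin(stage=3)` interface available around this PR; exact plugin/Trainer arguments may differ between Lightning versions, and `MyLightningModule` is a placeholder for your own module.

```python
# A minimal sketch, assuming the DeepSpeedPlugin(stage=3) interface; exact
# argument names may differ between Lightning versions.
import pytorch_lightning as pl
from pytorch_lightning.plugins import DeepSpeedPlugin

model = MyLightningModule()  # hypothetical LightningModule defined elsewhere
trainer = pl.Trainer(
    gpus=2,
    precision=16,
    plugins=DeepSpeedPlugin(stage=3),
)
trainer.fit(model)

# With this change, ZeRO Stage 3 partitioned weights are gathered to a single
# process by default, so this produces one consolidated checkpoint file.
trainer.save_checkpoint("single_file.ckpt")
```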
In addition, if a user uses the `configure_sharded_model` hook, it is necessary to call this in the `load_from_checkpoint` function, as the test indicates via the model hook. This could potentially be done for the user, but for now I'll make it clear in the docs (in a separate PR) that this needs to be handled manually.

cc @ananthsub @tchaton
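To illustrate the manual step described above, here is a hedged sketch (not the PR's actual test) of a module whose layers are created in `configure_sharded_model` and rebuilt on load so the consolidated state dict can be applied; all names are illustrative.

```python
# Hedged sketch: layers are created inside configure_sharded_model (so ZeRO
# Stage 3 can partition them as they are instantiated), and the hook is invoked
# again on load so the layers exist before the saved state dict is applied.
import torch
import torch.nn as nn
import pytorch_lightning as pl


class ShardedBoringModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self._built = False

    def configure_sharded_model(self) -> None:
        if not self._built:
            self.block = nn.Sequential(nn.Linear(32, 32), nn.ReLU(), nn.Linear(32, 2))
            self._built = True

    def on_load_checkpoint(self, checkpoint) -> None:
        # Manually rebuild the sharded layers so load_from_checkpoint can
        # populate them; per this PR, Lightning does not yet do this for you.
        self.configure_sharded_model()

    def forward(self, x):
        return self.block(x)

    def training_step(self, batch, batch_idx):
        return self.forward(batch).sum()

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


# Restoring the consolidated single-file checkpoint produced with ZeRO Stage 3:
model = ShardedBoringModel.load_from_checkpoint("single_file.ckpt")
```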
Before submitting
PR review
Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing, make sure you have read the review guidelines.
Did you have fun?
Make sure you had fun coding 🙃