Remove OOM-Driven FSDP Deadlocks and Increase Throughput of Automicrobatching #3510
Conversation
Great write-up in the PR description!
First pass. Need to think through this carefully after comments are addressed, but I think it's right.
This is a huge PR 🎉
Could you please also add the before/after training loss comparison, and make sure the loss is on par before merging?
Thanks for the clean PR description and nice debugging work! Two qq:
One high-level comment: it might be good to split this up into 4 different PRs, each containing one of the four sources of deadlock along with a repro example + throughput improvement. Would be good for posterity when we are reviewing/thinking through each of the components of the deadlocks.
1 is a good catch, I just switched the two headings - thrashing decreases dtms but still increases throughput.
Looks good as a v1.
I agree with @j316chuck in general we should cut this up into smaller PRs, but I think it's fine at this point to merge given we've reviewed and signed off.
@dakinggg feel free to block if you're concerned about thrashing, but I'm pro merging with the cautious approach and we can revisit in follow-on PR
@mvpatel2000 yeah I think it's ok to merge as is and revisit as needed
Just wanted to check - has this been tested for scenarios where you are starting from "older" composer checkpoints?
@jacobfulano it should be independent of checkpoint loading
AFAIK, checkpointing saves automicrobatching info, but once it's loaded, it's never used, so I don't think this should be a problem? Let me know though if you're referring to something else and I'll take a look!
Prior to this PR, automicrobatching suffered from both low reliability, in the form of consistent deadlocks when using FSDP, and decreased throughput, compared to if `device_train_microbatch_size` were manually set to the value `auto` found. This PR addresses both those issues, removing the sources of OOM-driven FSDP deadlocks and implementing a more intelligent sync hook adding/dropping method that allows `auto` to perform as well as when `dtms` is manually set.

Reliability:

Hooks:
With FSDP, there are 4 different sources of deadlocks, which become consistent especially when model size grows to 30b (a minimal sketch of the sync hooks follows this list):

1. Before the forward of an FSDP module, some ranks may OOM and some ranks may not, leading to the OOMing ranks all_reducing when they are trying to see if other ranks are OOMing, and the non-OOMing ranks all_gathering as they unshard and reshard while continuing into the FSDP modules.
   a) Solution: Register `module.register_forward_pre_hook(sync_hook, prepend=True)` on FSDP modules.
2. Before the backward of an FSDP module, some ranks may OOM and some ranks may not, again leading to the OOMing ranks all_reducing when they are trying to see if other ranks are OOMing, and the non-OOMing ranks all_gathering as they unshard and reshard.
   a) Solution: Register `module.register_full_backward_pre_hook(sync_hook, prepend=True)` on FSDP modules.
3. In the FSDP post-backward hook, `post_backward_reshard` prefetches the unshard, moving the unshard from the beginning of the next backward to the end of this backward.
   a) Solution: Register `module.register_full_backward_hook(sync_hook)` on the non-FSDP original modules.
4. Within the FSDP post-backward hook, the call to unshard reallocates memory, causing some ranks to OOM and others not to. Since there is no syncing within FSDP's native hook, this can lead to deadlock when half the ranks OOM during that realloc and half don't.
   a) Solution: Monkeypatch a sync hook right before the realloc; proof of adaptive patch working: `test-patch-mpt-125m-chinchilla-regression-LfuGDp`
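All four fixes rely on the same idea: every rank agrees on whether any rank has OOMed before the next FSDP collective, so no rank is left waiting in an all_gather while another sits in an all_reduce. Below is a minimal, illustrative sketch of what such a sync hook and its registration could look like; it is not the exact Composer implementation, and `oom_state`, `sync_hook`, and `register_sync_hooks` are hypothetical names.

```python
# Illustrative sketch only: not the exact Composer implementation.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP


class _OOMState:
    """Per-rank flag that the trainer would set when it catches a CUDA OOM."""
    found_cuda_oom: bool = False


oom_state = _OOMState()


def sync_hook(*args, **kwargs):
    """Make every rank agree on whether any rank OOMed before the next FSDP collective."""
    # Assumes the default process group has already been initialized by the trainer.
    flag = torch.tensor([1.0 if oom_state.found_cuda_oom else 0.0], device='cuda')
    dist.all_reduce(flag, op=dist.ReduceOp.MAX)
    if flag.item() > 0:
        # Every rank raises together, so no rank is stranded in an unshard all_gather.
        raise RuntimeError('CUDA out of memory encountered on a different rank')


def register_sync_hooks(model: torch.nn.Module) -> list:
    """Register the sync hooks described in sources 1-3 above; returns the hook handles."""
    handles = []
    for module in model.modules():
        if isinstance(module, FSDP):
            # Sources 1 and 2: sync before FSDP's pre-forward and pre-backward unshards.
            handles.append(module.register_forward_pre_hook(sync_hook, prepend=True))
            handles.append(module.register_full_backward_pre_hook(sync_hook, prepend=True))
        else:
            # Source 3: sync on the original (non-FSDP) modules so the check runs
            # before the post-backward reshard prefetches the next unshard.
            handles.append(module.register_full_backward_hook(sync_hook))
    # Source 4 (the realloc inside FSDP's post-backward hook) has no module hook
    # point, which is why it needs a monkeypatch instead.
    return handles
```

Registering with `prepend=True` is what puts the check ahead of FSDP's own pre-forward/pre-backward logic, so the agreement happens before any rank issues its unshard collectives.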
Thrashing:

Thrashing occurs when we are close to the GPU's maximum memory and `alloc_retries` consistently occur, leading to lower throughput and a risk of OOMing. If we detect `alloc_retries` for two consecutive batches, we consider it thrashing and search downwards for a smaller microbatch size, treating it as if it were an OOM (see the sketch below).

Run with thrashing check: `mpt-30b-auto-fix-egrtom`
Run without thrashing check: `mpt-30b-auto-fix-ZQmoZp`
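As an illustration of the two-consecutive-batches rule, here is a minimal sketch of how thrashing could be detected from the CUDA caching allocator's statistics. `ThrashingDetector` is a hypothetical name and the real trainer logic is more involved, but `num_alloc_retries` is the relevant counter exposed by `torch.cuda.memory_stats()`.

```python
# Illustrative sketch only (assumed names, not Composer's exact code): new
# allocator retries on two consecutive batches are treated like an OOM.
import torch


class ThrashingDetector:

    def __init__(self) -> None:
        self._prev_num_alloc_retries = 0
        self._consecutive_thrashing_batches = 0

    def _batch_thrashed(self) -> bool:
        """Return True if the allocator reported new alloc retries since the last call."""
        num_alloc_retries = torch.cuda.memory_stats().get('num_alloc_retries', 0)
        thrashed = num_alloc_retries > self._prev_num_alloc_retries
        self._prev_num_alloc_retries = num_alloc_retries
        return thrashed

    def should_shrink_microbatch(self) -> bool:
        """Call once per batch; True after thrashing on two consecutive batches."""
        if self._batch_thrashed():
            self._consecutive_thrashing_batches += 1
        else:
            self._consecutive_thrashing_batches = 0
        return self._consecutive_thrashing_batches >= 2


# Hypothetical usage inside a training loop:
#     if detector.should_shrink_microbatch():
#         # Treat it like an OOM: search downward for a smaller microbatch size.
#         device_train_microbatch_size //= 2
```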
Throughput
We turn all hooks on when we are searching for a microbatch size for the first time, right after we finish an eval in case of a memory spike, when we detect thrashing, and if we hit even a single OOM. We leave these hooks on until we can successfully run batches with the selected microbatch size for 3 consecutive batches. Then, to increase throughput, we drop all hooks until we hit one of the events listed above.
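A minimal sketch of that enable/run/drop policy is below; `SyncHookManager` is a hypothetical name (not Composer's actual API), and it assumes the `register_sync_hooks` helper from the earlier sketch is in scope.

```python
# Illustrative sketch only of the hook enable/disable policy described above.
import torch

SUCCESSES_BEFORE_DROPPING_HOOKS = 3


class SyncHookManager:

    def __init__(self, model: torch.nn.Module) -> None:
        self._model = model
        self._handles: list = []
        self._consecutive_successes = 0

    @property
    def hooks_active(self) -> bool:
        return len(self._handles) > 0

    def enable_hooks(self) -> None:
        """Turn all sync hooks on: first microbatch-size search, right after an eval,
        on detected thrashing, or on any OOM."""
        if not self.hooks_active:
            # Hypothetical helper from the earlier sketch.
            self._handles = register_sync_hooks(self._model)
        self._consecutive_successes = 0

    def record_successful_batch(self) -> None:
        """Drop all hooks once the selected microbatch size runs cleanly 3 batches in a row."""
        self._consecutive_successes += 1
        if self.hooks_active and self._consecutive_successes >= SUCCESSES_BEFORE_DROPPING_HOOKS:
            for handle in self._handles:
                handle.remove()
            self._handles = []
```

The trainer would call `enable_hooks()` on each of the trigger events above and `record_successful_batch()` after every batch that completes without an OOM or thrashing.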
This allows us to achieve the same throughput as if we had manually set the `dtms` to what `auto` found.

Experiments
The throughput improvement depends on whether training is GPU- or CPU-bound, which in turn depends on model size and architecture.
MPT-125M, 1 Node:
MPT-1B, 1 Node:
MPT-7B, 1 Node:
MPT-13B, 2 Node:
MPT-30B, 2 Node:
Credits: One unit test, `test_automicrobatching_fsdp`, borrows classes introduced in an earlier WIP PR by @bigning.

Successful regression test: `mcli logs -f llm-foundry-regression-tests-runner-PW7pJe`