
Integrate SSD TBE stage 1 #2078

Closed
wants to merge 1 commit

Conversation

henrylhtsang
Contributor

Summary:

Plan

Stage 1 aims to ensure that SSD TBE can run and won't break normal operations (e.g. checkpointing).

Checkpointing (i.e. state_dict and load_state_dict) is still a work in progress. We also need to guarantee checkpointing for optimizer states.

Stage 2: save state_dict (mostly on fbgemm side)

  • the current hope is that we can rely on flush to save the state dict

Stage 3: load_state_dict (need more thoughts)

  • the solution should be similar to that of PS (parameter server)

Stage 4: optimizer states checkpointing (torchrec side, should be pretty standard)

  • should be straightforward
  • needs fbgemm to support the split_embedding_weights API (see the sketch below)
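
As a rough illustration of how stages 2 and 4 are meant to fit together, here is a minimal sketch of the flush-then-save flow; the module handle, flush(), and split_embedding_weights() below are assumptions about the eventual fbgemm-side API rather than the current implementation.

```python
# Minimal sketch of the intended checkpoint flow for an SSD-backed TBE.
# ASSUMPTIONS: `ssd_tbe` is an SSD TBE module exposing `flush()` and a
# `split_embedding_weights()`-style accessor; both names describe the hoped-for
# fbgemm API (stages 2 and 4), not what exists today.
import torch


def save_ssd_tbe_checkpoint(ssd_tbe: torch.nn.Module, path: str) -> None:
    # Stage 2: flush rows cached on the GPU back to the SSD store so the
    # tensors read below reflect the latest trained values.
    ssd_tbe.flush()

    # Stage 4: per-table weight views (and, later, optimizer states).
    per_table_weights = ssd_tbe.split_embedding_weights()

    torch.save(
        {
            "state_dict": ssd_tbe.state_dict(),
            "per_table_weights": [w.cpu() for w in per_table_weights],
        },
        path,
    )
```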

Outstanding issues:

  • initialization is not the same as before
  • SSD TBE doesn't support mixed dim

design doc: https://docs.google.com/document/d/1SL1d2Os8KG46ETkCFzrIO0_QMOlFWoTrb8CJiCqGDdk/

TODO:

tests should cover

  • state dict and load state dict (done)
    • should copy dense parts and not break
  • deterministic output (done)
  • numerical equivalence to normal TBE (done)
  • changing learning rate and warm up policy (done)
  • work for different sharding types (done)
  • work with mixed kernel (done)
  • work with mixed sharding types
  • multi-gpu training (todo)

OSS

NOTE: SSD TBE won't work in an OSS environment, due to a rocksdb problem.

ad hoc

  • the SSD kernel is guarded; users must specify it in the planner constraints to use it (see the sketch below)
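
A minimal sketch of what selecting the guarded SSD kernel could look like on the planner side; the EmbeddingComputeKernel.KEY_VALUE value and the table name are illustrative assumptions, not necessarily the exact surface introduced by this PR.

```python
# Sketch of opting a table into the guarded SSD kernel via planner constraints.
# ASSUMPTION: the SSD kernel is surfaced as EmbeddingComputeKernel.KEY_VALUE;
# the table name "large_table" is illustrative. Check the enum in your torchrec
# version before relying on it.
from torchrec.distributed.embedding_types import EmbeddingComputeKernel
from torchrec.distributed.planner.types import ParameterConstraints
from torchrec.distributed.types import ShardingType

constraints = {
    # Keyed by embedding table name; only tables listed here may be placed on
    # the SSD kernel, since the planner will not propose it by default.
    "large_table": ParameterConstraints(
        sharding_types=[ShardingType.ROW_WISE.value],
        compute_kernels=[EmbeddingComputeKernel.KEY_VALUE.value],
    ),
}
# Pass `constraints` to EmbeddingShardingPlanner(..., constraints=constraints).
```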

Differential Revision: D57452256

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D57452256

Summary:
# Plan
Stage 1 aims to ensure that SSD TBE can run and won't break normal operations (e.g. checkpointing).

Checkpointing (i.e. state_dict and load_state_dict) is still a work in progress. We also need to guarantee checkpointing for optimizer states.

Stage 2: save state_dict (mostly on fbgemm side)
* the current hope is that we can rely on flush to save the state dict

Stage 3: load_state_dict (need more thoughts)
* the solution should be similar to that of PS (parameter server)

Stage 4: optimizer states checkpointing (torchrec side, should be pretty standard)
* should be straightforward
* needs fbgemm to support the split_embedding_weights API (see the sketch below)
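
To make stage 4 concrete, here is a minimal sketch of how per-table optimizer state could be gathered once fbgemm exposes split_embedding_weights / split_optimizer_states for SSD TBE; the accessor names, return shapes, and the "momentum1" key are assumptions rather than the final API.

```python
# Sketch of stage 4: assembling per-table optimizer state for checkpointing.
# ASSUMPTIONS: the SSD TBE gains split_embedding_weights() and
# split_optimizer_states() accessors mirroring the regular fused TBE, and the
# optimizer is rowwise Adagrad with a single "momentum1" state per table.
from typing import Dict, List

import torch


def ssd_tbe_optimizer_states(
    ssd_tbe, table_names: List[str]
) -> Dict[str, Dict[str, torch.Tensor]]:
    weights = ssd_tbe.split_embedding_weights()    # one weight view per table
    opt_states = ssd_tbe.split_optimizer_states()  # one list of states per table
    return {
        name: {"weight": w, "momentum1": states[0]}
        for name, w, states in zip(table_names, weights, opt_states)
    }
```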

# Outstanding issues:
* initialization is not the same as before
* SSD TBE doesn't support mixed dim

# design doc

https://docs.google.com/document/d/1SL1d2Os8KG46ETkCFzrIO0_QMOlFWoTrb8CJiCqGDdk/

# tests should cover
* state dict and load state dict (done)
  * should copy dense parts and not break
* deterministic output (done)
* numerical equivalence to normal TBE (done)
* changing learning rate and warm up policy (done)
* work for different sharding types (done)
* work with mixed kernel (done)
* work with mixed sharding types (done)

# OSS
NOTE: SSD TBE won't work in an OSS environment, due to a rocksdb problem.

# ad hoc
* the SSD kernel is guarded; users must specify it in the planner constraints to use it (see the sketch above)

# Next steps
* add multi-gpu training (todo)
* add optimizer checkpointing support (stage 4)
* add equivalence of CacheParams for PS and SSD configs
* support multi dim via bucketizer
* modify tests to check for multi dim, and remove assertion

Reviewed By: dstaay-fb

Differential Revision: D57452256