Summary:
Plan
Stage 1: make sure it runs and does not break during normal operations (e.g. checkpointing).
Checkpointing (i.e. state_dict and load_state_dict) is still a work in progress. We also need to guarantee checkpointing for optimizer states.
Stage 2: save state_dict (mostly on fbgemm side)
Stage 3: load_state_dict (needs more thought)
Stage 4: optimizer states checkpointing (torchrec side, should be pretty standard)
Outstanding issues:
design doc
TODO:
tests should cover
OSS
NOTE: SSD TBE won't work in an OSS environment because of a RocksDB issue.
ad hoc
Differential Revision: D57452256