
Integrate SSD TBE stage 1 #2078

Closed
wants to merge 1 commit

Conversation

henrylhtsang
Contributor

Summary:

Plan

Stage 1 aims to ensure that SSD TBE can run and won't break normal operations (e.g. checkpointing).

Checkpointing (i.e. state_dict and load_state_dict) is still a work in progress. We also need to guarantee checkpointing for optimizer states.

Stage 2: save state_dict (mostly on fbgemm side)

  • the current hope is that we can rely on flush to save the state dict

Stage 3: load_state_dict (need more thoughts)

  • the solution should be similar to that of PS (parameter server)

Stage 4: optimizer states checkpointing (torchrec side, should be pretty standard)

  • should be straightforward
  • needs fbgemm to support the split_embedding_weights API (see the sketch below)
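
As a rough illustration of how stages 2 and 4 are meant to fit together, here is a minimal sketch of the flush-then-save flow; the module handle, flush(), and split_embedding_weights() below are assumptions about the eventual fbgemm-side API rather than the current implementation.

```python
# Minimal sketch of the intended checkpoint flow for an SSD-backed TBE.
# ASSUMPTIONS: `ssd_tbe` is an SSD TBE module exposing `flush()` and a
# `split_embedding_weights()`-style accessor; both names describe the hoped-for
# fbgemm API (stages 2 and 4), not what exists today.
import torch


def save_ssd_tbe_checkpoint(ssd_tbe: torch.nn.Module, path: str) -> None:
    # Stage 2: flush rows cached on the GPU back to the SSD store so the
    # tensors read below reflect the latest trained values.
    ssd_tbe.flush()

    # Stage 4: per-table weight views (and, later, optimizer states).
    per_table_weights = ssd_tbe.split_embedding_weights()

    torch.save(
        {
            "state_dict": ssd_tbe.state_dict(),
            "per_table_weights": [w.cpu() for w in per_table_weights],
        },
        path,
    )
```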

Outstanding issues:

  • initialization is not the same as before
  • SSD TBE doesn't support mixed dim

design doc: https://docs.google.com/document/d/1SL1d2Os8KG46ETkCFzrIO0_QMOlFWoTrb8CJiCqGDdk/

TODO:

tests should cover

  • state dict and load state dict (done)
    • should copy dense parts and not break
  • deterministic output (done)
  • numerical equivalence to normal TBE (done)
  • changing learning rate and warm up policy (done)
  • work for different sharding types (done)
  • work with mixed kernel (done)
  • work with mixed sharding types
  • multi-gpu training (todo)

OSS

NOTE: SSD TBE won't work in an OSS environment, due to a rocksdb problem.

ad hoc

  • the SSD kernel is guarded; users must specify it in the planner constraints to use it (see the sketch below)
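
A minimal sketch of what selecting the guarded SSD kernel could look like on the planner side; the EmbeddingComputeKernel.KEY_VALUE value and the table name are illustrative assumptions, not necessarily the exact surface introduced by this PR.

```python
# Sketch of opting a table into the guarded SSD kernel via planner constraints.
# ASSUMPTION: the SSD kernel is surfaced as EmbeddingComputeKernel.KEY_VALUE;
# the table name "large_table" is illustrative. Check the enum in your torchrec
# version before relying on it.
from torchrec.distributed.embedding_types import EmbeddingComputeKernel
from torchrec.distributed.planner.types import ParameterConstraints
from torchrec.distributed.types import ShardingType

constraints = {
    # Keyed by embedding table name; only tables listed here may be placed on
    # the SSD kernel, since the planner will not propose it by default.
    "large_table": ParameterConstraints(
        sharding_types=[ShardingType.ROW_WISE.value],
        compute_kernels=[EmbeddingComputeKernel.KEY_VALUE.value],
    ),
}
# Pass `constraints` to EmbeddingShardingPlanner(..., constraints=constraints).
```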

Differential Revision: D57452256

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D57452256

Summary:
# Plan
Stage 1 aims to ensure that SSD TBE can run and won't break normal operations (e.g. checkpointing).

Checkpointing (i.e. state_dict and load_state_dict) is still a work in progress. We also need to guarantee checkpointing for optimizer states.

Stage 2: save state_dict (mostly on fbgemm side)
* the current hope is that we can rely on flush to save the state dict

Stage 3: load_state_dict (need more thoughts)
* the solution should be similar to that of PS (parameter server)

Stage 4: optimizer states checkpointing (torchrec side, should be pretty standard)
* should be straightforward
* needs fbgemm to support the split_embedding_weights API (see the sketch below)
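
To make stage 4 concrete, here is a minimal sketch of how per-table optimizer state could be gathered once fbgemm exposes split_embedding_weights / split_optimizer_states for SSD TBE; the accessor names, return shapes, and the "momentum1" key are assumptions rather than the final API.

```python
# Sketch of stage 4: assembling per-table optimizer state for checkpointing.
# ASSUMPTIONS: the SSD TBE gains split_embedding_weights() and
# split_optimizer_states() accessors mirroring the regular fused TBE, and the
# optimizer is rowwise Adagrad with a single "momentum1" state per table.
from typing import Dict, List

import torch


def ssd_tbe_optimizer_states(
    ssd_tbe, table_names: List[str]
) -> Dict[str, Dict[str, torch.Tensor]]:
    weights = ssd_tbe.split_embedding_weights()    # one weight view per table
    opt_states = ssd_tbe.split_optimizer_states()  # one list of states per table
    return {
        name: {"weight": w, "momentum1": states[0]}
        for name, w, states in zip(table_names, weights, opt_states)
    }
```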

# Outstanding issues:
* initialization is not the same as before
* SSD TBE doesn't support mixed dim

# design doc

https://docs.google.com/document/d/1SL1d2Os8KG46ETkCFzrIO0_QMOlFWoTrb8CJiCqGDdk/

# tests should cover
* state dict and load state dict (done)
  * should copy dense parts and not break
* deterministic output (done)
* numerical equivalence to normal TBE (done)
* changing learning rate and warm up policy (done)
* work for different sharding types (done)
* work with mixed kernel (done)
* work with mixed sharding types (done)

# OSS
NOTE: SSD TBE won't work in an OSS environment, due to a rocksdb problem.

# ad hoc
* the SSD kernel is guarded; users must specify it in the planner constraints to use it (see the sketch above)

# Next steps
* add multi-gpu training (todo)
* add optimizer checkpointing support (stage 4)
* add equivalence of CacheParams for PS and SSD configs
* support multi dim via bucketizer
* modify tests to check for multi dim, and remove assertion

Reviewed By: dstaay-fb

Differential Revision: D57452256