(6/n) Support 2D Parallelism - Trainer example #19879
Conversation
The branch was force-pushed from 92c6d16 to d81249b.
The branch was force-pushed from 82a417f to 23a6848.
⚡ Required checks status: All passing 🟢
Groups summary: 🟢 lightning_fabric: Azure GPU
These checks are required after the changes to
Thank you for your contribution! 💜
🚀
inputs = batch[:, :-1]
labels = batch[:, 1:]
output = self.model(inputs)
with loss_parallel():
TIL. Are you also enabling this for backward? https://github.com/pytorch/pytorch/blob/5fb11cda4fe60c1a7b30e6c844f84ce8933ef953/torch/distributed/tensor/parallel/loss.py#L35
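For context, here is a minimal sketch (plain PyTorch, not the PR's Trainer code) of the usual pattern: loss_parallel() stays active for both the loss computation and the backward pass, since it changes how the sharded logits are reduced. The tensor names are assumptions taken from the snippet above.

import torch.nn.functional as F
from torch.distributed.tensor.parallel import loss_parallel

with loss_parallel():
    # `output` holds the (tensor-parallel) logits, `labels` the shifted targets (assumed names).
    loss = F.cross_entropy(output.reshape(-1, output.size(-1)), labels.reshape(-1))
    # backward also runs under the context manager so the sharded loss gradients are handled correctly
    loss.backward()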
What does this PR do?
Adds an example of 2D parallelism (Tensor Parallelism + FSDP2) for the Trainer. It is equivalent to the Fabric example added in #19846, so the model code etc. is copy-pasted. The main file to review is examples/pytorch/tensor_parallel/train.py. This PR depends on #19878.
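As a rough illustration of the 2D layout such an example targets (a sketch with assumed mesh sizes and dimension names, not the example's actual code): FSDP2 shards parameters across one mesh dimension while tensor parallelism splits individual layers across the other.

from torch.distributed.device_mesh import init_device_mesh

# Hypothetical 4-GPU layout: 2-way data parallel (FSDP2) x 2-way tensor parallel.
mesh_2d = init_device_mesh("cuda", (2, 2), mesh_dim_names=("data_parallel", "tensor_parallel"))
dp_mesh = mesh_2d["data_parallel"]    # dimension used for FSDP2 sharding
tp_mesh = mesh_2d["tensor_parallel"]  # dimension used for tensor parallelism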
📚 Documentation preview 📚: https://pytorch-lightning--19879.org.readthedocs.build/en/19879/
cc @Borda @carmocca @justusschock @awaelchli