Log grad norm aggregated over all ranks, not just rank zero #2248
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/2248
Note: Links to docs will display an error until the docs builds have been completed. ✅ No failures as of commit 5f16741 with merge base b68cddd. This comment was automatically generated by Dr. CI and updates every 15 minutes.
beaut
```diff
@@ -786,7 +786,7 @@ def train(self) -> None:
                         grad_norm = torch.nn.utils.clip_grad_norm_(
                             self._model.parameters(),
                             max_norm=float(self._clip_grad_norm),
-                        )
+                        ).full_tensor()
```
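For context, a minimal sketch of the logging path this change affects (`model`, `clip_grad_norm`, and the metrics dict are placeholder names, not the recipe's actual attributes): with FSDP2-sharded parameters, `clip_grad_norm_` returns the total norm as a DTensor, and `.full_tensor()` materializes the norm aggregated over all ranks so every rank logs the same global value.

```python
# Sketch only: excerpt of a distributed training step with placeholder names.
import torch

grad_norm = torch.nn.utils.clip_grad_norm_(
    model.parameters(),
    max_norm=float(clip_grad_norm),
).full_tensor()  # DTensor (sharded norm) -> plain tensor holding the norm over all ranks
metrics = {"grad_norm": grad_norm.item()}  # same value on every rank, not just rank zero
```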
Do you think it might be a good idea to put the `.full_tensor()` behind an `isinstance(grad_norm, DTensor)` check? If e.g. DDP ever gets implemented and torchtune takes care of it behind some API (say `shard_model` allows for different types of parallelism, or PP gets added), this call will no longer be valid, and every recipe would need to be updated.
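A sketch of the guard being suggested (the import path assumes a recent PyTorch release; this is not something the recipe currently does):

```python
from torch.distributed.tensor import DTensor  # `torch.distributed._tensor` on older releases

# `grad_norm` is whatever clip_grad_norm_ returned above.
# Sharded setups (FSDP2/TP) hand back a DTensor that needs the gather;
# a plain tensor (e.g. a future DDP path) already holds the global norm.
if isinstance(grad_norm, DTensor):
    grad_norm = grad_norm.full_tensor()
```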
@mirceamironenco yes, I agree we should have the check when we enable new types of parallelism. But I also don't want to prematurely expose it (our recipes are already more complicated than I would like, and adding a check that's currently a no-op is an easy case of more code to read than we currently need). I think your TP example is a very likely case, and when we enable something like that, wrapping the grad norm logic in an appropriate utility (kinda like what you shared with me over Discord) will be the way to go. But until then I don't think we should do it. Hope that makes sense.
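For reference, the kind of utility being alluded to might look roughly like the following. This is a hypothetical helper (name and placement are illustrative, not an existing torchtune API):

```python
import torch
from torch.distributed.tensor import DTensor  # public path in recent PyTorch


def full_grad_norm(grad_norm: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: return the grad norm aggregated over all ranks.

    Lets recipes stay parallelism-agnostic: FSDP2/TP-sharded setups return a
    DTensor that must be gathered, while replicated setups (e.g. DDP) already
    hold the global norm.
    """
    if isinstance(grad_norm, DTensor):
        return grad_norm.full_tensor()
    return grad_norm


# Usage in a recipe would then be a single call, e.g.:
# grad_norm = full_grad_norm(torch.nn.utils.clip_grad_norm_(params, max_norm))
```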
Addresses #2240
cc @EugenHotaj @mirceamironenco