Log grad norm aggregated over all ranks, not just rank zero #2248
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/2248
Note: Links to docs will display an error until the docs builds have been completed. ✅ No failures as of commit 5f16741 with merge base b68cddd. This comment was automatically generated by Dr. CI and updates every 15 minutes.
beaut
```diff
@@ -786,7 +786,7 @@ def train(self) -> None:
                         grad_norm = torch.nn.utils.clip_grad_norm_(
                             self._model.parameters(),
                             max_norm=float(self._clip_grad_norm),
-                        )
+                        ).full_tensor()
```
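For context, a minimal sketch of the logging path this change affects (`model`, `clip_grad_norm`, and the metrics dict are placeholder names, not the recipe's actual attributes): with FSDP2-sharded parameters, `clip_grad_norm_` returns the total norm as a DTensor, and `.full_tensor()` materializes the norm aggregated over all ranks so every rank logs the same global value.

```python
# Sketch only: excerpt of a distributed training step with placeholder names.
import torch

grad_norm = torch.nn.utils.clip_grad_norm_(
    model.parameters(),
    max_norm=float(clip_grad_norm),
).full_tensor()  # DTensor (sharded norm) -> plain tensor holding the norm over all ranks
metrics = {"grad_norm": grad_norm.item()}  # same value on every rank, not just rank zero
```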
Do you think it might be a good idea to put the `.full_tensor()` behind an `isinstance(grad_norm, DTensor)` check? If e.g. DDP ever gets implemented and torchtune takes care of it behind some API (say `shard_model` allows for different types of parallelism, or PP gets added), this call will no longer be valid, and every recipe would need to be updated.
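A sketch of the guard being suggested (the import path assumes a recent PyTorch release; this is not something the recipe currently does):

```python
from torch.distributed.tensor import DTensor  # `torch.distributed._tensor` on older releases

# `grad_norm` is whatever clip_grad_norm_ returned above.
# Sharded setups (FSDP2/TP) hand back a DTensor that needs the gather;
# a plain tensor (e.g. a future DDP path) already holds the global norm.
if isinstance(grad_norm, DTensor):
    grad_norm = grad_norm.full_tensor()
```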
@mirceamironenco yes, I agree we should have the check when we enable new types of parallelism. But I also don't want to prematurely expose it (our recipes are already more complicated than I would like, and adding a check that's currently a no-op is an easy case of more code to read than we currently need). I think your TP example is a very likely case, and when we enable something like that, wrapping the grad norm logic in an appropriate utility (kinda like what you shared with me over Discord) will be the way to go. But until then I don't think we should do it. Hope that makes sense.
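For reference, the kind of utility being alluded to might look roughly like the following. This is a hypothetical helper (name and placement are illustrative, not an existing torchtune API):

```python
import torch
from torch.distributed.tensor import DTensor  # public path in recent PyTorch


def full_grad_norm(grad_norm: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: return the grad norm aggregated over all ranks.

    Lets recipes stay parallelism-agnostic: FSDP2/TP-sharded setups return a
    DTensor that must be gathered, while replicated setups (e.g. DDP) already
    hold the global norm.
    """
    if isinstance(grad_norm, DTensor):
        return grad_norm.full_tensor()
    return grad_norm


# Usage in a recipe would then be a single call, e.g.:
# grad_norm = full_grad_norm(torch.nn.utils.clip_grad_norm_(params, max_norm))
```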
Addresses #2240
cc @EugenHotaj @mirceamironenco