[WIP] Use torch 2.2 distributed checkpoint APIs for FSDP #19497
Conversation
Force-pushed from 97469de to fb62091
Force-pushed from fb62091 to bf021fe
Currently blocked by pytorch/pytorch#119800 (comment)
if _TORCH_GREATER_EQUAL_2_2:
    from torch.distributed.checkpoint.state_dict import StateDictOptions, set_model_state_dict

# `cpu_offload` disabled because when used with `full_state_dict` only rank 0 loads the state dict
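For context, a minimal sketch (not the PR's actual code) of how the torch 2.2 `set_model_state_dict` API can restore a full, unsharded state dict; the `model` and `checkpoint` names are placeholders:

import torch
from torch.distributed.checkpoint.state_dict import StateDictOptions, set_model_state_dict


def load_full_state_dict(model: torch.nn.Module, checkpoint: dict) -> None:
    # full_state_dict=True: each rank receives the complete, unsharded state dict.
    # cpu_offload is left at its default (False) per the note above: combined with
    # full_state_dict it would mean only rank 0 loads the state dict.
    options = StateDictOptions(full_state_dict=True, cpu_offload=False)
    set_model_state_dict(model, checkpoint["state_dict"], options=options)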
Notice that the other path sets `rank0_only=False`. I asked if this could be configurable in pytorch/pytorch#112837 (comment)
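For comparison, a hedged sketch of the pre-2.2 path referred to here, where `FullStateDictConfig(rank0_only=False)` makes every rank materialize the full state dict; the function and argument choices are illustrative, not Lightning's actual code:

import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import FullStateDictConfig, StateDictType


def legacy_load_full_state_dict(module: torch.nn.Module, checkpoint: dict) -> None:
    # rank0_only=False means every rank gathers and loads the full state dict,
    # which is the behavior the linked PyTorch comment asks to make configurable.
    config = FullStateDictConfig(rank0_only=False)
    with FSDP.state_dict_type(module, StateDictType.FULL_STATE_DICT, config):
        module.load_state_dict(checkpoint["state_dict"])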
@@ -440,6 +439,7 @@ def save_checkpoint(
)
if filter is not None and self._state_dict_type == "sharded":
    # https://github.com/pytorch/pytorch/issues/105379
    # FIXME: revisit support with new APIs
Reminder to myself
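For context, a rough sketch of what applying such a key filter before a sharded (distributed) save could look like; the function name, filter signature, and the `dcp.save` call are assumptions for illustration, not the strategy's actual implementation:

from pathlib import Path
from typing import Any, Callable, Dict, Optional

import torch.distributed.checkpoint as dcp


def save_filtered_sharded(
    state_dict: Dict[str, Any],
    path: Path,
    filter_fn: Optional[Callable[[str, Any], bool]] = None,
) -> None:
    # Drop filtered-out keys before handing the sharded state dict to the
    # distributed checkpoint writer; every rank applies the same filter.
    if filter_fn is not None:
        state_dict = {k: v for k, v in state_dict.items() if filter_fn(k, v)}
    dcp.save(state_dict, checkpoint_id=str(path))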
Closing since the new model parallel strategy is most likely the future. Implemented in #19852
What does this PR do?
Fixes #19462
Resources
TODO:
- `_TORCH_GREATER_EQUAL_2_2 = False` since CI only tests 2.2 (version-guard sketch below)

📚 Documentation preview 📚: https://pytorch-lightning--19497.org.readthedocs.build/en/19497/
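As a side note, a minimal sketch of how a version guard like `_TORCH_GREATER_EQUAL_2_2` is typically derived (the helper actually used in the repo may differ); hard-coding it to False forces the legacy path even on newer torch, which is useful when CI only exercises 2.2:

import torch
from packaging.version import Version

# True when the installed torch is expected to provide the new distributed checkpoint APIs.
_TORCH_GREATER_EQUAL_2_2 = Version(torch.__version__.split("+")[0]) >= Version("2.2.0")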