[torch2.4] Fix sharded checkpointing backward compatibility issue #3565

Merged: 5 commits, Aug 21, 2024


@bigning bigning commented Aug 20, 2024

torch 2.4 breaks backward compatibility for sharded checkpointing. It changed how the save_planner and load_planner flatten the state dict keys (torch issue: pytorch/pytorch#133923), so the new load_planner can't load checkpoints saved with the old save_planner.

This PR monkey-patches the load_planner when it fails to load an old checkpoint, then removes the patch once the checkpoint has been loaded.
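The patch-on-failure, restore-after-load pattern described above can be sketched roughly as follows. This is a toy illustration only, not the code in this PR: `LoadPlanner`, `flatten`, `legacy_flatten_patch`, and `load_with_fallback` are hypothetical stand-ins, and the two key separators (`/` and `.`) are invented to simulate a mismatch between flattening schemes. They are not the actual torch 2.3 or 2.4 key formats.

```python
from contextlib import contextmanager


def flatten(state, sep, prefix=""):
    """Flatten a nested dict into {joined_path: value} using `sep` (toy scheme)."""
    flat = {}
    for key, value in state.items():
        path = f"{prefix}{sep}{key}" if prefix else str(key)
        if isinstance(value, dict):
            flat.update(flatten(value, sep, path))
        else:
            flat[path] = value
    return flat


class LoadPlanner:
    """Hypothetical stand-in for a load planner; NOT the torch class."""

    def flatten(self, state):
        # Assumed "new" flattening scheme.
        return flatten(state, sep="/")

    def load(self, template, checkpoint):
        # Look up every flattened template key in the saved checkpoint.
        return {key: checkpoint[key] for key in self.flatten(template)}


@contextmanager
def legacy_flatten_patch():
    """Temporarily swap in the assumed "old" scheme, then always restore."""
    original = LoadPlanner.flatten
    LoadPlanner.flatten = lambda self, state: flatten(state, sep=".")
    try:
        yield
    finally:
        LoadPlanner.flatten = original


def load_with_fallback(template, checkpoint):
    planner = LoadPlanner()
    try:
        return planner.load(template, checkpoint)
    except KeyError:
        # The keys did not match: retry with the legacy flattening patched in.
        with legacy_flatten_patch():
            return planner.load(template, checkpoint)
```

Restoring the original method in a `finally` block keeps the patch scoped to the single failing load, so later loads of checkpoints saved in the new format are unaffected even if the retry itself raises.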

test

daily test: https://github.com/mosaicml/composer/actions/runs/10480275642

@bigning bigning marked this pull request as ready for review August 20, 2024 22:42
@bigning bigning requested a review from a team as a code owner August 20, 2024 22:42

@mvpatel2000 mvpatel2000 left a comment


LGTM! A few minor style notes before approval.

Can you also please list here: https://docs.google.com/document/d/1m-qWWN3mMTmQOePx0ip5dge1I71_RKXa5QmJm2eFwh4/edit#heading=h.aoxj18348987 under monkeypatches and link github issue?

Resolved review comments on:
composer/trainer/_patch_pytorch.py
composer/utils/checkpoint.py

bigning commented Aug 21, 2024

> LGTM! A few minor style notes before approval.
>
> Can you also please list here: https://docs.google.com/document/d/1m-qWWN3mMTmQOePx0ip5dge1I71_RKXa5QmJm2eFwh4/edit#heading=h.aoxj18348987 under monkeypatches and link github issue?

Added to the doc.


@mvpatel2000 mvpatel2000 left a comment


LGTM! Will merge as this unblocks main

@mvpatel2000 mvpatel2000 merged commit e88b8ed into main Aug 21, 2024
14 checks passed
@mvpatel2000 mvpatel2000 deleted the torch2.4_ckpt_fix branch August 21, 2024 02:17
2 participants