Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement async_checkpoint #313

Merged
merged 1 commit into from
May 7, 2024
Merged

Implement async_checkpoint #313

merged 1 commit into from
May 7, 2024

Conversation

fegin
Copy link
Contributor

@fegin fegin commented May 7, 2024

Stack from ghstack (oldest at bottom):

Summary:
This PR implements 2 different async checkpoint. The first one is to use
DCP.async_save another one is to use pinned memory + a seperate process
to avoid GILs issue.

[ghstack-poisoned]
fegin added a commit that referenced this pull request May 7, 2024
Summary:
This PR implements 2 different async checkpoint. The first one is to use
DCP.async_save another one is to use pinned memory + a seperate process
to avoid GILs issue.

ghstack-source-id: 87fb6c28d7bc3e514c0bee7646be5188f1f66bbd
Pull Request resolved: #313
@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label May 7, 2024
@fegin fegin merged commit 1e8fd10 into gh/fegin/2/base May 7, 2024
4 checks passed
fegin added a commit that referenced this pull request May 7, 2024
Summary:
This PR implements 2 different async checkpoint. The first one is to use
DCP.async_save another one is to use pinned memory + a seperate process
to avoid GILs issue.

ghstack-source-id: 87fb6c28d7bc3e514c0bee7646be5188f1f66bbd
Pull Request resolved: #313
@fegin fegin deleted the gh/fegin/2/head branch May 7, 2024 20:42
@gnadathur
Copy link
Contributor

It would be good to add an integration test for async checkpoint cc: @fegin

tianyu-l pushed a commit to tianyu-l/torchtitan_intern24 that referenced this pull request Aug 16, 2024
Summary:
This PR implements 2 different async checkpoint. The first one is to use
DCP.async_save another one is to use pinned memory + a seperate process
to avoid GILs issue.

ghstack-source-id: 87fb6c28d7bc3e514c0bee7646be5188f1f66bbd
Pull Request resolved: pytorch#313
philippguevorguian pushed a commit to YerevaNN/YNNtitan that referenced this pull request Aug 17, 2024
Summary:
This PR implements 2 different async checkpoint. The first one is to use
DCP.async_save another one is to use pinned memory + a seperate process
to avoid GILs issue.

ghstack-source-id: 87fb6c28d7bc3e514c0bee7646be5188f1f66bbd
Pull Request resolved: pytorch#313
self._async_with_pinned_memory(checkpoint_id)
elif self.async_mode == AsyncMode.ASYNC:
self.async_future = dcp.async_save(
self.states, checkpoint_id=checkpoint_id, process_group=self.pg
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fegin Why did you choose to use the GLOO process group for the async save? Is it expected to make this more efficient?

Neither the DCP docs nor https://discuss.pytorch.org/t/distributed-w-torchtitan-optimizing-checkpointing-efficiency-with-pytorch-dcp/211250 mention or recommend this.

I'm curious to know if this was on purpose and if you have any numbers to show

Thanks!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't want the checkpointing to affect the training, which involve NCCL. So we choose gloo for any checkpointing communication. Also the main bottleneck of checkpointing is unlikely to be the communication. The storage read/write (or upload/download) will be the major overhead.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Meta Open Source bot.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants