Implement async_checkpoint #313
Conversation
Summary: This PR implements two different async checkpoint mechanisms. The first uses DCP.async_save; the second uses pinned memory plus a separate process to avoid GIL issues. ghstack-source-id: 87fb6c28d7bc3e514c0bee7646be5188f1f66bbd Pull Request resolved: #313
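To illustrate the second mechanism, here is a minimal sketch, not the PR's actual implementation (the function names, the worker loop, and the queue-based hand-off are assumptions): GPU state is first copied into pinned host memory, and the serialization/write is handed to a separate process so the training loop's Python thread is not blocked by the GIL.

```python
import torch
import torch.multiprocessing as mp

def _save_worker(queue):
    # Runs in a separate process, so serialization and disk writes do not
    # contend with the trainer's Python thread for the GIL.
    while True:
        item = queue.get()
        if item is None:
            break
        cpu_state, path = item
        torch.save(cpu_state, path)

def stage_to_pinned_cpu(state_dict):
    # Copy CUDA tensors into pinned host memory; pinned buffers make the
    # device-to-host copy fast and let it overlap with compute.
    staged = {}
    for key, value in state_dict.items():
        if isinstance(value, torch.Tensor) and value.is_cuda:
            buf = torch.empty(value.shape, dtype=value.dtype, pin_memory=True)
            buf.copy_(value, non_blocking=True)
            staged[key] = buf
        else:
            staged[key] = value
    torch.cuda.synchronize()  # ensure all async copies have completed
    return staged

# Hypothetical usage: stage on the training process, write in the worker.
# queue = mp.Queue()
# mp.Process(target=_save_worker, args=(queue,), daemon=True).start()
# queue.put((stage_to_pinned_cpu(model.state_dict()), "checkpoint.pt"))
```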
It would be good to add an integration test for async checkpoint. cc: @fegin
    self._async_with_pinned_memory(checkpoint_id)
elif self.async_mode == AsyncMode.ASYNC:
    self.async_future = dcp.async_save(
        self.states, checkpoint_id=checkpoint_id, process_group=self.pg
    )
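For context on how the `ASYNC` branch is typically driven (a hedged sketch; the `self.async_future` handling below is an assumed caller pattern, not verified against the rest of this PR): `dcp.async_save` returns a future, and the caller waits on the previous one before issuing the next save so two checkpoints never overlap.

```python
# Assumed caller pattern for consuming the future returned by dcp.async_save.
if self.async_future is not None:
    # Wait for the previous async checkpoint to finish before starting a new one.
    self.async_future.result()

self.async_future = dcp.async_save(
    self.states, checkpoint_id=checkpoint_id, process_group=self.pg
)
```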
@fegin Why did you choose to use the GLOO process group for the async save? Is it expected to make this more efficient?
Neither the DCP docs nor https://discuss.pytorch.org/t/distributed-w-torchtitan-optimizing-checkpointing-efficiency-with-pytorch-dcp/211250 mention or recommend this.
I'm curious to know whether this was intentional and whether you have any numbers to share.
Thanks!
We don't want checkpointing to interfere with training, which uses NCCL, so we use Gloo for all checkpointing communication. Also, the main bottleneck of checkpointing is unlikely to be communication; the storage read/write (or upload/download) is the major overhead.
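To make that concrete, here is a minimal sketch (the variable names and checkpoint path are illustrative, not from this PR) of creating a dedicated Gloo process group for checkpointing so its collectives stay off the NCCL group used by training:

```python
import torch.distributed as dist
import torch.distributed.checkpoint as dcp

# A separate Gloo group keeps checkpoint communication off the NCCL
# group that training uses, so the two do not contend.
checkpoint_pg = dist.new_group(backend="gloo")

# `states` stands in for the model/optimizer state to save; the path is hypothetical.
future = dcp.async_save(
    states,
    checkpoint_id="outputs/checkpoint/step-1000",
    process_group=checkpoint_pg,
)
```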