Fix AsyncCheckpointIO race condition #138

Merged
merged 2 commits into from
Oct 3, 2024
Conversation

@awonak (Contributor) commented Oct 3, 2024

Provide a mechanism for calling Checkpoint teardown when using manual checkpoint saves with AsyncCheckpointIO.

With a large layer count, the save_checkpoint() task did not complete before load_checkpoint() was called, producing a race condition in which load_checkpoint() returned a 404. The cause was the removal of the teardown step, which issues a blocking shutdown call to the thread pool executor and waits for all tasks to complete before returning. We still want to override teardown so that it is not called during trainer.fit(), but the thread pool tasks must complete before any call to load_checkpoint().

This PR introduces a new invocation point for the teardown method, which can be called after all save_checkpoint() calls and before any load_checkpoint() calls.
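
To illustrate the ordering described above, here is a minimal sketch using only the standard library. `ToyAsyncCheckpointIO`, its in-memory store, the `run` helper, and the 0.2 s delay are hypothetical stand-ins rather than this project's actual classes; the only behavior relied on is that `ThreadPoolExecutor.shutdown(wait=True)` blocks until queued tasks finish.

```python
import time
from concurrent.futures import ThreadPoolExecutor


class ToyAsyncCheckpointIO:
    """Hypothetical stand-in for an async checkpoint plugin (not the real class)."""

    def __init__(self):
        self._store = {}  # stands in for remote storage
        self._executor = ThreadPoolExecutor(max_workers=1)

    def _slow_write(self, path, checkpoint):
        time.sleep(0.2)  # simulate a slow upload of a large checkpoint
        self._store[path] = checkpoint

    def save_checkpoint(self, checkpoint, path):
        # Returns immediately; the write runs on the worker thread.
        self._executor.submit(self._slow_write, path, checkpoint)

    def load_checkpoint(self, path):
        if path not in self._store:
            # Stands in for the 404 seen when the save has not landed yet.
            raise FileNotFoundError(f"404: {path}")
        return self._store[path]

    def teardown(self):
        # Blocking shutdown: waits for every queued save to complete.
        self._executor.shutdown(wait=True)


def run(call_teardown):
    """Save, optionally tear down, then load; returns None on a simulated 404."""
    io = ToyAsyncCheckpointIO()
    io.save_checkpoint({"step": 1}, "ckpt/step-1")
    if call_teardown:
        io.teardown()  # all saves are done before any load
    try:
        return io.load_checkpoint("ckpt/step-1")
    except FileNotFoundError:
        return None
    finally:
        io.teardown()  # safe to call again; releases the worker thread
```

In this sketch, `run(call_teardown=False)` reaches the load while the write is still sleeping and hits the simulated 404, while `run(call_teardown=True)` waits for the write and returns the checkpoint.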

  • Tests pass
  • Appropriate changes to documentation are included in the PR

Provide a mechanism for calling Checkpoint teardown when using manual checkpoint saves with AsyncCheckpointIO
@awonak requested a review from a team as a code owner October 3, 2024 19:25
@awonak enabled auto-merge (squash) October 3, 2024 22:53
@awonak merged commit 1981245 into main Oct 3, 2024
5 checks passed
@awonak deleted the awonak-async-fix branch October 3, 2024 23:02
4 participants