Fix AsyncCheckpointIO race condition #138
Merged
Provide a mechanism for calling checkpoint teardown when using manual checkpoint saves with AsyncCheckpointIO.

When using a large layer count, the save_checkpoint() task was not completing prior to calling load_checkpoint(), resulting in a race condition where load_checkpoint() would return a 404. This was caused by the removal of the teardown step, which issues a blocking shutdown call to the ThreadPoolExecutor, waiting for all pending tasks to complete before returning. We still want to override teardown so that it is not called during trainer.fit(), but we need to ensure the thread pool tasks complete before making any calls to load_checkpoint().

This PR introduces a new invocation point for the teardown method, which can be called after all save_checkpoint() calls and before any load_checkpoint() calls.