Fix AsyncCheckpointIO race condition #138

awonak · 2024-10-03T19:25:38Z

Provide a mechanism for calling Checkpoint teardown when using manual checkpoint saves with AsyncCheckpointIO.

With a large layer count, the save_checkpoint() task did not complete before load_checkpoint() was called, so load_checkpoint() returned a 404. The race was introduced when the teardown step was removed: teardown issues a blocking shutdown call to the thread pool executor and waits for all tasks to complete before returning. We still want to override teardown so that it is not called during trainer.fit(), but the thread pool tasks must complete before any call to load_checkpoint().

This PR introduces a new invocation point for the teardown method, which can be called after all save_checkpoint() calls and before any load_checkpoint() calls.
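
For readers unfamiliar with the pattern, here is a minimal, self-contained sketch of the ordering described above. It does not use the code from this PR: `SketchAsyncCheckpointIO`, its in-memory `_store`, and the simulated delay are all hypothetical stand-ins, and only the standard-library `ThreadPoolExecutor` is real. It assumes saves are submitted to a thread pool and that teardown performs a blocking shutdown.

```python
import time
from concurrent.futures import ThreadPoolExecutor


class SketchAsyncCheckpointIO:
    """Hypothetical stand-in for an async checkpoint writer (not the PR's class)."""

    def __init__(self):
        self._executor = ThreadPoolExecutor(max_workers=1)
        self._store = {}  # in-memory stand-in for remote storage

    def _write(self, path, checkpoint):
        time.sleep(0.2)  # simulate a slow write of a large checkpoint
        self._store[path] = checkpoint

    def save_checkpoint(self, checkpoint, path):
        # Returns immediately; the write runs on the worker thread.
        self._executor.submit(self._write, path, checkpoint)

    def load_checkpoint(self, path):
        if path not in self._store:
            # Plays the role of the 404 seen when a load races a pending save.
            raise FileNotFoundError(path)
        return self._store[path]

    def teardown(self):
        # Blocking shutdown: returns only after every queued save has finished.
        self._executor.shutdown(wait=True)


io = SketchAsyncCheckpointIO()
io.save_checkpoint({"step": 1}, "ckpt-1")
io.teardown()  # call after all saves and before any load
loaded = io.load_checkpoint("ckpt-1")
```

In this sketch, calling `load_checkpoint()` without the preceding `teardown()` would most likely raise, because the worker thread has not yet finished the write.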

Tests pass
Appropriate changes to documentation are included in the PR

Fix AsyncCheckpointIO race condition

900abfe

Provide a mechanism for calling Checkpoint teardown when using manual checkpoint saves with AsyncCheckpointIO

awonak requested a review from a team as a code owner October 3, 2024 19:25

awonak requested review from MattIrv, jdnurme and abhibyreddi October 3, 2024 19:25

abhibyreddi approved these changes Oct 3, 2024

MattIrv approved these changes Oct 3, 2024

jdnurme approved these changes Oct 3, 2024

Merge branch 'main' into awonak-async-fix

b1e3968

awonak enabled auto-merge (squash) October 3, 2024 22:53

awonak merged commit 1981245 into main Oct 3, 2024
5 checks passed

awonak deleted the awonak-async-fix branch October 3, 2024 23:02
