DataLoadingThread holding up clean shutdown of evaluator application (#2745)

Summary:
Pull Request resolved: #2745

[apf] DataLoadingThread holding up clean shutdown of evaluator application

There have been various reports of the checkpoint_eval application getting stuck and being killed with SJD.

Reviewed By: sarckk

Differential Revision: D69576051

fbshipit-source-id: 46624e5b893cb65bc0fc588a1a11f6f4c710e7f9
satgera authored and facebook-github-bot committed Feb 14, 2025
1 parent 7d161d9 commit 23cf189
Showing 1 changed file with 2 additions and 1 deletion.
3 changes: 2 additions & 1 deletion torchrec/distributed/train_pipeline/utils.py
@@ -1441,8 +1441,9 @@ def __init__(
         memcpy_stream_priority: int = 0,
         memcpy_stream: Optional[torch.Stream] = None,
     ) -> None:
-        super().__init__()
+        super().__init__(name="DataLoadingThread")
         self._stop: bool = False
+        self.daemon = True  # Mark as daemon thread so that Python will not wait for it at shutdown.
         self._dataloader_iter = dataloader_iter
         self._buffer_empty_event: Event = Event()
         self._buffer_filled_event: Event = Event()
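
The effect of the change above can be illustrated with a minimal standard-library sketch. It is not the torchrec implementation: `LoaderThread` and `blocking_threads` are hypothetical names introduced only for this example. It assumes the usual CPython behavior that the interpreter waits at exit for live non-daemon threads but does not wait for daemon threads.

```python
import threading


class LoaderThread(threading.Thread):
    """Hypothetical stand-in for a background data-loading thread."""

    def __init__(self) -> None:
        # A thread name makes the thread easy to find in stack dumps and logs.
        super().__init__(name="LoaderThread")
        # Daemon threads are not waited for when the interpreter shuts down.
        # This must be set before start() is called.
        self.daemon = True
        self._stop_event = threading.Event()

    def run(self) -> None:
        # Simulate a loop that keeps waiting for work until asked to stop.
        while not self._stop_event.wait(timeout=0.05):
            pass

    def stop(self) -> None:
        self._stop_event.set()


def blocking_threads() -> list:
    """Names of live non-daemon threads other than the main thread.

    These are the threads the interpreter would wait for at exit.
    """
    return [
        t.name
        for t in threading.enumerate()
        if t is not threading.main_thread() and not t.daemon
    ]


worker = LoaderThread()
worker.start()
print(worker.name, worker.daemon, blocking_threads())
```

Because the worker is a daemon, it does not appear among the threads that would hold up shutdown, even while its loop is still running. An explicit `stop()` followed by `join()` remains the cleaner way to end it when the application can reach that code path.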