-
Notifications
You must be signed in to change notification settings - Fork 431
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use a temp path to save local checkpoints for remote save path #3673
Conversation
43403ea
to
3a7c025
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good, but want someone on checkpointing side to have a look
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not 100% sure this is OK. /tmp
folders may not have enough space to store all ckpts, depending on machine...
@willgleich would this have worked on azure cluster? Any general takes?
Co-authored-by: Mihir Patel <[email protected]>
On Azure, we mount the NVMe SSDs onto @irenedea If we don't want to make a general change to all of composer, Victor has a PR out to use the NVMe path as the working directory for finetuning specifically. That might be safer in case of unknown differences across machines? |
@mvpatel2000 @jerrychen109 Another benefit of using a tmpdir is that we can configure it via environment variable, so we can redirect it easily if needed to (which we'll need to do for notebooks). My preference would be to always use a tmpdir where possible, especially when saving out large files bc of this configurability. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should monitor this closely. Probably fine.
Yeah this seems fine to me, the only worry I see here is checkpoints filling up the disk. /tmp is special in azure as its one of the two mount points for the nvme. On our other clusters we leverage ephemeral storage so /tmp would be included in that. That being said, tempfile.mkdtemp requires manual clean up for the files and directory. Something to consider here if you are hoping for these to be taken care of. There is a context manager that can handle this in a clean way if you do want automatic clean up. tempfile.TemporaryDirectory() |
What does this PR do?
Uses a temp path to save local checkpoints for remote save paths.
This avoids issues where the current path has limits on file sizes or permission restrictions. This also enables more configurability because we can configure the default temp dir if needed.
Testing
Tested on interactive, saving to a remote dbfs path. You can see that the checkpoints are now saved to a local folder:
And the checkpoints are uploaded correctly to the remote location in mlflow artifacts:
Test runs
multi-node
test-tmpdir-checkpoints-remote-16-Ekpzdy
single-node
test-tmpdir-checkpoints-remote-2eaXbb
What issue(s) does this change relate to?
Before submitting
pre-commit
on your change? (see thepre-commit
section of prerequisites)