Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use a temp path to save local checkpoints for remote save path #3673

Merged
merged 5 commits into from
Oct 22, 2024

Conversation

irenedea
Copy link
Contributor

@irenedea irenedea commented Oct 21, 2024

What does this PR do?

Uses a temp path to save local checkpoints for remote save paths.

This avoids issues where the current path has limits on file sizes or permission restrictions. This also enables more configurability because we can configure the default temp dir if needed.

Testing

Tested on interactive, saving to a remote dbfs path. You can see that the checkpoints are now saved to a local folder:
image

And the checkpoints are uploaded correctly to the remote location in mlflow artifacts:
image

Test runs
multi-nodetest-tmpdir-checkpoints-remote-16-Ekpzdy
single-node test-tmpdir-checkpoints-remote-2eaXbb

What issue(s) does this change relate to?

Before submitting

  • Have you read the contributor guidelines?
  • Is this change a documentation change or typo fix? If so, skip the rest of this checklist.
  • Was this change discussed/approved in a GitHub issue first? It is much more likely to be merged if so.
  • Did you update any related docs and document your change?
  • Did you update any related tests and add any new tests related to your change? (see testing)
  • Did you run the tests locally to make sure they pass?
  • Did you run pre-commit on your change? (see the pre-commit section of prerequisites)

@irenedea irenedea marked this pull request as draft October 21, 2024 23:37
@irenedea irenedea changed the title Tmpdir checkpoints Use a temp path to save local checkpoints for remote save path Oct 22, 2024
@irenedea irenedea requested review from bigning and a team October 22, 2024 03:57
@irenedea irenedea marked this pull request as ready for review October 22, 2024 03:57
Copy link
Contributor

@snarayan21 snarayan21 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good, but want someone on checkpointing side to have a look

composer/callbacks/checkpoint_saver.py Show resolved Hide resolved
Copy link
Contributor

@mvpatel2000 mvpatel2000 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not 100% sure this is OK. /tmp folders may not have enough space to store all ckpts, depending on machine...

@willgleich would this have worked on azure cluster? Any general takes?

composer/callbacks/checkpoint_saver.py Outdated Show resolved Hide resolved
Co-authored-by: Mihir Patel <[email protected]>
@jerrychen109
Copy link
Contributor

jerrychen109 commented Oct 22, 2024

I'm not 100% sure this is OK. /tmp folders may not have enough space to store all ckpts, depending on machine...

@willgleich would this have worked on azure cluster? Any general takes?

On Azure, we mount the NVMe SSDs onto /tmp (and /mnt/data). They're both larger and more performant than the local node storage, and for finetuning at least, we've actually hit out-of-disk failures in the past when we haven't been using the NVMe mounts properly.

@irenedea If we don't want to make a general change to all of composer, Victor has a PR out to use the NVMe path as the working directory for finetuning specifically. That might be safer in case of unknown differences across machines?

@irenedea
Copy link
Contributor Author

irenedea commented Oct 22, 2024

@mvpatel2000 @jerrychen109 Another benefit of using a tmpdir is that we can configure it via environment variable, so we can redirect it easily if needed to (which we'll need to do for notebooks). My preference would be to always use a tmpdir where possible, especially when saving out large files bc of this configurability.

Copy link
Contributor

@mvpatel2000 mvpatel2000 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should monitor this closely. Probably fine.

@willgleich
Copy link
Member

willgleich commented Oct 22, 2024

Yeah this seems fine to me, the only worry I see here is checkpoints filling up the disk. /tmp is special in azure as its one of the two mount points for the nvme. On our other clusters we leverage ephemeral storage so /tmp would be included in that.

That being said, tempfile.mkdtemp requires manual clean up for the files and directory. Something to consider here if you are hoping for these to be taken care of.

There is a context manager that can handle this in a clean way if you do want automatic clean up. tempfile.TemporaryDirectory()

@irenedea irenedea merged commit ef42f54 into main Oct 22, 2024
14 checks passed
@irenedea irenedea deleted the tmpdir-checkpoints branch October 22, 2024 20:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants