[Checkpoint V2] Upload API #3488

Merged
merged 7 commits into from
Jul 29, 2024


bigning
Contributor

@bigning bigning commented Jul 23, 2024

This PR adds the upload API (see the design doc). The API uploads the input files to a remote object store. Interface design:

from typing import Optional

from composer.core import State


def upload_file(
    dest_dir: str,
    source_path: Optional[str] = None,
    symlink_granularity: Optional[str] = None,  # 'file', 'dir', or None
    symlink_name: Optional[str] = 'latest.symlink',
    async_upload: bool = True,
    state: Optional[State] = None,
    overwrite: bool = False,
):
    """Standalone function for uploading a checkpoint file.

    This function does not upload the checkpoint itself; it initiates the upload through the RemoteUploader.

    Args:
        dest_dir (str): The directory/URI to upload the file to.
        source_path (Optional[str]): The path to the file to upload.
        symlink_granularity (Optional[str]): The granularity to use for symlinking. One of 'file', 'dir', or None.
            If None: no symlink is uploaded.
            If 'file': the RemoteUploader waits until the file (specified by source_path) is uploaded, then uploads a symlink pointing to the uploaded file.
            If 'dir': the RemoteUploader waits until all files across all ranks are uploaded to dest_dir, then uploads a symlink
                pointing to the remote directory (a prefix in object-store terminology).
        symlink_name (Optional[str]): The name to use for the symlink. Defaults to 'latest.symlink'.
        async_upload (bool): If True, the uploads are done asynchronously via the RemoteUploader and this function returns immediately.
        state (Optional[State]): If async_upload is True, state must be specified so that the RemoteUploader can be
            either extracted from state.callbacks or initialized and added to state.callbacks.
        overwrite (bool): Whether to allow overwriting existing remote checkpoint files.
    """
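
To make the documented semantics concrete, here is a minimal local sketch, not the PR's implementation. It assumes a plain directory stands in for the remote object store, a single-worker thread pool stands in for the RemoteUploader, and a symlink is a small text file containing the target path. The name `sketch_upload_file`, the returned `Future`, and the choice to place the symlink inside `dest_dir` are all hypothetical; the cross-rank wait described for 'dir' is not modeled.

```python
import shutil
import tempfile
from concurrent.futures import Future, ThreadPoolExecutor
from pathlib import Path
from typing import Optional

# Hypothetical stand-in for the RemoteUploader: one background worker.
_EXECUTOR = ThreadPoolExecutor(max_workers=1)


def sketch_upload_file(
    dest_dir: str,
    source_path: str,
    symlink_granularity: Optional[str] = None,  # 'file', 'dir', or None
    symlink_name: str = 'latest.symlink',
    async_upload: bool = True,
    overwrite: bool = False,
) -> Future:
    """Copy source_path into dest_dir, then optionally write a symlink file."""
    if symlink_granularity not in (None, 'file', 'dir'):
        raise ValueError(f'Unknown symlink_granularity: {symlink_granularity!r}')

    def _work() -> str:
        remote_dir = Path(dest_dir)
        remote_dir.mkdir(parents=True, exist_ok=True)
        remote_file = remote_dir / Path(source_path).name
        if remote_file.exists() and not overwrite:
            raise FileExistsError(f'{remote_file} exists and overwrite=False')
        shutil.copyfile(source_path, remote_file)
        # The symlink is written only after the copy has finished.
        if symlink_granularity == 'file':
            (remote_dir / symlink_name).write_text(str(remote_file))
        elif symlink_granularity == 'dir':
            # Single-process simplification: no wait for other ranks here.
            (remote_dir / symlink_name).write_text(str(remote_dir))
        return str(remote_file)

    if async_upload:
        # Hand the work to the background worker and return immediately.
        return _EXECUTOR.submit(_work)
    # Synchronous path: do the work now and return an already-completed Future.
    done: Future = Future()
    done.set_result(_work())
    return done


if __name__ == '__main__':
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / 'ckpt.pt'
        src.write_bytes(b'weights')
        remote = Path(tmp) / 'remote'
        fut = sketch_upload_file(str(remote), str(src), symlink_granularity='file')
        uploaded = fut.result()  # block until the background copy is done
        print((remote / 'latest.symlink').read_text() == uploaded)
```

In the asynchronous case the caller gets control back before the copy finishes, which is why the sketch returns a handle to wait on; in the PR's design that role is played by the RemoteUploader held in state.callbacks.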

Unit test:

python3 -m composer.cli.launcher -n 2 --master_port 26000 -m pytest -v --durations=20 -m 'not daily and not remote and gpu and (doctest or not doctest)' tests/checkpoint/test_upload.py
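
The quoted marker expression in the command above decides which tests pytest selects: each name is true when a test carries that marker. As a rough illustration only, the same boolean logic can be written as a small function; `is_selected` is a hypothetical helper, not part of pytest or Composer.

```python
from typing import Set


def is_selected(markers: Set[str]) -> bool:
    """Mirror 'not daily and not remote and gpu and (doctest or not doctest)'."""
    has = lambda name: name in markers
    return (
        not has('daily')
        and not has('remote')
        and has('gpu')
        # This clause is always true, so doctest status does not affect selection.
        and (has('doctest') or not has('doctest'))
    )


print(is_selected({'gpu'}))             # True
print(is_selected({'gpu', 'remote'}))   # False
print(is_selected(set()))               # False: the gpu marker is required
```

In effect, the command runs only GPU-marked tests in tests/checkpoint/test_upload.py that are neither daily nor remote.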

@bigning bigning requested review from eracah and mvpatel2000 July 23, 2024 20:36
@bigning bigning marked this pull request as ready for review July 23, 2024 20:36
Contributor

@eracah eracah left a comment


This is really slick! Nice work, @bigning! I will take a look at the unit test tomorrow. One request: can we now remove the symlink upload logic from CheckpointSaver? We can just have CheckpointSaver use CheckpointUploadCallback.

@bigning bigning requested a review from eracah July 25, 2024 22:00
@karan6181
Contributor

@bigning, please do not attach the doc link, as not everyone has access to it. Please add a detailed description of what this PR enables and how you tested it.

@bigning
Contributor Author

bigning commented Jul 27, 2024

@bigning, please do not attach the doc link, as not everyone has access to it. Please add a detailed description of what this PR enables and how you tested it.

@karan6181, I added the interface to the description and also kept the design doc there, so reviewers can see the whole picture of all the APIs.

how did you test it

This was already in the description.

@bigning bigning merged commit c32e7d8 into mosaicml:dev Jul 29, 2024
14 checks passed