Add checkpointing support (#9)
* Add checkpointing support

* address PR feedback #1
bernardhan33 authored Mar 4, 2024
1 parent 9ba3f32 commit 3e676d4
Showing 3 changed files with 93 additions and 2 deletions.
30 changes: 29 additions & 1 deletion README.md
@@ -4,6 +4,8 @@ The Dataflux Dataset for PyTorch lets you connect directly to a GCS bucket as a

The Dataflux Dataset for PyTorch implements PyTorch’s [dataset primitive](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html) that can be used to efficiently load training data from GCS. The library currently supports [map-style datasets](https://pytorch.org/docs/stable/data.html#map-style-datasets) for random data access patterns.
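
As a quick illustration of PyTorch's map-style contract, the sketch below shows the two methods any map-style dataset must implement; the class and data here are illustrative stand-ins, not part of the Dataflux API.

```python
from torch.utils.data import Dataset


class ToyMapStyleDataset(Dataset):
    """A minimal map-style dataset: random access by index plus a length."""

    def __init__(self, items):
        self.items = items

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        # Called with arbitrary indices, e.g. by a shuffling DataLoader.
        return self.items[idx]


dataset = ToyMapStyleDataset([b"obj-0", b"obj-1", b"obj-2"])
print(len(dataset), dataset[1])
```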

Furthermore, the Dataflux Dataset for PyTorch provides a checkpointing interface to conveniently save and load checkpoints directly to and from a Google Cloud Storage (GCS) bucket.

Note that the Dataflux Dataset for PyTorch library is in an early preview stage, and the team is actively working on improvements and support for new features.

## Getting started
@@ -30,8 +32,8 @@ gcloud auth application-default login
### Examples
Before getting started, please make sure you have installed the library and configured authentication following the instructions above.

#### Data Loading
The Dataflux Dataset for PyTorch can be constructed by specifying the project name, the bucket name, and an optional prefix.

```python
from dataflux_pytorch import dataflux_mapstyle_dataset

@@ -82,6 +84,32 @@ for each_object in dataset:
print(each_object)
```
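
The diff collapses the body of this example, so the following is a hedged reconstruction for context: the `DataFluxMapStyleDataset` class name and its keyword parameters are assumptions inferred from the import above, not confirmed by this diff; consult the repository README for the exact constructor signature.

```python
from dataflux_pytorch import dataflux_mapstyle_dataset

# Hypothetical placeholders; substitute your own project and bucket.
PROJECT_NAME = "my-gcp-project"
BUCKET_NAME = "my-training-bucket"

# Assumed constructor: the class name and parameters are inferred from the
# import above and may differ from the actual library API.
dataset = dataflux_mapstyle_dataset.DataFluxMapStyleDataset(
    project_name=PROJECT_NAME,
    bucket_name=BUCKET_NAME,
)

for each_object in dataset:
    print(each_object)
```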

#### Checkpointing

In addition to fast data loading, the Dataflux Dataset for PyTorch lets you save and load model checkpoints directly to and from a Google Cloud Storage (GCS) bucket.

```python
import torch
import torchvision

from dataflux_pytorch import dataflux_checkpoint

ckpt = dataflux_checkpoint.DatafluxCheckpoint(
project_name=PROJECT_NAME, bucket_name=BUCKET_NAME
)
CKPT_PATH = "checkpoints/epoch0.ckpt"

model = torchvision.models.resnet50()

with ckpt.writer(CKPT_PATH) as writer:
torch.save(model.state_dict(), writer)

with ckpt.reader(CKPT_PATH) as reader:
read_state_dict = torch.load(reader)

model.load_state_dict(read_state_dict)
```
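
Because checkpoints are ordinary GCS objects, saving one per epoch only requires varying the object name. A minimal sketch, reusing the `ckpt` object from the example above; `train_one_epoch` and `NUM_EPOCHS` are illustrative stand-ins for your own training loop, not part of the Dataflux API.

```python
NUM_EPOCHS = 3  # illustrative

for epoch in range(NUM_EPOCHS):
    train_one_epoch(model)  # stand-in for your own training step
    with ckpt.writer(f"checkpoints/epoch{epoch}.ckpt") as writer:
        torch.save(model.state_dict(), writer)
```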

## Performance
We tested Dataflux's early performance using [DLIO benchmark](https://github.com/argonne-lcf/dlio_benchmark) simulations with standard mean file sizes and dataset sizes. A total of 5 training epochs were simulated. For small files (100KB, 500KB), Dataflux can be **2-3x** faster than using GCS native APIs.

63 changes: 63 additions & 0 deletions dataflux_pytorch/dataflux_checkpoint.py
@@ -0,0 +1,63 @@
"""
Copyright 2024 Google LLC
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
https://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
"""

from typing import Optional

from google.api_core.client_info import ClientInfo
from google.cloud import storage
from google.cloud.storage.fileio import BlobReader, BlobWriter


class DatafluxCheckpoint:
    """Implements the interface for saving and loading model checkpoints.

    The reader and writer methods return a BlobReader and a BlobWriter
    respectively, both of which implement io.BufferedIOBase. They can
    therefore be passed safely to torch.load() and torch.save() to load
    and save model checkpoints.
    """

def __init__(
self,
project_name: str,
bucket_name: str,
storage_client: Optional[storage.Client] = None,
):
"""Initializes the DatafluxCheckpoint.
Args:
project_name: The name of the GCP project.
bucket_name: The name of the GCS bucket that is going to hold the checkpoint.
storage_client: The google.cloud.storage.Client object initiated with sufficient
permission to access the project and the bucket. If not specified, it will
be created during initialization with background authentication.
"""
self.project_name = project_name
self.bucket_name = bucket_name
self.storage_client = storage_client
if not storage_client:
self.storage_client = storage.Client(
project=self.project_name,
client_info=ClientInfo(user_agent="dataflux/0.0"),
)
self.bucket = self.storage_client.bucket(self.bucket_name)

    def reader(self, object_name: str) -> BlobReader:
        """Opens the named GCS object for reading and returns a BlobReader."""
        blob = self.bucket.blob(object_name)
        return blob.open("rb")

    def writer(self, object_name: str) -> BlobWriter:
        """Opens the named GCS object for writing and returns a BlobWriter.

        ignore_flush=True is set because BlobWriter raises on flush(), and
        torch.save() flushes at points that do not align with upload chunks.
        """
        blob = self.bucket.blob(object_name)
        return blob.open("wb", ignore_flush=True)
