Directories missing in GCS file_mounts #1154

Closed
lhqing opened this issue Sep 4, 2022 · 6 comments · Fixed by #1312

@lhqing
Contributor

lhqing commented Sep 4, 2022

Hi Skypilot team

Problem
I am using a file mount on a GCS bucket, but I found that many directories are missing from the mounted path on the VM.

For example:

# in sky.yaml file
file_mounts:
  /data:
    source: gs://test-bucket/
    mode: MOUNT
# if I transfer a file to my bucket like this
touch test.txt
gsutil cp test.txt gs://test-bucket/test-new-dir/test.txt

# then, on the VM, test-new-dir and test.txt do not appear
less /data/test-new-dir/test.txt  # file does not exist

# the file only appears after I explicitly create the directory
mkdir /data/test-new-dir/
less /data/test-new-dir/test.txt

Cause
I think this is because gcsfuse does not infer directories that have no corresponding object in the bucket unless the --implicit-dirs flag is used (which has performance drawbacks), as explained here:
https://github.com/GoogleCloudPlatform/gcsfuse/blob/master/docs/semantics.md#implicit-directories
https://stackoverflow.com/questions/38311036/folders-not-showing-up-in-bucket-storage
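To make the cause concrete, here is a minimal sketch of the distinction described above. It works on a plain list of object names rather than a real bucket, and the function name `split_directories` and the sample object names are illustrative assumptions, not part of gcsfuse or SkyPilot. A directory is treated as explicit only when a placeholder object with a trailing slash exists for it; otherwise it is merely implied by the names of the objects beneath it.

```python
from pathlib import PurePosixPath


def split_directories(object_names):
    """Return (explicit, implicit) directory sets for a flat list of object names.

    Explicit: a placeholder object such as "dir/" exists in the bucket.
    Implicit: the directory is only implied by a longer object name.
    """
    explicit = {name.rstrip("/") for name in object_names if name.endswith("/")}
    implied = set()
    for name in object_names:
        # Every ancestor of an object (or placeholder) is at least implied.
        for parent in PurePosixPath(name.rstrip("/")).parents:
            if str(parent) != ".":
                implied.add(str(parent))
    return explicit, implied - explicit


# Hypothetical bucket listing: one directory created with mkdir through the
# mount, one created only as a side effect of `gsutil cp`.
objects = ["explicit-dir/", "explicit-dir/a.txt", "test-new-dir/test.txt"]
explicit, implicit = split_directories(objects)
print("explicit:", sorted(explicit))
print("implicit:", sorted(implicit))
```

In this sketch, `test-new-dir` lands in the implicit set, which matches the directory that stays hidden in the example above until it is created with mkdir.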

And SkyPilot does not pass the --implicit-dirs flag when it mounts the bucket:

skypilot/sky/data/storage.py

Lines 1146 to 1150 in f4e785c

mount_cmd = ('gcsfuse -o allow_other '
             f'--stat-cache-capacity {self._STAT_CACHE_CAPACITY} '
             f'--stat-cache-ttl {self._STAT_CACHE_TTL} '
             f'--type-cache-ttl {self._TYPE_CACHE_TTL} '
             f'{self.bucket.name} {mount_path}')
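For comparison, a standalone sketch of how the same command could be assembled with the flag made optional. This is not the SkyPilot implementation: the function name `build_gcsfuse_mount_cmd`, the `implicit_dirs` parameter, and the cache values used as defaults are assumptions for illustration only (the real values live in the `_STAT_CACHE_*` and `_TYPE_CACHE_TTL` constants, which are not shown in the excerpt).

```python
import shlex


def build_gcsfuse_mount_cmd(bucket_name, mount_path, *, implicit_dirs=True,
                            stat_cache_capacity=4096, stat_cache_ttl="5s",
                            type_cache_ttl="5s"):
    """Build a gcsfuse command line. Cache defaults here are placeholders."""
    parts = [
        "gcsfuse", "-o", "allow_other",
        "--stat-cache-capacity", str(stat_cache_capacity),
        "--stat-cache-ttl", stat_cache_ttl,
        "--type-cache-ttl", type_cache_ttl,
    ]
    if implicit_dirs:
        # Ask gcsfuse to infer directories from object name prefixes.
        parts.append("--implicit-dirs")
    parts += [bucket_name, mount_path]
    # Quote each token so unusual bucket names or paths stay intact.
    return " ".join(shlex.quote(p) for p in parts)


print(build_gcsfuse_mount_cmd("test-bucket", "/data"))
print(build_gcsfuse_mount_cmd("test-bucket", "/data", implicit_dirs=False))
```

Keeping the flag behind a parameter leaves room for the per-user control discussed later in this thread.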

Question
Many of the directory structures in my buckets are created implicitly by gsutil cp or similar tools, so most of my data is not visible through the current SkyPilot file mount.

I wonder

  1. Should there be a way to add --implicit-dirs flag in GCS related file mount?
  2. Do you have other suggestions for fixing this missing-directory issue? Another way I can think of is to recursively create all directories myself with mkdir on the VM.
@lhqing
Contributor Author

lhqing commented Sep 5, 2022

For now, I ran the following on the VM, on top of the gcsfuse-backed file mount. After that, I can see all my files under file_mounts.

import pathlib

from gcsfs import GCSFileSystem

bucket_name = 'test-bucket'
mount_name = '/data'

# collect the parent directory of every object in the bucket
fs = GCSFileSystem()
files = fs.glob(f'{bucket_name}/**/*')
file_dirs = set()
for path in files:
    # swap only the leading bucket name for the mount point
    file_dir = str(pathlib.Path(path).parent).replace(bucket_name, mount_name, 1)
    file_dirs.add(file_dir)

# create the directories through the mount
for p in file_dirs:
    pathlib.Path(p).mkdir(exist_ok=True, parents=True)

@romilbhardwaj
Collaborator

romilbhardwaj commented Sep 5, 2022

Thanks for the detailed report @lhqing! As you correctly identified, we don't add --implicit-dirs when mounting GCS buckets for performance and cost reasons.

As a GCS user, would you like to have --implicit-dirs enabled by default? We can enable it by default if you think it's an important default to have.

As you mention, it would be good to expose this control to the user, but we also want to be careful about adding complexity to our YAML schema. Perhaps we should consider adding a YAML field dev_options with arbitrary key value pairs to the YAML root. We can then add a mount_options key there.

@lhqing
Contributor Author

lhqing commented Sep 5, 2022

Hi @romilbhardwaj, I think for now I can work around this problem by creating the directories myself. It's OK not to add --implicit-dirs.

I am not sure how much latency and cost this flag would add if enabled, so it might be better to keep gcsfuse's default setting. But the missing data did cause some confusion at first, so a notice or warning could be added to the docs about using file_mounts on GCP.

The dev_options idea sounds good to me, I'd vote for that feature.

@romilbhardwaj romilbhardwaj self-assigned this Sep 5, 2022
@Michaelvll
Collaborator

For now, I ran the following on the VM, on top of the gcsfuse-backed file mount. After that, I can see all my files under file_mounts.

import pathlib

from gcsfs import GCSFileSystem

bucket_name = 'test-bucket'
mount_name = '/data'

# collect the parent directory of every object in the bucket
fs = GCSFileSystem()
files = fs.glob(f'{bucket_name}/**/*')
file_dirs = set()
for path in files:
    # swap only the leading bucket name for the mount point
    file_dir = str(pathlib.Path(path).parent).replace(bucket_name, mount_name, 1)
    file_dirs.add(file_dir)

# create the directories through the mount
for p in file_dirs:
    pathlib.Path(p).mkdir(exist_ok=True, parents=True)

As a first step, probably we can add this solution to our docs, i.e. asking the user to mkdir -p or os.makedirs the parent directory before using the file.
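A minimal sketch of that os.makedirs suggestion, assuming the user knows the path of the file they want to read. The helper name `ensure_parent_dir` is hypothetical, and the demo runs against a temporary directory that stands in for the mount point (such as /data), so it does not touch a real bucket.

```python
import os
import tempfile


def ensure_parent_dir(file_path):
    """Create the parent directory of file_path if it is missing; return it."""
    parent = os.path.dirname(os.path.abspath(file_path))
    # exist_ok=True makes the call safe to repeat, like `mkdir -p`.
    os.makedirs(parent, exist_ok=True)
    return parent


# Demo: a temporary directory stands in for the mounted path.
with tempfile.TemporaryDirectory() as mount_root:
    target = os.path.join(mount_root, "test-new-dir", "test.txt")
    created = ensure_parent_dir(target)
    print("directory exists:", os.path.isdir(created))
```

On a real mount, the call would look like `ensure_parent_dir('/data/test-new-dir/test.txt')` before opening the file, mirroring the mkdir step in the original report.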

@romilbhardwaj
Collaborator

Continuing discussion from #1296.

From @concretevitamin

we need to think about ways to expose these underlying performance vs functionality knobs and flags to power users, perhaps through a ~/.sky/config file. E.g., some users may need --implicit-dirs flag for gcsfuse (mentioned in #1154).

Why don't we add --implicit-dirs ourselves? A GCP blog that turns on this flag and the flag in this PR by default: https://cloud.google.com/blog/topics/developers-practitioners/cloud-storage-file-system-vertex-ai-workbench-notebooks. Maybe these two flags are enough for ML workloads.

I'm currently looking into the performance implications of --implicit-dirs, will update this issue with findings.

@romilbhardwaj
Collaborator

I did some simple performance measurements with and without --implicit-dirs. See benchmark numbers at the end.

  • I don't see a significant performance difference with --implicit-dirs on or off.
  • It is hard to measure the cost implications, but I expect them to be small.
  • There exists an alternative solution, which creates the directory objects right after mounting:
BUCKET=my-bucket MOUNTED_AT=/path/to/mount; gsutil ls "gs://$BUCKET/*/*/**" | xargs dirname | sort | uniq | sed "s/gs:\/\/$BUCKET\///" | xargs -I % mkdir -p "$MOUNTED_AT/%"
  • However, this alternative solution won't work for public buckets, since the user may not have permission to create directory objects in the bucket.

Conclusion - It should be okay to enable --implicit-dirs. I'll submit a PR.

Benchmark numbers

Using fio to measure sequential read performance:

fio --name=64kseqreads --rw=read --direct=1 --ioengine=libaio --bs=64k --numjobs=4 --iodepth=128 --size=1G --group_reporting --directory=/noimplicit/ --output-format=json > ~/perf_read_noimplicit.json
| Metric | With --implicit-dirs | Without --implicit-dirs |
|---|---|---|
| Sequential Read Throughput (MB/s) | 267.13 | 262 |
| Sequential Read IOPS | 4076 | 3994.25 |
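To put the fio numbers in proportion, a small sketch that computes the relative difference between the two configurations. The values are copied from the table above; the helper name `relative_diff_percent` is an assumption for illustration, and the result describes these reported runs only, not a statistical comparison.

```python
def relative_diff_percent(with_flag, without_flag):
    """Percent change of the with-flag value relative to the without-flag value."""
    return (with_flag - without_flag) / without_flag * 100.0


# (with --implicit-dirs, without --implicit-dirs), as reported above.
fio_results = {
    "Sequential Read Throughput (MB/s)": (267.13, 262.0),
    "Sequential Read IOPS": (4076.0, 3994.25),
}

for metric, (with_flag, without_flag) in fio_results.items():
    diff = relative_diff_percent(with_flag, without_flag)
    print(f"{metric}: {diff:+.2f}%")
```

Both metrics differ by roughly 2%, which is consistent with the observation that the flag makes no significant difference for sequential reads.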

Using simple shell commands to measure the makespan of reading and writing 100 small files:

time for i in {0..100}; do echo 'test' > "test${i}.txt"; done # Write
time for i in {0..100}; do cat "test${i}.txt" > /dev/null; done # Read
ls -la # stat operation
| Metric | With --implicit-dirs | Without --implicit-dirs |
|---|---|---|
| Write Makespan (s) | 58.2 | 57.1 |
| Read Makespan (s) | 2.8 | 2.9 |
| ls Makespan (s) | 7.3 | 6.9 |
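For anyone who wants to repeat the small-file test from Python instead of the shell, here is a rough equivalent of the loops above. It is a sketch under assumptions: the function name `small_file_makespan` and its parameters are hypothetical, and the demo writes to a temporary directory rather than a gcsfuse mount, so its timings say nothing about GCS. Reads shortly after writes may also be served from a cache, so treat the numbers as indicative only.

```python
import os
import tempfile
import time


def small_file_makespan(directory, num_files=100, payload="test\n"):
    """Return (write_seconds, read_seconds) for num_files small files."""
    paths = [os.path.join(directory, f"test{i}.txt") for i in range(num_files)]

    # Write phase: create every file one after another.
    start = time.perf_counter()
    for path in paths:
        with open(path, "w") as f:
            f.write(payload)
    write_seconds = time.perf_counter() - start

    # Read phase: read every file back in the same order.
    start = time.perf_counter()
    for path in paths:
        with open(path) as f:
            f.read()
    read_seconds = time.perf_counter() - start

    return write_seconds, read_seconds


# Demo on a scratch directory; point `directory` at the mount to benchmark it.
with tempfile.TemporaryDirectory() as scratch:
    write_s, read_s = small_file_makespan(scratch, num_files=10)
    print(f"write: {write_s:.3f}s, read: {read_s:.3f}s")
```

Running it once against a mount with --implicit-dirs and once without would give a comparison in the same spirit as the table above.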
