Directories missing in GCS file_mounts #1154
Comments
For now, I did the following in the VM with the file mount:

import pathlib

from gcsfs import GCSFileSystem

bucket_name = 'test-bucket'
mount_name = '/data'

# collect all dir paths in the bucket
fs = GCSFileSystem()
files = fs.glob(f'{bucket_name}/**/*')
file_dirs = set()
for path in files:
    # map "<bucket>/a/b/file" to "<mount>/a/b" (replace only the leading bucket name)
    file_dir = str(pathlib.Path(path).parent).replace(bucket_name, mount_name, 1)
    file_dirs.add(file_dir)

# create dirs
for p in file_dirs:
    pathlib.Path(p).mkdir(exist_ok=True, parents=True)
Thanks for the detailed report @lhqing! As you correctly identified, we don't add --implicit-dirs. As a GCS user, would you like to have --implicit-dirs enabled? As you mention, it would be good to expose this control to the user, but we also want to be careful about adding complexity to our YAML schema. Perhaps we should consider adding a YAML field.
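To make the idea of a user-facing switch concrete, here is a minimal sketch of an opt-in toggle. It is not SkyPilot's actual mounting code: the helper name `build_gcsfuse_command` and its `implicit_dirs` parameter are hypothetical, and the only assumption about gcsfuse itself is the usual `gcsfuse [flags] bucket mountpoint` invocation with the `--implicit-dirs` flag discussed in this issue.

```python
import shlex


def build_gcsfuse_command(bucket_name, mount_path, implicit_dirs=False):
    """Build a gcsfuse mount command as an argument list.

    Hypothetical helper for illustration only. The flag is added only when
    the caller opts in, so gcsfuse's default behavior is kept otherwise.
    """
    cmd = ["gcsfuse"]
    if implicit_dirs:
        cmd.append("--implicit-dirs")
    cmd.extend([bucket_name, mount_path])
    return cmd


# Show the command line that would be run for an opt-in mount.
print(shlex.join(build_gcsfuse_command("test-bucket", "/data", implicit_dirs=True)))
```

Keeping the default at `False` matches the preference voiced in this thread for gcsfuse's default setting unless the user asks otherwise.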
Hi @romilbhardwaj, I think for now I can fix this problem manually by creating the dirs myself. It's OK not to add --implicit-dirs. I am not sure how much latency and cost this parameter will add if enabled, so I guess it might be better to keep gcsfuse's default setting. But the missing data did create some confusion in the beginning, so I guess some notice/warning can be added to the doc.
As a first step, we can probably add this solution to our docs, i.e. asking the user to create the missing directories manually.
Continuing discussion from #1296. From @concretevitamin
I'm currently looking into the performance implications of --implicit-dirs.
I did some simple performance measurements with and without --implicit-dirs.
Conclusion - It should be okay to enable --implicit-dirs.
Benchmark numbers
Using fio to measure sequential read performance:
Using simple shell commands to test 100 small file read/write makespan:
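The shell commands and numbers from that measurement are not shown here. For anyone who wants to repeat a similar check on their own mount, the following is a rough Python sketch of a small-file makespan test. The function name, the 4 KiB file size, and the file naming are assumptions for illustration, not the commands used above; reads may also be served from cache, so treat the read number as a lower bound.

```python
import os
import tempfile
import time


def small_file_makespan(directory, num_files=100, size_bytes=4096):
    """Write then read `num_files` small files and return (write_s, read_s)."""
    payload = os.urandom(size_bytes)
    paths = [os.path.join(directory, f"bench_{i:04d}.bin") for i in range(num_files)]

    # Time the sequential writes of all files.
    start = time.perf_counter()
    for p in paths:
        with open(p, "wb") as f:
            f.write(payload)
    write_s = time.perf_counter() - start

    # Time the sequential reads of the same files.
    start = time.perf_counter()
    for p in paths:
        with open(p, "rb") as f:
            f.read()
    read_s = time.perf_counter() - start

    # Clean up so repeated runs start from the same state.
    for p in paths:
        os.remove(p)
    return write_s, read_s


if __name__ == "__main__":
    # Replace the temporary directory with a path inside the mount, e.g. under /data.
    with tempfile.TemporaryDirectory() as d:
        w, r = small_file_makespan(d)
        print(f"write: {w:.3f}s, read: {r:.3f}s")
```

Running it once on a mount with --implicit-dirs and once without gives a like-for-like comparison of the two settings.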
Hi Skypilot team
Problem
I am using file mount on a GCS bucket, but I found many directories missing in the mounted path on VM.
For example:
Cause
I think this is because gcsfuse does not implicitly create directories that don't have corresponding objects, unless the --implicit-dirs flag is used (which has performance drawbacks), as explained here:
https://github.com/GoogleCloudPlatform/gcsfuse/blob/master/docs/semantics.md#implicit-directories
https://stackoverflow.com/questions/38311036/folders-not-showing-up-in-bucket-storage
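The difference can be illustrated with a simplified model in Python. This is only a sketch of the semantics described in the links above, not gcsfuse's implementation: it assumes that an explicit directory is backed by a placeholder object whose name ends in "/", and it ignores the fact that a nested directory is only reachable when its parents are visible too. The function name `visible_dirs` is made up for this example.

```python
from pathlib import PurePosixPath


def visible_dirs(object_names, implicit_dirs=False):
    """Return the directories a listing would show under a simplified model.

    Without implicit_dirs, only directories with a placeholder object
    (a name ending in "/") are shown. With implicit_dirs, every parent
    prefix of any object also counts as a directory.
    """
    dirs = set()
    for name in object_names:
        path = PurePosixPath(name.rstrip("/"))
        if name.endswith("/"):
            dirs.add(str(path))  # explicit placeholder object
        if implicit_dirs:
            # Infer directories from the object's path; skip the bucket root (".").
            dirs.update(str(p) for p in path.parents if str(p) != ".")
    return dirs


objects = ["data/train/a.csv", "data/eval/", "top.txt"]
print(sorted(visible_dirs(objects)))
print(sorted(visible_dirs(objects, implicit_dirs=True)))
```

Objects uploaded with gsutil cp typically look like "data/train/a.csv" here: no placeholder exists for "data/train", so under the default setting that directory does not appear.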
And skypilot does not use the --implicit-dirs flag in file mount:
skypilot/sky/data/storage.py
Lines 1146 to 1150 in f4e785c
Question
Many of my dir structures in buckets are created passively with gsutil cp or in similar ways, so I found most of my data not visible in the current skypilot file mount. I wonder
--implicit-dirs flag in GCS related file mount?
mkdir in VM.