Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support filesystem abstraction with fsspec or epath. #8336

Open
asherbondy opened this issue Nov 7, 2023 · 9 comments
Open

Support filesystem abstraction with fsspec or epath. #8336

asherbondy opened this issue Nov 7, 2023 · 9 comments

Comments

@asherbondy
Copy link
Contributor

🛠 Proposed Refactor

I propose that we replace usage of core python I/O e.g., open, exists, is_file, makedirs, with equivalent filesystem abstraction that allows dispatch to appropriate filesystem backends. This would allow users of your library to easily work with remote filesystems as they work with local filesystems.

E.g.,

from torch_geometric.datasets import Planetoid
from torch_geometric.transforms import NormalizeFeatures
dataset = Planetoid(root='gs://data/Planetoid', name='Cora', transform=NormalizeFeatures())

Suggest a potential alternative/fix

Either of two popular libraries can be used, fsspec or epath could be used to replace the existing calls to built-in functions that assume local filesystem.

I am happy to make a PR for this if I know the preference of filesystem abstraction library and any other concerns.

@akihironitta
Copy link
Member

I think this will be a great feature to integrate in PyG. I'm not familiar with epath, but I've used fsspec in PyTorch Lightning and it works really nice. Do you have an idea on whether we should go with fsspec or other libraries by chance?

@asherbondy
Copy link
Contributor Author

My personal preference is epath, but it is because I enjoy the pathlib API. The fsspec library seems to be more popular based on stars and also would be more of a 1:1 change based on the current use of built-in FS usage we have in PyG at the moment. E.g.,

with open(path, 'rb') as f:
  ... f stuff ...

vs.

path.read_bytes()

@akihironitta
Copy link
Member

@rusty1s Any thoughts?

@rusty1s
Copy link
Member

rusty1s commented Nov 8, 2023

As far as I understand, fsspec would be easier to integrate right now, so I would vote for this one (but no strong opinion).

@rachitk
Copy link
Contributor

rachitk commented Nov 9, 2023

For what it's worth (as someone who would make heavy use of remote filesystems if this was implemented), I would be strongly in favor of fsspec, mostly because as far as I can tell, fsspec is what is currently used by base PyTorch and would minimize the number of dependencies needed for a base PyG install.

@akihironitta
Copy link
Member

@rachitk I didn't know that PyTorch 2.1.0 added fsspec as one of its dependencies to support different backends for distributed checkpoint.

@asherbondy I think fsspec sounds more reasonable. Would you be interested in contributing this feature?

@rachitk
Copy link
Contributor

rachitk commented Nov 10, 2023

@rachitk I didn't know that PyTorch 2.1.0 added fsspec as one of its dependencies to support different backends for distributed checkpoint.

I actually didn't realize this was a new addition in 2.1.0 myself - I thought it was a much older addition but I see now that it wasn't actually added as a dependency until March of this year.

@asherbondy
Copy link
Contributor Author

Thanks for the discussion! I was also leaning towards fsspec as that seems to be slightly more popular in PyTorch ecosystem and would be easier refactor as mentioned. Now that it seems there is consensus I'm happy to make a CL.

asherbondy pushed a commit to asherbondy/pytorch_geometric that referenced this issue Nov 14, 2023
**Changes made:**
- Add fs_utils.py for API that offers 1:1 replacement for current
filesystem usage.
- Update usages of open, glob, makedirs, normpath to use new API.
- Update test/datasets/* to use in-memory filesystem, simplifying test
and testing fsspec API.
@asherbondy
Copy link
Contributor Author

asherbondy commented Nov 14, 2023

@rusty1s & @akihironitta: I had no experience with fsspec before so I needed to wrap my head around it a bit and run into some practical cases. The PR I've uploaded now is enough details of the API that I'm thinking of and some refactored code and tests for that. Any test previously using the get_dataset test fixture will not be working anymore as it now expects the test to write into the in-memory filesystem. So please take a review of the API and changes to see if we can get maintainer approval to continue the refactor. #8379

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants