Hello from fsspec! #96
Comments
👋 @martindurant sounds cool! There are probably a few levels of integration that could work well. We intend to make it pretty easy to add custom clients/backends by implementing a relatively small surface area. We're adding more docs on this, but here's the Option (1): Option (2): Other thoughts on interoperability? There's definitely a big chunk of cool things in |
I would say those two options sound pretty similar, except for where the code lives - which I am not too worried about at all. You are probably more likely to get to it, if you have time to experiment - but seeing that this was already requested at fsspec, maybe someone else in the community contributes some pathlib interface anyway. Do you think there's an argument for S3FileSystem, GCSFileSystem and FTPFileSystem from fsspec-land to supplant the implementations you have or are developing? They may be more complete and reduce the amount of work you need to do. |
I took a quick stab at making
I think the tricky part that I don't have a great sense of is how to design the right interface to let users choose to use fsspec and to initialize the correct fsspec The question of moving our backend logic entirely to fsspec is a pretty big one and would involve a lot of changes. This is my first time using fsspec, so I still don't have a great feel for the fsspec design or ecosystem. |
Typical usage in fsspec allows access to the set of backend implementations using a protocol string and arguments, e.g., Many packages that use fsspec, and some internal logic too, use The current set of optional deps for fsspec is:
but we don't depend on anything just to import the package, and some of the backends like file and memory. Sounds like we came up with a similar set of solutions! Note that there are some implementations that are meant/able to be used with a further target filesystem, such as the local caches. I'd be happy to talk about how fsspec works and describe our design/ecosystem. |
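To make the "protocol string and arguments" pattern concrete, here is a minimal stdlib-only sketch of a backend registry. It is a toy illustration, not fsspec's code: `MemoryBackend`, `register`, `filesystem` and `split_url` are hypothetical names defined here. In fsspec itself the corresponding entry point is `fsspec.filesystem(protocol, **kwargs)`.

```python
from typing import Callable, Dict

# Hypothetical toy registry mapping a protocol string to a backend factory.
_REGISTRY: Dict[str, Callable[..., object]] = {}


class MemoryBackend:
    """In-memory stand-in for a filesystem backend (illustrative only)."""

    def __init__(self, **storage_options):
        # Backend-specific arguments are kept on the instance.
        self.storage_options = storage_options
        self._files = {}

    def write_bytes(self, path, data):
        self._files[path] = bytes(data)

    def read_bytes(self, path):
        return self._files[path]


def register(protocol, factory):
    """Associate a protocol string with a backend factory."""
    _REGISTRY[protocol] = factory


def filesystem(protocol, **storage_options):
    """Instantiate the backend registered for `protocol`."""
    try:
        factory = _REGISTRY[protocol]
    except KeyError:
        raise ValueError(f"unknown protocol: {protocol!r}") from None
    return factory(**storage_options)


def split_url(url):
    """Split 'proto://path' into (proto, path); bare paths default to 'file'."""
    protocol, sep, path = url.partition("://")
    if not sep:
        return "file", url
    return protocol, path


register("memory", MemoryBackend)
```

Because backends are looked up lazily by name, importing the registry does not require any backend's dependencies, which matches the point above about optional dependencies.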
@remi-braun Thanks for the vote for FYI, @jayqi did a good implementation of (Here's the issue just on streaming: #9 ) |
Arf sadly it is much more prosaic:
path = AnyPath(path)
# Load vector in cache if needed (geopandas and cloudpathlib are not compatible for now)
if isinstance(path, CloudPath):
    path = AnyPath(path.fspath)
vect = geopandas.read_file(path)
However I do not use files big enough to see any slowdown caching files in my use case (apart from |
Ah, interesting. That looks to me like a
path = AnyPath(path)
with path.open() as f:
    vect = geopandas.read_file(f)
Recent versions of
In [1]: import pandas as pd
In [2]: from cloudpathlib import CloudPath
In [3]: df = pd.read_csv(CloudPath('s3://drivendata-public-assets/odsc-west-2019/california-tracts.csv'))
In [4]: df.head()
Out[4]:
GEOID year name parent-location population ... eviction-rate eviction-filing-rate low-flag imputed subbed
0 6001400100 2000 4001.0 Alameda County, California 2497.87 ... 0.00 0.00 1 0 0
1 6001400100 2001 4001.0 Alameda County, California 2497.87 ... 0.00 0.86 1 0 0
2 6001400100 2002 4001.0 Alameda County, California 2497.87 ... 0.80 0.80 1 0 0
3 6001400100 2003 4001.0 Alameda County, California 2497.87 ... 1.49 1.49 1 0 0
4 6001400100 2004 4001.0 Alameda County, California 2497.87 ... 0.00 0.00 1 0 0
[5 rows x 27 columns] |
I think the point of the discussion is that we could use cloudpath to have fsspec support generic pathlike objects without having to ask third-party libraries such as geopandas to do extra work. |
Sadly it is not the case, and I don't know why... |
Thanks, @martindurant. How would you use (That said, I would encourage library maintainers to consider supporting (Also, would be curious about your thoughts on testing #109 to see if it is viable! The fact it doesn't take advantage of all of the hard work you've done on caching, streaming, etc. does make it seem less appealing.) |
fsspec can do this job implicitly; but yes, there are libraries which, typically because of some inner C code, can only work with real OS file handles on the local filesystem.
This is an exaggeration! :) On #109, I think it's totally the way forward. For configuring the relevant backends, you could rely on the fsspec config system, or maybe allow extra kwargs when instantiating a Path, which get passed to other downstream Paths. |
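The idea of passing extra kwargs at instantiation and handing them on to downstream paths could look roughly like the sketch below. `OptionPath` is a hypothetical class written for illustration, not cloudpathlib's or fsspec's API, and `anon=True` is only an example option.

```python
from pathlib import PurePosixPath


class OptionPath:
    """Hypothetical path wrapper whose backend options follow derived paths."""

    def __init__(self, url, **storage_options):
        protocol, sep, rest = url.partition("://")
        if not sep:
            raise ValueError(f"expected 'protocol://path', got {url!r}")
        self.protocol = protocol
        self._path = PurePosixPath(rest)
        self.storage_options = dict(storage_options)

    def _derive(self, new_path):
        # Build a sibling object that shares the original options dict.
        clone = type(self).__new__(type(self))
        clone.protocol = self.protocol
        clone._path = new_path
        clone.storage_options = self.storage_options
        return clone

    def __truediv__(self, name):
        return self._derive(self._path / name)

    @property
    def parent(self):
        return self._derive(self._path.parent)

    def __str__(self):
        return f"{self.protocol}://{self._path}"


# Usage: options given once at the root are visible on every derived path.
root = OptionPath("s3://bucket/data", anon=True)
child = root / "a" / "b.csv"
```

Sharing one options dict keeps configuration in a single place; a real implementation would probably share a client or filesystem instance instead.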
😆 certainly an exaggeration in terms of it actually getting implemented by libraries, but it is a PEP that was accepted for exactly this purpose! https://www.python.org/dev/peps/pep-0519/ |
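For readers unfamiliar with PEP 519, here is a minimal sketch of the `__fspath__` protocol it defines. `CachedRemotePath` and `FAKE_REMOTE` are hypothetical stand-ins invented for this example (a dict plays the role of remote storage). The point is only that any code calling `open()` or `os.fspath()` receives a local file path without knowing about the remote source.

```python
import os
import tempfile

# Hypothetical stand-in for remote object storage.
FAKE_REMOTE = {"bucket/hello.txt": b"hello from the cloud\n"}


class CachedRemotePath:
    """Illustrative path-like object that materializes a local copy on demand."""

    def __init__(self, key, cache_dir=None):
        self.key = key
        self._cache_dir = cache_dir or tempfile.mkdtemp(prefix="remote-cache-")

    def __fspath__(self):
        # Called by os.fspath() and open(); "download" once, then reuse the copy.
        local = os.path.join(self._cache_dir, self.key.replace("/", "_"))
        if not os.path.exists(local):
            with open(local, "wb") as f:
                f.write(FAKE_REMOTE[self.key])
        return local


def read_text(path):
    """Stands in for third-party code that only knows about local paths."""
    with open(path, "r", encoding="utf-8") as f:
        return f.read()


remote = CachedRemotePath("bucket/hello.txt")
text = read_text(remote)
```

The trade-off is that the whole file is copied locally before the consumer reads anything, so this approach gives compatibility rather than streaming.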
Ok, I'll do some digging to check out the mechanisms you use for this. When we visited this question, implementing |
I'm sure both are useful! fsspec wants to provide full filesystem classes with all the functionality that implies, and several high-level features that you might find interesting, but not necessary for your core offering. To have those and a nice pathlib interface would be awesome! (( For instance (I happen to be working on this today), ReferenceFileSystem lets you view blocks of bytes at arbitrary offsets within arbitrary URLs as files, so that a library like zarr (which uses file naming conventions to locate blocks of array data) can load data from third party HDF5, tiff and grib2 files whether or not the libraries made for those formats know how to handle remote files or not. )) By the way, in my experience, most third party libraries work with file-like objects and paths. If a path-like is passed, they do |
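One common way for a library to accept both file-like objects and paths can be sketched as follows. This is an assumption about a typical pattern, not the code of any specific library; `load_bytes` is a hypothetical function name.

```python
import io
import os
import tempfile
from pathlib import Path


def load_bytes(source):
    """Hypothetical library entry point accepting a file-like or a path-like."""
    if hasattr(source, "read"):
        # File-like object: read from it directly, no local file needed.
        return source.read()
    # Path-like (str, pathlib.Path, or anything with __fspath__);
    # os.fspath raises TypeError for unsupported objects.
    with open(os.fspath(source), "rb") as f:
        return f.read()


# Usage: the same call works for an in-memory buffer and an on-disk path.
from_buffer = load_bytes(io.BytesIO(b"abc"))

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "data.bin"
    target.write_bytes(b"xyz")
    from_path = load_bytes(target)
```

A library written this way works with remote data either through an opened file-like object or through any object implementing `__fspath__`.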
Hi there! I just wanted to mention that I would benefit from support for pathlib with fsspec: dask/dask#8006 Many thanks for any help! :) |
Hi! Ditto @asmith26 - I'd love to see more compatibility between fsspec and cloudpathlib's wonderful pathlib.Path interface. My use case is when using pandas. I mention this in #128. Related issues:
|
See also universal_pathlib, which I think is more recent than this discussion - although I am not in a position to compare the two projects. |
As per @grisaitis, I would also love those two approaches to mix! Other well-known projects supporting |
@remi-braun Are there specific issues that you're seeing? If there are bugs, it would be great to get them addressed. From an API perspective, both
from cloudpathlib import CloudPath
from matplotlib import pyplot as plt
import rasterio as rio
import rioxarray
cp = CloudPath("s3://drivendata-competition-biomassters-public-us/train_agbm/0003d2eb_agbm.tif")
# test rasterio directly
dataset = rio.open(cp)
plt.imshow(dataset.read(1), cmap='pink')
#> <matplotlib.image.AxesImage object at 0x1a2a333d0>
dataset.close()
#> None
# test rasterio + xarray
rds = rioxarray.open_rasterio(cp)
rds
#> <xarray.DataArray (band: 1, y: 256, x: 256)>
#> [65536 values with dtype=float32]
#> Coordinates:
#> * band (band) int64 1
#> * x (x) float64 0.5 1.5 2.5 3.5 4.5 ... 252.5 253.5 254.5 255.5
#> * y (y) float64 0.5 1.5 2.5 3.5 4.5 ... 252.5 253.5 254.5 255.5
#> spatial_ref int64 0
#> Attributes:
#> scale_factor: 1.0
#> add_offset: 0.0
I looked at the XArray docs, and there are a ton of complex IO options. It looks like a number of the writing methods won't work since they don't support file-like objects. In this case, passing the That said, there still seems to be some confusion here about what we can implement within First, Second, the general case for other libraries. Third, developers of other libraries shouldn't need any special case code for Another separate question is if we will develop and maintain an So with all of that said, there's nothing we can implement in
The other best practice to ensure everything works well is for consumers of One last word here: there may be real bugs on our side that prevent third party libraries from working with |
Hello, Just to answer, since this issue is closed. For single files like A little example:
vrt = AnyPath("https://s3.unistra.fr/merge_32-31.vrt")
rasterio.open(str(vrt)) # will work
rasterio.open(vrt) # will fail for the wrong reason, saying that the files linked inside the VRT don't exist |
Thanks @remi-braun, I opened a new issue to track that kind of scenario. |
I see that we have some goals in common. fsspec is a mature library in use by Dask, pandas, zarr and others. However, we don't have a good implementation of pathlib on our API. Would you like to join forces to implement pathlib-on-fsspec, using the code you have already developed?
xref: fsspec/filesystem_spec#434
another attempt to do this, explicitly for fsspec: https://github.com/datarevenue-berlin/drfs