API access #24
Some helpful thoughts from @fsamreen: A few thoughts… Option 1 - provide users with direct URLs to download/interact with the data stored in the S3 bucket. An important question here is 'Who are the users and how would they like to interact with the data?', followed by 'What is a sustainable, value-added solution without excessive cost, maintenance, and unnecessary implementation overheads?'. We might end up offering access through various methods, including direct access to S3 as well as APIs, or even through DataLabs.
Knowing the use cases and user stories listed here should help inform the API question...
Which option (1 or 2) is appropriate for each use case/user story? Some use cases/user stories are explored here.
How we choose to access the zarr data may affect the API design, as we want to wrap it up with whatever library we use. It's a bit of a complicated landscape at the moment, with lots of libraries that allow S3 access in slightly different ways and that integrate slightly differently with xarray. My understanding is that there are two main ways of doing this. The first is to use fsspec/s3fs (s3fs is part of the fsspec family) with the URL of the dataset on S3, as in the sketch below.
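A minimal sketch of the fsspec/s3fs route, following the pattern in the zarr tutorial; the bucket and dataset path are hypothetical placeholders, and `anon=True` assumes a publicly readable bucket:

```python
import s3fs
import xarray as xr

# Explicit s3fs route, as in the zarr tutorial (hypothetical bucket/path):
s3 = s3fs.S3FileSystem(anon=True)  # anonymous access to a public bucket
store = s3fs.S3Map(root="example-bucket/example-dataset.zarr", s3=s3, check=False)
ds = xr.open_zarr(store)

# Equivalent shorthand, letting xarray/fsspec resolve the s3:// URL directly:
ds = xr.open_zarr(
    "s3://example-bucket/example-dataset.zarr",
    storage_options={"anon": True},
)
print(ds)
```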
(from https://zarr.readthedocs.io/en/stable/tutorial.html#io-with-fsspec). The second is to use an intake or STAC catalogue to store information about the dataset that can be used by xarray to read in the dataset. Also relevant: https://gallery.pangeo.io/repos/earthcube2020/ec20_abernathey_etal/cloud_storage.html. Taking a closer look at these examples, it seems to me that all intake is doing is storing/finding the URLs and then passing them to xarray via s3fs anyway (see the sketch below). I'm sure it can do a lot more than this, but at least superficially these two methods look much the same under the hood when intake is involved. STAC isn't too dissimilar, but usually involves an API layer on top to read/parse the STAC catalogue JSON files (see the comment immediately below for more on STAC).
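A minimal intake sketch, assuming the intake-xarray plugin is installed; the catalogue URL and entry name are made up for illustration:

```python
import intake

# Hypothetical catalogue; a real entry would record the s3:// URL and
# storage options, which intake hands on to xarray/s3fs behind the scenes.
cat = intake.open_catalog("https://example.org/catalog.yaml")
ds = cat["example_dataset"].to_dask()  # lazily opens the zarr store as an xarray Dataset
print(ds)
```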
Some more about STAC catalogues. @dwest77a has developed DataPoint, a STAC API (an API that can read STAC-formatted JSON files), originally designed around the STAC catalogues that CEDA hold to describe their huge archive of data, specifically the data that has been "kerchunked" to allow for open access over the cloud. DataPoint also has the capability to "automatically ... open cloud datasets given the configuration information in the STAC records that are searched", which could be really useful for our project here, i.e. it can automatically load the data into xarray based on the information in the STAC catalogue. I feel like DataPoint is close to being the API we are looking for.
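I don't have DataPoint's exact interface to hand, so as a rough illustration of the general STAC pattern, here is a sketch using the generic pystac-client library; the endpoint, collection name, and asset key are all hypothetical:

```python
import xarray as xr
from pystac_client import Client

# Hypothetical STAC API endpoint and collection.
client = Client.open("https://example.org/stac/v1")
search = client.search(collections=["example-zarr-collection"], max_items=1)
item = next(search.items())

# Assume the item records its zarr store as an asset with key "data".
# DataPoint automates roughly this step from the STAC record configuration.
asset = item.assets["data"]
ds = xr.open_zarr(asset.href, storage_options={"anon": True})
print(ds)
```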
One of the things that came out of the work package meeting on 22-10-2024 is that API access to data stored on the object store is not explicitly included in the workflow diagram.
From my perspective, this arises because there is some confusion over what 'API access' means in this context. The plan is that the data on the object store will be available publicly from anywhere (local-to-the-user firewalls notwithstanding), and that code will be provided on the data catalogue page, and/or a link to an analysis platform with such code, allowing the user to treat the entire dataset essentially as if it were on their own local filesystem (sketched below).
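For illustration, the kind of copy-paste snippet such a catalogue page might carry; the URL, variable name, and coordinate are hypothetical. Because the dataset opens lazily, only the chunks touched by a selection are actually downloaded:

```python
import xarray as xr

# Hypothetical public dataset; opening is lazy, so no data is downloaded yet.
ds = xr.open_zarr(
    "s3://example-bucket/example-dataset.zarr",
    storage_options={"anon": True},
)

# Slice and reduce as if the data were local; only the chunks touched
# by this selection are fetched from the object store.
subset = ds["temperature"].sel(time="2020-01").mean(dim="time")
print(subset.compute())
```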
Is this an API? If not, is something extra needed to make it one? And do we actually need that something extra?
To me, these are the questions that need clarification.