
API access #24

Open
mattjbr123 opened this issue Oct 22, 2024 · 5 comments

mattjbr123 commented Oct 22, 2024

One of the things that came out of the work package meeting on 22-10-2024 is that API access to the data stored on the object store is not explicitly included in the workflow diagram.

From my perspective, this arises because there is some confusion over what 'API access' means in this context. The plan is for the data on the object store to be publicly available from anywhere (firewalls local to the user notwithstanding), and for code to be provided on the data catalogue page, and/or a link to an analysis platform with such code, that would let the user treat the entire dataset essentially as if it were on their own local filesystem.
Is this an API? If not, is something extra needed to make it one? And do we actually need that something extra?
These are the questions that, to me, need clarifying.
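To make the intended access pattern concrete, here is a minimal sketch of what the catalogue-page code could look like, assuming a public bucket and a hypothetical dataset path (the real URLs would come from the data catalogue):

import xarray as xr

# Hypothetical public dataset path; anonymous access, no credentials needed
ds = xr.open_zarr(
    "s3://example-bucket/example-dataset.zarr",
    storage_options={"anon": True},
)
print(ds)  # metadata only; chunks are fetched lazily on access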

@mattjbr123 commented:

Some helpful thoughts from @fsamreen:

A few thoughts …

Option 1 - provide users with direct URLs to download/interact with the data stored in the S3 bucket.
Pros - easier to set up, faster access, and likely quicker downloads.
Cons - direct access can expose sensitive data or bucket structures, and it is difficult to enforce controlled access, which might be needed for various reasons (security, data transfer costs, etc.).

Option 2 - build a REST API that acts as an intermediary: users interact with the API, which handles data requests, processes them, and retrieves the data from S3.
Pros - more control over who can access what and how (we could even give controlled access to some datasets through authentication), and integration with other services would be secure.
Cons - an additional development task to implement the API, and we would have to run an API server (management overhead).

An important question here is 'Who are the users and how would they like to interact with the data?', followed by 'What is a sustainable, value-added solution without excessive cost, maintenance and unnecessary implementation overheads?'. We might end up offering access through various methods, including direct access to S3 as well as APIs, or even through DataLabs.
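For illustration, a minimal sketch of the Option 2 intermediary, assuming a hypothetical FastAPI + boto3 stack and a made-up bucket name (none of this is decided):

import boto3
from fastapi import FastAPI, HTTPException
from fastapi.responses import StreamingResponse

app = FastAPI()
s3 = boto3.client("s3")
BUCKET = "example-fdri-bucket"  # hypothetical bucket name

@app.get("/data/{key:path}")
def get_object(key: str):
    # Because every request passes through the API, authentication,
    # logging and rate-limiting can be enforced here before touching S3.
    try:
        obj = s3.get_object(Bucket=BUCKET, Key=key)
    except s3.exceptions.NoSuchKey:
        raise HTTPException(status_code=404, detail="object not found")
    return StreamingResponse(obj["Body"].iter_chunks())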

@mattjbr123 commented:

Knowing the use cases and user stories listed here should help inform the API question...


mattjbr123 commented Nov 5, 2024

Which option (1 or 2) is appropriate for each use case/user story?

Some use cases/user stories are explored here.
The API is only necessary for the web GUI use case.
It is optional for the others (mainly script or notebook access). So long as it is mostly transparent to the user (e.g. by wrapping it up with FSSpec or Intake, or whichever library we use to sit between the object store and the user/xarray), it shouldn't make much difference; see the sketch after this comment.
Therefore, given we need an API for the web interface, maybe it makes sense to use it for the other use cases too, while ensuring it stays mostly transparent to the user so they can run their scripts with minimal changes.
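As a rough sketch of what 'transparent to the user' could mean in practice (both URLs are hypothetical; no API endpoint exists yet), the user-facing xarray call barely changes between direct S3 access and access via an API:

import xarray as xr

# Direct object-store access (hypothetical public bucket):
ds = xr.open_zarr("s3://example-bucket/ds.zarr", storage_options={"anon": True})

# Via a hypothetical REST API exposing the same zarr hierarchy over HTTP;
# fsspec's HTTP filesystem handles the requests under the hood:
ds = xr.open_zarr("https://api.example.org/data/ds.zarr")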

@dolegi self-assigned this Nov 14, 2024

mattjbr123 commented Nov 14, 2024

How we choose to access the zarr data may affect the API design, as we want to wrap it up with whatever library we use. It's a bit of a complicated landscape at the moment, with lots of libraries that allow S3 access in slightly different ways and that integrate slightly differently with xarray.

My understanding is that there are two main ways of doing this.

1. Use FSSpec/S3FS (S3FS implements the FSSpec interface for S3) with the URL of the dataset on S3:

import s3fs
import zarr

# Anonymous (unauthenticated) access to the public bucket
fs = s3fs.S3FileSystem(anon=True)
# Wrap the S3 path as a zarr store and open the root group
store = zarr.storage.FSStore('/zarr-demo/store', fs=fs)
g = zarr.open_group(store)

(from https://zarr.readthedocs.io/en/stable/tutorial.html#io-with-fsspec)

• Example 3 in UKCEH DataLabs (private to UKCEH; @dolegi, I will give you access)

2. Use an intake or STAC catalogue to store information about the dataset that can be used by xarray to read it in.
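A minimal intake sketch, assuming a hypothetical catalogue file that records the S3 URL of the zarr store; intake resolves the entry and hands it to xarray (via s3fs) under the hood:

import intake

# Hypothetical catalogue location and entry name
# (requires the intake-xarray driver for zarr sources)
cat = intake.open_catalog("https://example.org/catalog.yaml")
ds = cat["example_dataset"].to_dask()  # opens the zarr store as an xarray.Dataset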

Also relevant: https://gallery.pangeo.io/repos/earthcube2020/ec20_abernathey_etal/cloud_storage.html

and: https://colab.research.google.com/github/developmentseed/ingarss-workshop-2024/blob/main/book/docs/01_stac_and_zarr.ipynb

Taking a closer look at these examples, it seems to me that all intake is doing is storing/finding the URLs and then passing them to xarray via S3FS anyway. I'm sure it can do a lot more than this, but at least superficially the two methods look much the same under the hood once intake is involved.

STAC isn't too dissimilar, but usually involves an API layer on top to read/parse the STAC catalogue JSON file (see the comment immediately below for more on STAC).


mattjbr123 commented Dec 5, 2024

Some more about STAC catalogues.
Chatting with @dwest77a has opened my eyes to STAC a bit more.
My understanding now is that it is more or less a specifically formatted JSON file that describes/catalogues the datasets and the files within them. This can then be read by an "API" or (Python) package that knows how to deal with this specific JSON format. It doesn't actually sound too dissimilar to intake. IIRC some thought still needs to go into how xarray then picks up the actual data, or references to it, from the catalogue, but some of the examples (1, 2, 3, 4) in the previous comment should hopefully answer that.
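For a flavour of the STAC route, a sketch using pystac-client against a hypothetical STAC API (the collection and asset names are made up):

import xarray as xr
from pystac_client import Client

client = Client.open("https://stac.example.org")  # hypothetical STAC API
search = client.search(collections=["example-zarr-collection"], max_items=1)
item = next(search.items())
# The asset href points at the zarr store; xarray opens it as before
ds = xr.open_zarr(item.assets["zarr-store"].href, storage_options={"anon": True})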

@dwest77a has developed DataPoint, a STAC API (an API that can read STAC-formatted JSON files), originally designed around the STAC catalogues that CEDA hold describing their huge archive of data, specifically the data that has been "kerchunked" to allow open access over the cloud. DataPoint also has the capability to "automatically ... open cloud datasets given the configuration information in the STAC records that are searched", which could be really useful for our project here, i.e. it can automatically load the data into xarray based on what is in the STAC catalogue. I feel like DataPoint is close to being the API we are looking for.
@dolegi could you take a look too and see what you think?
