
API access #24

Open
mattjbr123 opened this issue Oct 22, 2024 · 5 comments

mattjbr123 commented Oct 22, 2024

One of the things that came out of the work package meeting on 22-10-2024 is that API access to the data stored on the object store is not explicitly included in the workflow diagram.

From my perspective, this arises because there is some confusion over what 'API access' means in this context. The plan is for the data on the object store to be publicly available from anywhere (firewalls local to the user notwithstanding), and for code to be provided on the data catalogue page, and/or a link to an analysis platform with such code, that would let the user treat the entire dataset essentially as if it were on their own local filesystem.
Is this an API? If not, is something extra needed to make it one? And do we actually need that something extra?
These are the questions that, to me, need clarifying.
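To make the intended access pattern concrete, here is a minimal sketch of what the catalogue-page code could look like, assuming a public bucket and a hypothetical dataset path (the real URLs would come from the data catalogue):

import xarray as xr

# Hypothetical public dataset path; anonymous access, no credentials needed
ds = xr.open_zarr(
    "s3://example-bucket/example-dataset.zarr",
    storage_options={"anon": True},
)
print(ds)  # metadata only; chunks are fetched lazily on access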

@mattjbr123 commented:

Some helpful thoughts from @fsamreen:

A few thoughts …

Option 1 - provide users with direct URLs to download/interact with the data stored in the S3 bucket.
Pros - easier to set up, faster access, and likely quicker downloads.
Cons - direct access can expose sensitive data or bucket structures, and it is difficult to enforce controlled access, which might be needed for various reasons (security, data transfer costs, etc.).

Option 2 - build a REST API that acts as an intermediary: users interact with the API, which handles data requests, processes them, and retrieves the data from S3.
Pros - more control over who can access what and how (we could even give controlled access to some datasets through authentication), and integration with other services would be secure.
Cons - an additional development task to implement the API, and we would have to run an API server (management overhead).

An important question here is 'Who are the users and how would they like to interact with the data?', followed by 'What is a sustainable, value-added solution without excessive cost, maintenance and unnecessary implementation overheads?'. We might end up offering access through various methods, including direct access to S3 as well as APIs, or even through DataLabs.
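For illustration, a minimal sketch of the Option 2 intermediary, assuming a hypothetical FastAPI + boto3 stack and a made-up bucket name (none of this is decided):

import boto3
from fastapi import FastAPI, HTTPException
from fastapi.responses import StreamingResponse

app = FastAPI()
s3 = boto3.client("s3")
BUCKET = "example-fdri-bucket"  # hypothetical bucket name

@app.get("/data/{key:path}")
def get_object(key: str):
    # Because every request passes through the API, authentication,
    # logging and rate-limiting can be enforced here before touching S3.
    try:
        obj = s3.get_object(Bucket=BUCKET, Key=key)
    except s3.exceptions.NoSuchKey:
        raise HTTPException(status_code=404, detail="object not found")
    return StreamingResponse(obj["Body"].iter_chunks())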

@mattjbr123 commented:

Knowing the use cases and user stories listed here should help inform the API question...


mattjbr123 commented Nov 5, 2024

Which option (1 or 2) is appropriate for each use case/user story?

Some use cases/user stories are explored here.
The API is only necessary for the web GUI use case.
It is optional for the others (mainly script or notebook access). So long as it is mostly transparent to the user (e.g. by wrapping it up with FSSpec or Intake, or whichever library we use to sit between the object store and the user/xarray), it shouldn't make much difference; see the sketch after this comment.
Therefore, given we need an API for the web interface, maybe it makes sense to use it for the other use cases too, while ensuring it stays mostly transparent to the user so they can run their scripts with minimal changes.
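As a rough sketch of what 'transparent to the user' could mean in practice (both URLs are hypothetical; no API endpoint exists yet), the user-facing xarray call barely changes between direct S3 access and access via an API:

import xarray as xr

# Direct object-store access (hypothetical public bucket):
ds = xr.open_zarr("s3://example-bucket/ds.zarr", storage_options={"anon": True})

# Via a hypothetical REST API exposing the same zarr hierarchy over HTTP;
# fsspec's HTTP filesystem handles the requests under the hood:
ds = xr.open_zarr("https://api.example.org/data/ds.zarr")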

@dolegi self-assigned this Nov 14, 2024

mattjbr123 commented Nov 14, 2024

How we choose to access the zarr data may affect the API design, as we want to wrap it up with whatever library we use. It's a bit of a complicated landscape at the moment, with lots of libraries that allow S3 access in slightly different ways and that integrate slightly differently with xarray.

My understanding is that there are two main ways of doing this.

1. Use FSSpec/S3FS (S3FS implements the FSSpec interface for S3) with the URL of the dataset on S3:

import s3fs
import zarr

# Anonymous (unauthenticated) access to the public bucket
fs = s3fs.S3FileSystem(anon=True)
# Wrap the S3 path as a zarr store and open the root group
store = zarr.storage.FSStore('/zarr-demo/store', fs=fs)
g = zarr.open_group(store)

(from https://zarr.readthedocs.io/en/stable/tutorial.html#io-with-fsspec)

• Example 3 in UKCEH DataLabs (private to UKCEH; @dolegi, I will give you access)

2. Use an intake or STAC catalogue to store information about the dataset that can be used by xarray to read it in.
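A minimal intake sketch, assuming a hypothetical catalogue file that records the S3 URL of the zarr store; intake resolves the entry and hands it to xarray (via s3fs) under the hood:

import intake

# Hypothetical catalogue location and entry name
# (requires the intake-xarray driver for zarr sources)
cat = intake.open_catalog("https://example.org/catalog.yaml")
ds = cat["example_dataset"].to_dask()  # opens the zarr store as an xarray.Dataset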

Also relevant: https://gallery.pangeo.io/repos/earthcube2020/ec20_abernathey_etal/cloud_storage.html

and: https://colab.research.google.com/github/developmentseed/ingarss-workshop-2024/blob/main/book/docs/01_stac_and_zarr.ipynb

Taking a closer look at these examples, it seems to me that all intake is doing is storing/finding the URLs and then passing them to xarray via S3FS anyway. I'm sure it can do a lot more than this, but at least superficially the two methods look much the same under the hood once intake is involved.

STAC isn't too dissimilar, but usually involves an API layer on top to read/parse the STAC catalogue JSON file (see the comment immediately below for more on STAC).


mattjbr123 commented Dec 5, 2024

Some more about STAC catalogues.
Chatting with @dwest77a has opened my eyes to STAC a bit more.
My understanding now is that it is more or less a specifically formatted JSON file that describes/catalogues the datasets and the files within them. This can then be read by an "API" or (Python) package that knows how to deal with this specific JSON format. It doesn't actually sound too dissimilar to intake. IIRC some thought still needs to go into how xarray then picks up the actual data, or references to it, from the catalogue, but some of the examples (1, 2, 3, 4) in the previous comment should hopefully answer that.
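For a flavour of the STAC route, a sketch using pystac-client against a hypothetical STAC API (the collection and asset names are made up):

import xarray as xr
from pystac_client import Client

client = Client.open("https://stac.example.org")  # hypothetical STAC API
search = client.search(collections=["example-zarr-collection"], max_items=1)
item = next(search.items())
# The asset href points at the zarr store; xarray opens it as before
ds = xr.open_zarr(item.assets["zarr-store"].href, storage_options={"anon": True})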

@dwest77a has developed DataPoint, a STAC API (an API that can read STAC-formatted JSON files), originally designed around the STAC catalogues that CEDA hold describing their huge archive of data, specifically the data that has been "kerchunked" to allow open access over the cloud. DataPoint also has the capability to "automatically ... open cloud datasets given the configuration information in the STAC records that are searched", which could be really useful for our project here, i.e. it can automatically load the data into xarray based on what is in the STAC catalogue. I feel like DataPoint is close to being the API we are looking for.
@dolegi could you take a look too and see what you think?
