Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Requirements for data loaders for scivision #511

Draft
wants to merge 3 commits into
base: main
Choose a base branch
from
Draft
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
95 changes: 95 additions & 0 deletions docs/scip/0004-data-loading.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,95 @@
# Template for new SCIPs

## Metadata

Editor:
...

Status (raw | draft | stable | deprecated | retired):
raw

## Description

...

## Requirements

### What is included in the catalog entry for a datasource?

- A URL to a remote location (as given below)
- The URL should be a browsable location, structured according to one of the supported 'data-sharing patterns' (see below)
-

- An indication of additional scivision plugins required to load the data, if not (?)

### Image formats
ots22 marked this conversation as resolved.
Show resolved Hide resolved

- Built-in support for any common format (via a library, such as skimage)
- Built-in support for formats common across scientific domains, not included in the above
- Whether to support a given format should be considered against the cost of the additional dependencies it requires, and the burden of these (e.g. something that makes core scivision less portable, or adds an extra installation step might be rejected, but a single python-only dependency considered acceptable)

- A 'plugin' system for extending to additional formats
### Loaded data formats

ots22 marked this conversation as resolved.
Show resolved Hide resolved
- Lazy loading with `dask`/`xarray`
- Simpler format such as `numpy` for smaller datasets when lazy load/ parallel computing not required
#### Additional image formats

Below is a list of additional image formats to consider for built-in support

-
-


### Supported data services

#### Notes

- 'Core' scivision (without additional) should maintain support for several remote data is commonly archived.
ots22 marked this conversation as resolved.
Show resolved Hide resolved

- Often locations are specified using URLs with a http/https scheme, but e.g. directory browsing is not supported by http, which limits the generality or usefulness of this approach.

- One possibility that is supported by plain http is a direct link to 'archive' file system (e.g. a zip file containing one of the patterns below).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This way does seem like it would be easy for a data provider to comply with

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is at least partly supported already, but it seemed worth being explicit about the issue, and that a zip file (or other archive) is available as a single-file option. One motivation for having special handling of other data hosting services is so data providers have a better chance of including their data as-is, but with this option as a fallback, potentially.


- Examples consisting of a single image are supported for the same reason, but might not be particularly interesting

- A single file containing some metadata for Intake or a scivision plugin is another possibility

#### Particular services to support

- Automatic support for single image files and archives
- The URL of an Intake catalog
- The URL of some data-plugin metadata
- Zenodo
- GitHub
- Support layers which might be useful for labelling and inspection of model outputs e.g. Points, Shapes, Surface, Tracks and Vectors

#### Pull requests accepted

- Improve, updating, maintaining the existing supported services (e.g. fixing the library to work after an API change

- Adding support for other common remote locations (a test might be: are there two or more independent data sources in the catalog that)


### Native support for common data sharing patterns

#### Directory of image files

#### Image + csv labels

#### An Intake yaml file catalog

#### A yaml file, with metadata for a custom data plugin


## High-level software design

For Scivision.Py

- Consider using fsspec for handling remote locations (get archive support, variety of URL schemes)

- Abstract base class for a `DataService`

## Remaining questions

...