Simplified data registry and dataset interface #149
Conversation
@vidartf @telamonian I am also contemplating these additional tasks:
I have tried to capture the original use cases along with these ideas in user stories added in the next section. cc @ellisonbg Data Registry User Stories
I can spend some time playing around with this next week (family vacation this week). At a glance:
permanently splitting datasets into train/test ahead of time is bad practice
@aiqc
For path/location, one option is to use the "id" property. The interface so far is flexible enough to allow any JSON-based schema in the "metadata" and "value" properties. The expectation is that implementors will use these to come up with a variety of schemas, and we can formalize some of them. For example, we should define a strict schema for a dataset with a tab-separated file stored in S3, with column names, dtypes, etc. I would define this schema like this:

```typescript
interface S3 {
  bucket: string;
  object: string;
}

interface DataColumn {
  name: string;
  dtype: string;
}

interface ITabSeparated {
  lineDelimiter: string;
  colDelimiter: string;
  columns: DataColumn[];
}

interface IS3TsvMetadata {
  storage: S3;
  serialization: ITabSeparated;
}

// Use the above schema to register a dataset
registry.register<JSONValue, IS3TsvMetadata>({
  id: "s3://datasets/covid19-dataset",
  abstractDataType: "tabular",
  storageType: "s3",
  serializationType: "tsv",
  value: null,
  metadata: {
    storage: {
      bucket: "datasets",
      object: "covid19-dataset"
    },
    serialization: {
      lineDelimiter: "\n",
      colDelimiter: "\t",
      columns: [
        { name: "country", dtype: "string" },
        { name: "state", dtype: "string" },
        { name: "cases", dtype: "number" },
        { name: "reported", dtype: "datetime" }
      ]
    }
  }
});
```
"inmemory" here just signifies a non-remote dataset; dataframes created from pandas, for example, could be declared "inmemory". This definition doesn't inherently do anything to support specific formats/libraries; it is merely a way to specify what the storage type is. It is entirely up to the implementor to handle specific storage formats. However, I am looking into how to allow users to register datasets directly from a cell when they create a pandas dataframe; we can discuss further whether pickle and dill should be supported in this context.
Can you elaborate on this?
Agree, the example I had was just a representation of a dataset with multiple folders/files.
@3coins as discussed, I have made some edits (highlighted) in Data Registry User Stories
This project is still a work in progress. That said, I think we can explore how this applies to the datasets you mentioned here, and identify any gaps. One key point to consider is that DataRegistry is not intended to serve as a catalog, but rather to act as a central artifact for managing "My Datasets" (datasets I am working with) within a JupyterLab instance.
Resolves #145, #146, and #147.
Rekindling the data registry project with simplified DataRegistry and Dataset interfaces. The package has been rewritten from scratch using the latest cookiecutter template, with all the setup for unit and integration tests along with GitHub Actions to aid automated build checks, changelog, and release workflows.
Data Registry provides three main components: the Dataset interface, the Data Registry API, and the My Datasets UI.
Data Registry provides both a TypeScript and a Python API to register new datasets. The TypeScript interface allows data providers to register new datasets via plugins; it also allows extension writers to create commands and associate them with a specific dataset type. The Python API provides an additional way to register datasets inside notebooks; it also enables registering dataset definitions stored in files with a “dataset” extension. Dataset providers can share dataset files, or notebooks containing dataset definitions, with JupyterLab users.
Dataset Interface
A typed, JSON-based, extensible interface that can be used to represent any kind of dataset. The Dataset interface expects two type parameters: one for the value and a second for the metadata. Having these as typed values allows creating any kind of dataset. In addition, the abstract data type, storage type, and serialization type attributes are declared as strings, which gives data providers the flexibility to define datasets spanning a vast range of mime types, with extensibility to support any future mime types.
Id
A string that represents the unique identifier for the dataset; it is expected to be unique across a JupyterLab instance.
Abstract Data Type
A string that captures the abstract data type for the dataset, and is largely defined by the dataset provider. This property represents a very high level abstraction of the data type which might not conform to a specific mime type but rather provide a more general view of the dataset. Most datasets with features might fall under “tabular” because they will have a set of labels/columns with multiple rows of values. Some other examples are “image” to represent any single image, “image-collection” to represent a set of images, and “text” to represent free-form or structured text data.
Serialization Type
A string that captures information about the serialization format of the data. This property represents the specific subtype which can be used to serialize or visualize the data. For example, tabular datasets can be represented in “csv” or “tsv”, images might be “jpeg”, “png” etc. Some other examples are “text”, “json”, “svg”, “sql”.
Storage Type
A string that defines how the data is stored, e.g., S3, database, in memory etc.
Value
A nullable value that defines the type of the actual data in the dataset, e.g., for an in-memory comma-separated file, this might be the actual string value of the data.
Metadata
Defines the type for capturing any metadata associated with the dataset that might help extension writers or JupyterLab users download/serialize the data, e.g., for a tabular dataset, this might capture the column and line delimiters; for a dataset stored in S3, this might capture the credentials or bucket and object information.
Title
A string that is used largely for display purposes to identify the dataset.
Description
A string that captures more detail about the dataset, so extension developers and users have more context about what kind of data the dataset contains.
Tags
A list of arbitrary strings that can be attached to the dataset to aid searching and identification of similar datasets.
Version
A string that defines the version of the dataset.
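Pulling the fields above together, the interface might be sketched as follows. This is a hypothetical sketch assembled from the descriptions in this section, not the package's actual declaration; JSONValue is stubbed in inline (in the real package it would come from Lumino) so the sketch is self-contained:

```typescript
// Minimal stand-in for Lumino's JSONValue so the sketch runs standalone.
type JSONValue =
  | null | boolean | number | string
  | JSONValue[] | { [key: string]: JSONValue };

// Sketch of the Dataset interface assembled from the field descriptions
// above (field names are assumptions; the real interface may differ).
interface Dataset<T extends JSONValue, U extends JSONValue> {
  id: string;                 // unique within a JupyterLab instance
  abstractDataType: string;   // e.g. "tabular", "image", "text"
  serializationType: string;  // e.g. "csv", "tsv", "png"
  storageType: string;        // e.g. "s3", "inmemory", "database"
  value: T | null;            // the data itself, or null for remote data
  metadata: U;                // provider-defined metadata schema
  title?: string;
  description?: string;
  tags?: string[];
  version?: string;
}

// Example: a tiny in-memory CSV dataset typed against the sketch.
const csvDataset: Dataset<string, { delimiter: string }> = {
  id: "inmemory://example/iris",
  abstractDataType: "tabular",
  serializationType: "csv",
  storageType: "inmemory",
  value: "species,petal_length\nsetosa,1.4",
  metadata: { delimiter: "," },
};
```

The two generic parameters are what keep the interface open: providers choose concrete `T`/`U` types per dataset kind without the registry needing to know about them.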
Here are a few examples of real-world datasets expressed using the dataset interface:
In memory dataset in CSV format
Directory of images stored in S3
Dataset stored in a SQL database
Note: The above examples specify a “dataset” vs. a “datasource”. However, the dataset interface could easily be applied to datasources as well. For example, a dataset definition representing a collection of tables might use an abstractDataType of “sql-tabular”, with the relevant connection/serialization details in the metadata property.
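As an illustration, the “directory of images stored in S3” example might look like the sketch below. The metadata shape and the inline registry stub are hypothetical; in the real extension, `register` comes from the data registry itself:

```typescript
// Hypothetical metadata schema for a folder of images in S3.
interface S3Directory {
  bucket: string;
  prefix: string;
}

interface IS3ImageCollectionMetadata {
  storage: S3Directory;
  imageFormat: string; // e.g. "jpeg"
}

interface DatasetRecord {
  id: string;
  abstractDataType: string;
  serializationType: string;
  storageType: string;
  value: null;
  metadata: IS3ImageCollectionMetadata;
}

// Minimal in-memory stub so the example runs standalone.
const datasets = new Map<string, DatasetRecord>();
const registry = {
  register: (d: DatasetRecord): void => { datasets.set(d.id, d); },
};

registry.register({
  id: "s3://datasets/cat-photos",
  abstractDataType: "image-collection",
  serializationType: "jpeg",
  storageType: "s3",
  value: null, // data stays in S3; nothing is held in memory
  metadata: {
    storage: { bucket: "datasets", prefix: "cat-photos/" },
    imageFormat: "jpeg",
  },
});
```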
Data Registry API
Data registry aims to catalog datasets in a JupyterLab environment and provides APIs to register, update, and retrieve datasets. These are the core APIs that will be shared with other extensions to help them manage datasets. Data registry provides these APIs:
Registering new datasets
Use the TypeScript API
Plugins/extensions can access the data registry object by adding a dependency on the IDataRegistry interface. Plugins can add new datasets by using the “register” API from Data Registry. Here is an example of a plugin that registers a dataset.
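A minimal sketch of such a plugin, with stand-in types defined inline so it runs without JupyterLab installed. In a real extension the plugin type comes from @jupyterlab/application and the registry is injected via the IDataRegistry token; the names here are illustrative:

```typescript
// Stand-ins for the real JupyterLab / data registry types (hypothetical).
interface IDataset {
  id: string;
  abstractDataType: string;
  serializationType: string;
  storageType: string;
  value: string | null;
  metadata: { [key: string]: string };
}

interface IDataRegistry {
  register(dataset: IDataset): void;
}

// A plugin that registers a sample dataset on activation.
const plugin = {
  id: "my-extension:register-sample-dataset",
  autoStart: true,
  // requires: [IDataRegistry],  // token-based injection in a real plugin
  activate: (registry: IDataRegistry): void => {
    registry.register({
      id: "inmemory://samples/hello",
      abstractDataType: "text",
      serializationType: "text",
      storageType: "inmemory",
      value: "hello, data registry",
      metadata: {},
    });
  },
};

// Simulate activation with a stub registry.
const registered: IDataset[] = [];
plugin.activate({ register: (d) => { registered.push(d); } });
```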
Use Notebook (Python API)
In addition to the TypeScript API, a Python Dataset class is provided to allow lab users to register datasets within a notebook. Creating a new instance of the Dataset class within a notebook cell and executing the cell will invoke the “register” API. Here is an example of using the Python API to register a new dataset.
Use a dataset file
Another way for users to register datasets is to create a file with a “dataset” extension containing the dataset definition in JSON. The file may contain either a single dataset definition object or an array of definitions to register multiple datasets. Opening the file will register all datasets defined inside it. Here is an example of datasets defined in a dataset file.
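A dataset file might look like the sketch below, here using the array form to show that multiple definitions can be registered at once. The field values are illustrative and follow the registration example earlier in this PR:

```json
[
  {
    "id": "inmemory://samples/greetings",
    "abstractDataType": "tabular",
    "serializationType": "csv",
    "storageType": "inmemory",
    "value": "lang,greeting\nen,hello\nfr,bonjour",
    "metadata": { "lineDelimiter": "\n", "colDelimiter": "," },
    "title": "Greetings"
  },
  {
    "id": "s3://datasets/covid19-dataset",
    "abstractDataType": "tabular",
    "serializationType": "tsv",
    "storageType": "s3",
    "value": null,
    "metadata": {
      "storage": { "bucket": "datasets", "object": "covid19-dataset" }
    }
  }
]
```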
Attach actions to datasets (Command Registry API)
Data registry provides APIs to register commands to specific dataset types, and retrieve a set of commands registered to those dataset types. This is useful for populating a variety of action-based widgets that perform an action associated with loading, visualizing or managing specific datasets.
Data providers can register commands with a specific dataset by using the “registerCommand” API.
Extension writers can use the “getCommands” API to get all commands registered for specific abstract data, serialization, and storage types. They can use this list of commands to bind actions to specific datasets. Specific datasets can be queried by using the “queryDataset” API. Here is an example of an extension that adds a panel in the lab launcher for each dataset registered with a “render-csv” command.
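A rough sketch of how this pairing could work, modeling registerCommand and getCommands as an in-memory index keyed on the three type strings. The real API shape may differ; this just shows the lookup contract the text describes:

```typescript
// Index of command ids keyed by "abstractDataType:serializationType:storageType".
const commandIndex = new Map<string, string[]>();

function key(adt: string, ser: string, sto: string): string {
  return `${adt}:${ser}:${sto}`;
}

// A data provider attaches a command to a dataset type combination.
function registerCommand(
  commandId: string,
  abstractDataType: string,
  serializationType: string,
  storageType: string
): void {
  const k = key(abstractDataType, serializationType, storageType);
  commandIndex.set(k, [...(commandIndex.get(k) ?? []), commandId]);
}

// An extension retrieves every command registered for that combination.
function getCommands(
  abstractDataType: string,
  serializationType: string,
  storageType: string
): string[] {
  return commandIndex.get(key(abstractDataType, serializationType, storageType)) ?? [];
}

// Provider side: register a render command for in-memory CSV datasets...
registerCommand("render-csv", "tabular", "csv", "inmemory");

// ...extension side: look it up later to bind a launcher action.
const commands = getCommands("tabular", "csv", "inmemory");
```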
My Datasets UI
The data registry extension adds a dataset explorer panel to the JupyterLab UI that allows users to view and interact with all registered datasets within a single JupyterLab instance. The extension tracks dataset additions and updates via the data registry within a JupyterLab session.
This widget also provides a context menu that allows execution of registered commands/actions associated with a dataset.
The “Add Data” button allows users to register a new dataset by auto-populating a dataset creation template in a notebook cell; the user can edit and customize this to register a new dataset. This feature uses the Python API for registering a new dataset. Executing the notebook cell will register the dataset defined inside the cell.
Open Questions
Types for abstract data, serialization, and storage
The properties Abstract Data Type, Serialization Type, and Storage Type in the Data Registry API are all “string” types at the moment, so there is no schema that can be enforced on these values. There are several options to control these values. Here are some proposed solutions:
Record in documentation
Document and maintain all types in a repository as part of documentation.
PROS
CONS
Use Enums
Codify types in the data registry code as TypeScript enum values so these values are strongly typed.
PROS
CONS
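For reference, the enum option might look like the sketch below (values illustrative, not exhaustive). A typo like “tabluar” becomes a compile-time error, but adding a new type would require a data registry release:

```typescript
// Hypothetical string enums codifying the known type values.
enum AbstractDataType {
  Tabular = "tabular",
  Image = "image",
  ImageCollection = "image-collection",
  Text = "text",
}

enum SerializationType {
  Csv = "csv",
  Tsv = "tsv",
  Json = "json",
  Png = "png",
}

enum StorageType {
  InMemory = "inmemory",
  S3 = "s3",
  Database = "database",
}

// A dataset field typed against the enum rejects unknown values at
// compile time, while still serializing to the same strings as today.
const adt: AbstractDataType = AbstractDataType.Tabular;
const sto: StorageType = StorageType.S3;
```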
Build an API
Provide an API to allow data providers and extension writers to add new type values.
PROS
CONS
Types (Generics) for value and metadata in Dataset
The Dataset interface allows JSON-typed values for the “value” and “metadata” properties. This is currently open to any interface/type, which introduces the chance of duplicated and redundant types. There is some benefit in standardizing these values for common dataset types. Here are a few approaches to tackle this.
Record interfaces in documentation
Add documentation for these interfaces for different dataset types.
PROS
CONS
Define typescript interfaces
Add interfaces, and allow only these interfaces to be used in data registry.
PROS
CONS
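To illustrate this option: if the registry shipped a documented ICsvMetadata interface, independent providers could pin their datasets to it through the metadata type parameter instead of each inventing a near-identical shape. All names here are hypothetical:

```typescript
// Hypothetical shared metadata interface for CSV-like tabular datasets.
interface ICsvMetadata {
  lineDelimiter: string;
  colDelimiter: string;
  columns?: { name: string; dtype: string }[];
}

// Simplified generic Dataset shape; the U parameter pins the metadata schema.
interface Dataset<T, U> {
  id: string;
  value: T | null;
  metadata: U;
}

// Two providers, one schema: the compiler rejects any metadata that
// drifts from ICsvMetadata (e.g. a misspelled "colDelimeter" field).
const providerA: Dataset<string, ICsvMetadata> = {
  id: "inmemory://a/sales",
  value: "region,total\neast,10",
  metadata: { lineDelimiter: "\n", colDelimiter: "," },
};

const providerB: Dataset<string, ICsvMetadata> = {
  id: "inmemory://b/weather",
  value: "city\ttemp\noslo\t-3",
  metadata: { lineDelimiter: "\n", colDelimiter: "\t" },
};
```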
Hybrid approach
Define interfaces for certain dataset types, but also allow others to be documented and added via the registry API.
PROS
CONS