
Bulk catalog access option: access to all datasets in a single file #7

Open
rufuspollock opened this issue Nov 22, 2012 · 5 comments

Comments

@rufuspollock
Member

This proposes a substantive change to the DCIP spec. Key features:

  • Provision of all datasets in a single file
  • Format would be a simple list of datasets, each serialized as in DCIP
    • with a default of JSON but options for N3 etc.
  • Location likely specified by a meta field in the head, as per the API location

This option could be provided both in addition to and as substitute for the full API option.
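As a rough illustration of the proposal, here is a minimal sketch of producing such a bulk catalog file: a plain JSON list in which each entry is a dataset serialized as in DCIP. The field names (`id`, `title`, `modified`) and file name are illustrative assumptions, not anything the spec defines.

```python
import json

# Hypothetical catalog: each entry stands in for a dataset serialized as in DCIP.
datasets = [
    {"id": "gold-prices", "title": "Gold Prices", "modified": "2012-06-01"},
    {"id": "gdp", "title": "GDP per Country", "modified": "2012-11-01"},
]

def write_bulk_catalog(datasets, path):
    """Serialize the whole catalog to a single JSON file."""
    with open(path, "w") as f:
        json.dump(datasets, f, indent=2)

write_bulk_catalog(datasets, "catalog.json")
```

A consumer would then fetch this one file instead of walking the API dataset by dataset.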

Benefits:

  • Catalog operators:
    • simpler and easier to do than a full API. Very easy to get started.
  • Consumers:
    • All datasets in one go - no need to walk through the API

Possible problems:

  • Catalog operators:
    • For larger catalogs the file is very large: inefficient for creation, storage, and transmission.
  • Consumers:
    • File could be large if the catalog is large.
    • Must re-download the whole file even if only one dataset has changed.

To Discuss

  • Sign-posting to this file
  • Relationship to REST option (is this entirely orthogonal?)
@willpugh

I think it would actually be nice if there could be two levels of compliance, where the single file download looked like a valid REST endpoint, but with reduced functionality. One level of compliance would be built with Catalogs in mind, and one built mainly with Data Sources in mind. The bulk catalog operations you mention would mainly be targeted at data sources or very small catalogs.

So it could be structured so that the dataset endpoint returns the JSON of the catalog up to the first 1000 records, with full results for each record (so a sync would not require getting the list and then doing a full round trip for each dataset).

Then, any catalog with fewer than 1000 datasets would be able to provide the simpler access. Any catalog providing many datasets would be required to meet a higher compliance level that provides paging, query by change date, etc.
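The two-tier consumer logic above can be sketched as follows. The endpoint shape, the `offset`/`limit` parameters, and the 1000-record threshold are all illustrative assumptions; `fetch_page` stands in for an HTTP GET against the catalog's dataset listing.

```python
PAGE_SIZE = 1000  # hypothetical threshold from the discussion above

def sync_catalog(fetch_page):
    """Pull every dataset record, paging only when the catalog is large."""
    first = fetch_page(offset=0, limit=PAGE_SIZE)
    records = list(first["results"])
    total = first["total"]
    # Small catalogs: the first response already contains everything,
    # so the loop below never runs and no further round trips are needed.
    offset = PAGE_SIZE
    while offset < total:
        page = fetch_page(offset=offset, limit=PAGE_SIZE)
        records.extend(page["results"])
        offset += PAGE_SIZE
    return records
```

A catalog under the threshold is synced in one request; a larger one falls through to paging, matching the two compliance levels suggested above.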

@tgherzog

My initial reaction is that from a consumer standpoint the important thing is to have a consistent protocol across all catalogs that implement DCIP, regardless of the size of the catalog.

One approach might be to standardize the variable names used in the responses from the List Dataset API and the Dataset API.

  • Both APIs currently implement the "id" field consistently
  • The "revision" field in the List Data API is named "version" in the Dataset API
  • The "modified" field in the List Data API looks like it's named "metadata_modified" in the Dataset API (are the meanings consistent?)
  • The "change_type" and "url" fields from the List Data API are missing from the Dataset API, but could be included in the latter. Perhaps "url" should then be renamed to be less ambiguous.

If you standardized fields in this manner, then the catalog could include as many "full dataset" fields as it wanted (beyond the required ones) in the bulk catalog (aka List Data API) listing. For example:

[
  {
    "id": "123",
    "revision": "1",
    "url": "http://data.worldbank.org/catalog/123.json",
    "modified": "2012-06-01",
    "change_type": "update",
    "title": "123 Data",
    "publisher": "http://www.worldbank.org"  // um, literal or resource here?
    // etc.
  }
]

In other words, the bulk catalog and the List Data API are the same, from the consumer's standpoint.

In practice, different catalogs would have the discretion of publishing either complete, nearly complete, or sparse datasets in the bulk catalog, depending on their respective implementations.

Consumers would start by accessing the bulk catalog listing, and request any missing fields via the "url" field.
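A consumer following this approach could paper over the naming mismatches listed above with a small renaming shim. This is a sketch under the assumption that the field pairs are in fact semantically equivalent (the "are the meanings consistent?" question above); the mapping itself is taken directly from the mismatches listed.

```python
# Dataset API name -> List Data API name, per the mismatches noted above.
DATASET_TO_LIST = {
    "version": "revision",
    "metadata_modified": "modified",
}

def normalize(record):
    """Rename Dataset API fields to their List Data API equivalents.

    Fields without a known mismatch pass through unchanged, so extra
    "full dataset" fields in a bulk listing are preserved as-is.
    """
    return {DATASET_TO_LIST.get(key, key): value for key, value in record.items()}
```

With this in place, the bulk catalog listing and individual Dataset API responses look identical to the consumer, which is the point of the standardization.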

@rufuspollock
Member Author

@willpugh nice suggestion. I guess we still need a way to signal your level of compliance?


@willpugh

I like tgherzog's suggestions. I think consistency between the listing APIs and the Dataset API is a good thing in general, and makes this case easier.

There are 3 reasonable suggestions here:
1) The caller just "figures it out" by following tgherzog's approach, and in the case that it cannot reference an ID directly, it only indexes what was in the list page.
2) The Catalog entity gets more fleshed out, and the compliance level lives there. This entity could exist as an endpoint that could be referenced as a file as well.
3) It could be declared in the <head> of the homepage as well, e.g.

<meta content="dcip-basic-rest-compliance" value="minimum" />

I think #3 seems more elegant.
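Option 3 can be sketched from the consumer side: discover the compliance level by scanning the homepage's head for the meta tag. The tag name and attribute layout follow the example above, which is itself only a proposal; note that a real `<meta>` tag would more conventionally use `name`/`content` attributes.

```python
from html.parser import HTMLParser

class ComplianceParser(HTMLParser):
    """Find the hypothetical dcip-basic-rest-compliance meta tag."""

    def __init__(self):
        super().__init__()
        self.level = None

    def handle_starttag(self, tag, attrs):
        d = dict(attrs)
        if tag == "meta" and d.get("content") == "dcip-basic-rest-compliance":
            self.level = d.get("value")

def compliance_level(html):
    """Return the declared compliance level, or None if not declared."""
    parser = ComplianceParser()
    parser.feed(html)
    return parser.level
```

A consumer would fetch the homepage once, read the level, and then decide whether to use the bulk file or the full paged API.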
