Add aggregate data.json as downloadable catalog entry and queryable API endpoint #315
Comments
Yep, the same data is already available via the API, but not formatted as data.json. You'll just want to paginate through with the "rows" and "start" parameters like http://catalog.data.gov/api/3/action/package_search?rows=1000&start=2000 This should probably be two separate requests: 1) add aggregate data.json as a catalog entry and 2) as a paginated/queryable API endpoint. Updating title to reflect that. |
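The "rows"/"start" pagination described in the comment above could be scripted roughly as follows. This is a minimal sketch, not Data.gov's own tooling: the function names `fetch_page` and `iter_packages` are made up for illustration, and the code assumes the standard CKAN action API response shape, where the payload sits under `result` with a `count` total and a `results` list.

```python
import json
import urllib.parse
import urllib.request

# Endpoint taken from the comment above.
API_URL = "http://catalog.data.gov/api/3/action/package_search"


def fetch_page(start, rows):
    """Fetch one page of package_search results and return the parsed JSON."""
    query = urllib.parse.urlencode({"rows": rows, "start": start})
    with urllib.request.urlopen(f"{API_URL}?{query}") as response:
        return json.load(response)


def iter_packages(fetch=fetch_page, rows=1000):
    """Yield every package, advancing `start` until the reported count is reached.

    `fetch` is injectable (hypothetical design choice) so the loop can be
    exercised without network access.
    """
    start = 0
    while True:
        page = fetch(start, rows)
        results = page["result"]["results"]
        if not results:
            break  # empty page: nothing left to read
        yield from results
        # Advance by what was actually returned, in case the server
        # hands back fewer rows than requested.
        start += len(results)
        if start >= page["result"]["count"]:
            break
```

Calling `for package in iter_packages(): ...` would walk the whole catalog one page at a time. Advancing by the number of rows actually returned keeps the loop correct even if the server caps the page size below the requested value.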
@ajturner |
Explore https://github.com/ckan/ckanapi for data.json generation |
@philipashlock @rebeccawilliams So I'm confused. If DCAT is the prescribed & required format, why would Data.gov then use a different format spec? Back to other discussions, DCAT needs parameterization for pagination and querying. However the If it's going to vary, then that defeats the purpose of a spec, and requires additional development, testing and general confusion about what format any 'API' endpoint provides. |
@MarionRoyal I think it's an unfortunate confusion whether data.json is a File or a Service. From my perspective, this is an implementation detail that is irrelevant in the Hypermedia definition of the web interfaces. Generally, data.json is the access point for the catalog that is consistent across any catalog. It can be generated dynamically by a database, or hand-written. What matters is that it abides by web practices: durable URI, MIME types, caching headers, etc. |
@kvuppala @ajturner Not sure how this got closed; it's definitely still a goal for us to be able to provide a copy of the whole catalog using the data.json spec. We still have some technical issues preventing this from working well, which we're working on, but we're also working on some conventions for mapping existing metadata to the data.json format without it being lossy. Right now, most of the data.json metadata derived from existing geospatial metadata records doesn't include all of the original metadata, and data.gov still sources directly from existing geospatial metadata records rather than the data.json versions of those records. |
@philipashlock correct, it was closed by mistake. Per the last comment, we will evaluate https://github.com/ckan/ckanapi for this |
Look at consolidating both the data-json extensions that are used in inventory and catalog. |
@ajturner On the catalog CKAN we intended not to enable this plugin at the end of the month, considering the huge number of datasets in the catalog. The data.json generation for the entire catalog might run into timeout issues unless we do some batch processing on the server, but all the necessary code changes are in place for an organization-level data.json export or even the full export for a smaller CKAN catalog. Are you intending to reuse the code for this feature, or planning to download the data from the catalog in data.json format? |
@kvuppala we intend to use the catalog data.json in order to federate certain datasets to other catalogs, much the same as Data.gov does itself. This question has become a requirement driven by US Federal Agencies that don't want to keep re-registering their data into multiple catalogs. From their perspective, they should do it once and technology standards should handle it showing up in other catalogs and platforms. So immediately we need support for filtered Data.json output (e.g. Theme: Climate). If batch processing a cached file generated nightly (or so) is an option, that seems to be a great first step. Second would be to permit filtering (so the response may be in the dozens or hundreds) and then to add pagination. For example Data.gov stance? The issue of DCAT and Data.json as inadequate for large catalogs has been raised many times. The response by @rgrp @philipashlock & specifically @benbalter has been:
Yet here we are - Data.gov requires submission of a standard but does not publish itself due to limitations of the specification. So either Project Open Data accepts that platforms publish whatever format they want and it's up to the catalog developer to work with those API or Data.gov leads the way to support round-tripping via standards with support from the many participants & implementers on the threads above. |
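The filtered output requested above (e.g. Theme: Climate) can be approximated on the client side once a data.json document is in hand. The sketch below is an illustration only, not an existing Data.gov endpoint or CKAN feature: `filter_by_theme` is a hypothetical helper, and it assumes the Project Open Data v1.1 layout, where the catalog is an object with a `dataset` list and each dataset may carry a `theme` list of strings.

```python
def filter_by_theme(catalog, theme):
    """Return a copy of a data.json catalog keeping only datasets tagged with `theme`.

    Assumes the catalog is a dict with a "dataset" list (Project Open Data
    v1.1 layout); matching is case-insensitive.
    """
    wanted = theme.casefold()
    matches = []
    for dataset in catalog.get("dataset", []):
        themes = dataset.get("theme", [])
        if isinstance(themes, str):
            themes = [themes]  # tolerate a bare string instead of a list
        if any(t.casefold() == wanted for t in themes):
            matches.append(dataset)
    filtered = dict(catalog)  # shallow copy keeps catalog-level fields
    filtered["dataset"] = matches
    return filtered
```

Because the catalog-level fields are copied through unchanged, the result is still shaped like a data.json document and could be served or cached as a smaller themed catalog.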
@ajturner I note we did make a stab at a Data Portals spec that sought to address this kind of issue more broadly: In particular, that spec did consider pagination (an earlier version was actually much more complex and specified an entire API). The actual spec source including issue tracker can be found at https://github.com/dataprotocols/data-catalog-spec |
+1 for having access to a gov-wide data.json (DCAT schema). Pagination and other query params are fine to deal with size burden. Ability to round-trip back to upstream data sources is crucial for validation and improving quality of data catalog. |
Currently, we still don't provide an aggregate file as DCAT-based JSON or in any other format, but we are implementing a script to at least provide an aggregate export of all metadata exposed by the CKAN API as a JSON Lines file. We'll keep the Data.gov Harvesting page updated with all the methods available, including the CKAN API and CSW endpoint. There's nothing inherent in the DCAT schema that prevents us from making this available, but we still have some inefficiencies and challenges in generating that output from CKAN even though the functionality is in place. Until we make the aggregate export from the CKAN API available for download, you can refer to this more complete rundown of the API calls to understand how to access and filter the total number of metadata records. Disclaimer: Data.gov also syndicates data from state and local governments. However, non-federal data sources are governed by different terms of service and often different licenses than Federal data. When using or harvesting data from Data.gov, please note this distinction. When harvesting large volumes of data or metadata through Data.gov, we recommend you filter for Federal sources and separate non-federal sources to avoid commingling metadata without making this distinction. You can filter for only Federal sources using the Each of the URLs below is appended with
Total number of records using collections and organization filters
Total datasets equivalent to catalog.data.gov/dataset is
Total datasets including collections:
Total packages of all types (dataset, harvest) including collections:
Total Federal datasets equivalent to catalog.data.gov/dataset?organization_type=Federal+Government is
Total Federal datasets including collections: |
This is really helpful! |
It might be helpful to include geospatial carve outs too:
|
@rebeccawilliams it looks like you were using For non-geospatial: |
@philipashlock thanks! 56637, adds up ✅ |
FEDERAL STAKEHOLDER |
@FuhuXia to set up the monthly cron job for the export process (GSA/data.gov#315 (comment)) |
Monthly JSON Lines export files will be available here: https://filestore.data.gov/gsa/catalog/jsonl/ |
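For anyone consuming those exports, a JSON Lines file can be read one record at a time without loading the whole catalog into memory. This is a minimal sketch under assumptions: the thread does not say how the export files are named, whether they are compressed, or what fields each record holds, so `iter_jsonl` is a hypothetical helper and the sample records with a `name` field are invented for illustration.

```python
import io
import json


def iter_jsonl(lines):
    """Yield one parsed JSON object per non-blank line of a JSON Lines stream."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines, e.g. a trailing newline at end of file
        yield json.loads(line)


# In-memory stand-in for a downloaded export file (made-up records).
sample = io.StringIO('{"name": "a"}\n\n{"name": "b"}\n')
names = [record["name"] for record in iter_jsonl(sample)]
```

The same function works on a real file by passing an open text handle, for example one returned by `open(path, encoding="utf-8")`, since it only needs an iterable of lines.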
Data.gov is populated by each agency that is required to have a /data.json catalog endpoint. Therefore, it seems that Data.gov should exemplify this by having its own http://data.gov/data.json endpoint for aggregate harvesting.