This repository has been archived by the owner on Mar 15, 2022. It is now read-only.

Add aggregate data.json as downloadable catalog entry and queryable API endpoint #315

Closed
ajturner opened this issue Feb 24, 2014 · 27 comments

Comments

@ajturner

Data.gov is populated by each agency that is required to have a /data.json catalog endpoint.

Therefore, it seems that Data.gov should exemplify this by having its own http://data.gov/data.json endpoint for aggregate harvesting.

@philipashlock
Member

Yep, the same data is already available via the API, but not formatted as data.json. You'll just want to paginate through with the "rows" and "start" parameters like http://catalog.data.gov/api/3/action/package_search?rows=1000&start=2000
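
As a rough illustration of the pagination described above, the sketch below walks `package_search` by advancing `start` one page at a time. The `fetch_page` callable is a hypothetical stand-in for an HTTP GET that returns the decoded JSON body, and the response shape (`result` containing a `results` list) is assumed from the CKAN action API rather than taken from this thread.

```python
from typing import Callable, Dict, Iterator
from urllib.parse import urlencode

BASE_URL = "http://catalog.data.gov/api/3/action/package_search"


def page_url(start: int, rows: int = 1000) -> str:
    """Build one package_search URL for the given offset and page size."""
    return f"{BASE_URL}?{urlencode({'rows': rows, 'start': start})}"


def iter_packages(fetch_page: Callable[[str], Dict], rows: int = 1000) -> Iterator[Dict]:
    """Yield every package by advancing `start` until a page comes back empty.

    `fetch_page` is a hypothetical helper: it takes a URL and returns the
    decoded JSON body, assumed to look like {"result": {"results": [...]}}.
    """
    start = 0
    while True:
        results = fetch_page(page_url(start, rows))["result"]["results"]
        if not results:
            break
        yield from results
        # Advance by what was actually returned, in case the server caps `rows`.
        start += len(results)
```

Passing the fetcher in as an argument keeps the paging logic independent of any particular HTTP library and makes it easy to exercise without touching the live catalog.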

This should probably be two separate requests: 1) add aggregate data.json as a catalog entry and 2) as a paginated/queryable API endpoint. Updating title to reflect that.

@MarionRoyal
Contributor

@ajturner
Andrew, are you suggesting that there should be a data.gov/data.json file?

@JeanneHolm JeanneHolm added this to the Version 2.5 milestone Apr 3, 2014
@dialsunny dialsunny removed this from the Version 2.5 milestone Apr 8, 2014
@philipashlock philipashlock added this to the Release 2.11 milestone Sep 23, 2014
@rebeccawilliams
Contributor

Flagging these two related Issues as well: #398, #420

@kvuppala
Contributor

Explore https://github.com/ckan/ckanapi for data.json generation

@ajturner
Author

@philipashlock @rebeccawilliams So I'm confused. If DCAT is the prescribed and required format, why would Data.gov then use a different format spec?

Back to other discussions, DCAT needs parameterization for pagination and querying. However the results array can still abide by the same DCAT spec.

If it's going to vary, then that defeats the purpose of a spec, and requires additional development, testing, and general confusion about what format any 'API' endpoint provides.

@ajturner
Author

@MarionRoyal I think the confusion over whether data.json is a File or a Service is unfortunate. From my perspective, this is an implementation detail that is irrelevant to the hypermedia definition of the web interface. Generally, data.json is the access point for the catalog, and it is consistent across any catalog. It can be generated dynamically from a database or hand-written. What matters is that it abides by web practices: durable URI, MIME types, caching headers, etc.

@philipashlock
Member

@kvuppala @ajturner Not sure how this got closed; it's definitely still a goal for us to be able to provide a copy of the whole catalog using the data.json spec. We still have some technical issues preventing this from working well, which we're working on, and we're also working on some conventions for mapping existing metadata to the data.json format without it being lossy. Right now, most of the data.json metadata derived from existing geospatial metadata records doesn't include all of the original metadata, and data.gov still sources directly from existing geospatial metadata records rather than the data.json versions of those records.

@philipashlock philipashlock reopened this Nov 11, 2014
@kvuppala
Contributor

@philipashlock Correct, it was closed by mistake. Per the last comment, we will evaluate https://github.com/ckan/ckanapi for this.

@kvuppala kvuppala modified the milestones: Release 2.13, Release 2.12 Nov 11, 2014
@kvuppala
Contributor

kvuppala commented Dec 5, 2014

Look at consolidating both the data-json extensions that are used in inventory and catalog.

@kvuppala
Contributor

@ajturner
Yes, for both the inventory and catalog CKAN instances the export mapping is done to output the data.json file; however, we intended to enable this export plugin only in the inventory CKAN.

On the catalog CKAN we intended not to enable this plugin at the end of the month, considering the huge number of datasets in the catalog. The data.json generation for the entire catalog might run into timeout issues unless we do some batch processing on the server, but all the necessary code changes are in place for an organization-level data.json export or even the full export for a smaller CKAN catalog.

Are you intending to reuse the code for this feature, or planning to download the data from the catalog in data.json format?

@ajturner
Author

@kvuppala we intend to use the catalog data.json in order to federate certain datasets to other catalogs - much the same as Data.gov does itself.

This question has become a requirement driven by US Federal Agencies that don't want to keep re-registering their data in multiple catalogs. From their perspective, they should register it once and let technology standards handle it showing up in other catalogs and platforms. So our immediate need is support for filtered data.json output (e.g. Theme: Climate).

If batch processing into a cached file generated nightly (or so) is an option, that seems like a great first step. The second step would be to permit filtering (so the response may be in the dozens or hundreds of records), and the third to add pagination. For example: http://catalog.data.gov/data.json?groups=climate5434
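
A minimal sketch of the filter-then-paginate idea, applied locally to an already-downloaded catalog. It assumes the catalog is a dict with a `dataset` list whose entries may carry a `theme` list, as in the Project Open Data schema. The function name and the `start`/`rows` parameters are hypothetical and do not describe an existing Data.gov endpoint.

```python
from typing import Dict, List, Optional


def filter_catalog(catalog: Dict, theme: Optional[str] = None,
                   start: int = 0, rows: int = 100) -> Dict:
    """Return a catalog of the same shape holding one page of matching datasets."""
    datasets: List[Dict] = catalog.get("dataset", [])
    if theme is not None:
        wanted = theme.lower()
        datasets = [
            d for d in datasets
            if any(t.lower() == wanted for t in d.get("theme", []))
        ]
    # Copy the top-level keys so the result is still a catalog, then swap in the page.
    page = dict(catalog)
    page["dataset"] = datasets[start:start + rows]
    return page
```

Because the filtered result keeps the catalog's top-level keys and only narrows the `dataset` array, a consumer could parse it with the same code it uses for a full data.json file.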

Data.gov stance?

The issue of DCAT and Data.json as inadequate for large catalogs has been raised many times. The response by @rgrp @philipashlock & specifically @benbalter has been:

A single .data.json file that's too large to easily manipulate would be a great problem to have.

I'd argue that the schema was written with developers in mind, not the ease of agency adoption (or data.gov). When the two are in conflict, we should err on the side of those we want to encourage to use the data, not those whose job it is to publish or organize the data.

Yet here we are: Data.gov requires submission in a standard format but does not publish in that format itself due to limitations of the specification. So either Project Open Data accepts that platforms publish whatever format they want and it's up to the catalog developer to work with those APIs, or Data.gov leads the way in supporting round-tripping via standards, with support from the many participants and implementers on the threads above.

@rufuspollock

@ajturner I note we did make a stab at a Data Portals spec that sought to address this kind of issue more broadly:

http://spec.dataportals.org/

In particular, that spec did consider pagination (an earlier version was actually much more complex and specified an entire API). The actual spec source including issue tracker can be found at https://github.com/dataprotocols/data-catalog-spec

@kvuppala kvuppala modified the milestones: Release 2.21, Release 2.20 Sep 8, 2015
@kvuppala kvuppala removed this from the Release 2.20.1 milestone Sep 30, 2015
@kvuppala kvuppala assigned philipashlock and unassigned vasili4 Sep 30, 2015
@dportnoy

+1 for having access to a gov-wide data.json (DCAT schema). Pagination and other query params are fine for dealing with the size burden. The ability to round-trip back to upstream data sources is crucial for validation and for improving the quality of the data catalog.

@philipashlock
Member

Currently, we still don't provide an aggregate file as DCAT-based JSON or in any other format, but we are implementing a script to at least provide an aggregate export of all metadata exposed by the CKAN API as a JSON Lines file. We'll keep the Data.gov Harvesting page updated with all the methods available, including the CKAN API and CSW endpoint.
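
For readers unfamiliar with JSON Lines, the sketch below shows the general idea: one JSON object per line, which lets a very large export be written and read as a stream. This is an illustration only, not the Data.gov export script; the function names are hypothetical.

```python
import json
from typing import Dict, Iterable, Iterator, TextIO


def write_jsonl(records: Iterable[Dict], out: TextIO) -> int:
    """Write one JSON object per line and return how many were written."""
    count = 0
    for record in records:
        out.write(json.dumps(record, ensure_ascii=False) + "\n")
        count += 1
    return count


def read_jsonl(src: TextIO) -> Iterator[Dict]:
    """Yield records back one at a time, skipping blank lines."""
    for line in src:
        if line.strip():
            yield json.loads(line)
```

Since neither function holds more than one record in memory, the same approach works whether the source has a few hundred records or a few million.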

There's nothing inherent in the DCAT schema that prevents us from making this available, but we still have some inefficiencies and challenges in generating that output from CKAN, even though the functionality is in place. Until we make the aggregate export from the CKAN API available for download, you can refer to this more complete rundown of the API calls to understand how to access and filter the total number of metadata records.


Disclaimer: Data.gov also syndicates data from state and local governments. However, non-federal data sources are governed by different terms of service and often different licenses than Federal data. When using or harvesting data from Data.gov, please note this distinction. When harvesting large volumes of data or metadata through Data.gov, we recommend you filter for Federal sources and separate non-federal sources to avoid comingling metadata without making this distinction. You can filter for only Federal sources using the organization_type like organization_type:%22Federal+Government%22 as seen in some of the examples below.


Each of the count URLs below ends with &rows=0 to display only the count of results, but you can change this (for example to &rows=1000&start=2000) to paginate through the records and get a complete export. See the CKAN API documentation for more details.

https://catalog.data.gov/api/action/package_search?fq=((collection_package_id:*%20OR%20*:*)+AND+(type:dataset)+AND+(organization_type:%22Federal+Government%22))&rows=1000&start=2000

Total number of records using collections and organization filters

Total datasets equivalent to catalog.data.gov/dataset is
https://catalog.data.gov/api/action/package_search?fq=(type:dataset)&rows=0
--> 193,971 as of 01/27/2017

Total datasets including collections:
https://catalog.data.gov/api/action/package_search?fq=((collection_package_id:*%20OR%20*:*)+AND+(type:dataset))&rows=0
--> 2,180,661 as of 01/27/2017

Total packages of all types (dataset, harvest) including collections:
https://catalog.data.gov/api/action/package_search?fq=(collection_package_id:*%20OR%20*:*)&rows=0
--> 2,181,537 as of 01/27/2017

Total Federal datasets equivalent to catalog.data.gov/dataset?organization_type=Federal+Government is
https://catalog.data.gov/api/action/package_search?fq=(organization_type:%22Federal+Government%22+AND+(type:dataset))&rows=0
--> 151,422 as of 01/27/2017

Total Federal datasets including collections:
https://catalog.data.gov/api/action/package_search?fq=((collection_package_id:*%20OR%20*:*)+AND+(type:dataset)+AND+(organization_type:%22Federal+Government%22))&rows=0
--> 2,137,871 as of 01/27/2017
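
The count queries above can be sketched as a pair of small helpers: one that builds a `rows=0` URL from an `fq` string, and one that reads the total from the decoded response. The response envelope (`success`, `result.count`) is assumed from the CKAN action API, and the helper names are hypothetical. Note that `urlencode` percent-encodes more characters than the hand-written URLs above do, but the query decodes to the same string.

```python
from typing import Dict
from urllib.parse import urlencode

SEARCH_URL = "https://catalog.data.gov/api/action/package_search"


def count_url(fq: str) -> str:
    """Build a package_search URL that asks only for the match count (rows=0)."""
    return f"{SEARCH_URL}?{urlencode({'fq': fq, 'rows': 0})}"


def extract_count(response: Dict) -> int:
    """Pull the total out of a decoded response.

    Assumes the CKAN action API envelope:
    {"success": true, "result": {"count": N, ...}}.
    """
    if not response.get("success"):
        raise ValueError("package_search reported failure")
    return int(response["result"]["count"])
```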

@rebeccawilliams
Contributor

This is really helpful!

@philipashlock
Member

@rebeccawilliams it looks like you were using = instead of : for your queries and I think the only way to filter for non-spatial is to negate the search for metadata_type:geospatial

For geospatial:
https://catalog.data.gov/api/action/package_search?fq=(organization_type:%22Federal+Government%22)+AND+(type:dataset)+AND+(metadata_type:geospatial)&rows=0

For non-geospatial:
https://catalog.data.gov/api/action/package_search?fq=(organization_type:%22Federal+Government%22)+AND+(type:dataset)+AND+-(metadata_type:geospatial)&rows=0
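
A small sketch of how the two filters above could be assembled programmatically, using `:` between field and value and a leading `-` to negate a clause. The helper names are hypothetical, and the output is the decoded `fq` string, which would still need URL-encoding before being sent.

```python
from typing import Iterable


def fq_clause(field: str, value: str, negate: bool = False) -> str:
    """Format one field:value clause; quote values that contain spaces."""
    if " " in value:
        value = f'"{value}"'
    clause = f"({field}:{value})"
    return f"-{clause}" if negate else clause


def fq_all(clauses: Iterable[str]) -> str:
    """Join clauses with AND."""
    return " AND ".join(clauses)


# Federal datasets that are NOT geospatial: negate the metadata_type clause.
non_geospatial = fq_all([
    fq_clause("organization_type", "Federal Government"),
    fq_clause("type", "dataset"),
    fq_clause("metadata_type", "geospatial", negate=True),
])
```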

@rebeccawilliams
Contributor

@philipashlock thanks! 56637, adds up ✅

@DRDWIGHTSANDERS

FEDERAL STAKEHOLDER

@kvuppala
Contributor

@FuhuXia to set up the monthly cron job for the export process (GSA/data.gov#315 (comment))

@kvuppala
Contributor

Monthly JSON Lines export files will be available here: https://filestore.data.gov/gsa/catalog/jsonl/
