Add aggregate data.json as downloadable catalog entry and queryable API endpoint #315

ajturner · 2014-02-24T04:39:48Z

Data.gov is populated by each agency that is required to have a /data.json catalog endpoint.

Therefore, it seems that Data.gov should exemplify this by having its own http://data.gov/data.json endpoint for aggregate harvesting.

philipashlock · 2014-02-24T16:12:31Z

Yep, the same data is already available via the API, but not formatted as data.json. You'll just want to paginate through with the "rows" and "start" parameters like http://catalog.data.gov/api/3/action/package_search?rows=1000&start=2000
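A minimal sketch of the pagination described above, assuming the usual CKAN `package_search` response shape (`result.count` and `result.results`). The helper names `fetch_json` and `iter_packages` are hypothetical, and the page size of 1000 comes from the example URL rather than from any documented limit.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the example URL in the comment above.
BASE = "http://catalog.data.gov/api/3/action/package_search"


def fetch_json(url):
    """Fetch a URL and decode the JSON body (hypothetical helper)."""
    with urlopen(url) as resp:
        return json.load(resp)


def iter_packages(rows=1000, fetch=fetch_json, base=BASE):
    """Yield every package by stepping `start` forward one page at a time."""
    start = 0
    while True:
        url = f"{base}?{urlencode({'rows': rows, 'start': start})}"
        result = fetch(url)["result"]  # assumed CKAN response envelope
        batch = result["results"]
        if not batch:
            break
        yield from batch
        start += len(batch)
        if start >= result["count"]:
            break
```

Passing `fetch` as a parameter keeps the paging logic testable without network access; in practice a caller would likely also want retries and a pause between requests.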

This should probably be two separate requests: 1) add aggregate data.json as a catalog entry and 2) as a paginated/queryable API endpoint. Updating title to reflect that.

MarionRoyal · 2014-02-27T14:49:13Z

@ajturner
Andrew, are you suggesting that there should be a data.gov/data.json file?

rebeccawilliams · 2014-09-24T00:15:45Z

Flagging these two related Issues as well: #398, #420

kvuppala · 2014-10-29T18:37:08Z

Explore https://github.com/ckan/ckanapi for data.json generation

ajturner · 2014-10-29T18:43:28Z

@philipashlock @rebeccawilliams So I'm confused. If DCAT is the prescribed & required format, why would Data.gov then use a different format spec?

Back to other discussions, DCAT needs parameterization for pagination and querying. However the results array can still abide by the same DCAT spec.

If it's going to vary, then that defeats the purpose of a spec, and requires additional development, testing and general confusion about what format any 'API' endpoint provides.

ajturner · 2014-10-29T18:46:40Z

@MarionRoyal I think there is unfortunate confusion over whether data.json is a File or a Service. From my perspective, this is an implementation detail that is irrelevant to the Hypermedia definition of the web interfaces. Generally, data.json is the access point for the catalog, consistent across any catalog. It can be generated dynamically from a database, or hand-written. What matters is that it abides by web practices: durable URI, MIME types, caching headers, etc.

philipashlock · 2014-11-11T00:05:41Z

@kvuppala @ajturner Not sure how this got closed. It's definitely still a goal for us to provide a copy of the whole catalog using the data.json spec. Some technical issues still prevent this from working well, and we're working on those. We're also working on conventions for mapping existing metadata to the data.json format without it being lossy. Right now, most of the data.json metadata derived from existing geospatial metadata records doesn't include all of the original metadata, and data.gov still sources directly from the existing geospatial metadata records rather than from the data.json versions of those records.

kvuppala · 2014-11-11T16:52:38Z

@philipashlock Correct, it was closed by mistake. Per the last comment, we will evaluate https://github.com/ckan/ckanapi for this.

kvuppala · 2014-12-05T16:27:43Z

Look at consolidating both the data-json extensions that are used in inventory and catalog.

kvuppala · 2015-08-21T20:19:20Z

@ajturner
Yes, for both the inventory and catalog CKAN instances the export mapping is done to output the data.json file; however, we intended to enable this export plugin only in inventory CKAN.

On the catalog CKAN we intended not to enable this plugin at the end of the month, considering the huge number of datasets in the catalog. The data.json generation for the entire catalog might run into timeout issues unless we do some batch processing on the server, but all the necessary code changes are in place for an organization-level data.json export or even the full export for a smaller CKAN catalog.

Are you intending to reuse the code for this feature, or planning to download the data from the catalog in data.json format?

ajturner · 2015-08-23T13:53:10Z

@kvuppala we intend to use the catalog data.json in order to federate certain datasets to other catalogs - much the same as Data.gov does itself.

This question has become a requirement driven by US Federal Agencies that don't want to keep re-registering their data into multiple catalogs. From their perspective, they should register once and let technology standards handle making it show up in other catalogs and platforms. So immediately we need support for filtered Data.json output (e.g. Theme: Climate).

If Batch processing a cached file generated nightly (or so) is an option, that seems to be a great first step. Second would be to permit filtering (so the response may be in the dozens or hundreds) and then to add pagination. For example http://catalog.data.gov/data.json?groups=climate5434
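As an illustration of the filtered output requested above, here is a minimal client-side sketch. It assumes the catalog follows the Project Open Data v1.1 layout, with a top-level `dataset` list and a `theme` list on each dataset. The function name `filter_by_theme` is hypothetical, and the `groups=` parameter in the example URL is a proposal in this thread, not an existing endpoint.

```python
def filter_by_theme(catalog, theme):
    """Return a shallow copy of `catalog` keeping only datasets tagged with `theme`.

    Assumes a POD v1.1 style catalog: {"dataset": [{"theme": [...], ...}, ...]}.
    """
    wanted = theme.casefold()

    def matches(dataset):
        themes = dataset.get("theme") or []
        if isinstance(themes, str):  # tolerate a single string instead of a list
            themes = [themes]
        return any(t.casefold() == wanted for t in themes)

    filtered = dict(catalog)  # keep catalog-level fields unchanged
    filtered["dataset"] = [d for d in catalog.get("dataset", []) if matches(d)]
    return filtered
```

Because the result keeps the same catalog envelope, a downstream harvester could treat the filtered output exactly like a full data.json file.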

Data.gov stance?

The issue of DCAT and Data.json as inadequate for large catalogs has been raised many times. The response by @rgrp @philipashlock & specifically @benbalter has been:

A single .data.json file that's too large to easily manipulate would be a great problem to have.

I'd argue that the schema was written with developers in mind, not the ease of agency adoption (or data.gov). When the two are in conflict, we should err on the side of those we want to encourage to use the data, not those whose job it is to publish or organize the data.

Yet here we are - Data.gov requires submission of a standard but does not publish in it itself due to limitations of the specification. So either Project Open Data accepts that platforms publish whatever format they want and it's up to the catalog developer to work with those APIs, or Data.gov leads the way to support round-tripping via standards with support from the many participants & implementers on the threads above.

rufuspollock · 2015-08-24T10:13:00Z

@ajturner I note we did make a stab at a Data Portals spec that sought to address this kind of issue more broadly:

http://spec.dataportals.org/

In particular, that spec did consider pagination (an earlier version was actually much more complex and specified an entire API). The actual spec source including issue tracker can be found at https://github.com/dataprotocols/data-catalog-spec

dportnoy · 2016-02-25T05:48:25Z

+1 for having access to a gov-wide data.json (DCAT schema). Pagination and other query params are fine to deal with size burden. Ability to round-trip back to upstream data sources is crucial for validation and improving quality of data catalog.

philipashlock · 2017-01-27T19:08:01Z

We still don't provide an aggregate file as DCAT-based JSON or in any other format, but we are currently implementing a script to at least provide an aggregate export of all metadata exposed by the CKAN API as a JSON Lines file. We'll keep the Data.gov Harvesting page updated with all the methods available, including the CKAN API and CSW endpoint.

There's nothing inherent in the DCAT schema that prevents us from making this available, but we still have some inefficiencies and challenges in generating that output from CKAN even though the functionality is in place. Until we make the aggregate export from the CKAN API available for download, you can refer to this more complete run down of the API calls to understand how to access and filter the total number of metadata records.

Disclaimer: Data.gov also syndicates data from state and local governments. However, non-federal data sources are governed by different terms of service and often different licenses than Federal data. When using or harvesting data from Data.gov, please note this distinction. When harvesting large volumes of data or metadata through Data.gov, we recommend you filter for Federal sources and separate non-federal sources to avoid comingling metadata without making this distinction. You can filter for only Federal sources using the organization_type like organization_type:%22Federal+Government%22 as seen in some of the examples below.

Each of the URLs below is appended with &rows=0 to only display the count of results, but you can change this to paginate through the records to get a complete export. See the CKAN API documentation for more details.

https://catalog.data.gov/api/action/package_search?fq=((collection_package_id:*%20OR%20*:*)+AND+(type:dataset)+AND+(organization_type:%22Federal+Government%22))&rows=1000&start=2000

Total number of records using collections and organization filters

Total datasets equivalent to catalog.data.gov/dataset is
https://catalog.data.gov/api/action/package_search?fq=(type:dataset)&rows=0
--> 193,971 as of 01/27/2017

Total datasets including collections:
https://catalog.data.gov/api/action/package_search?fq=((collection_package_id:*%20OR%20*:*)+AND+(type:dataset))&rows=0
--> 2,180,661 as of 01/27/2017

Total packages of all types (dataset, harvest) including collections:
https://catalog.data.gov/api/action/package_search?fq=(collection_package_id:*%20OR%20*:*)&rows=0
--> 2,181,537 as of 01/27/2017

Total Federal datasets equivalent to catalog.data.gov/dataset?organization_type=Federal+Government is
https://catalog.data.gov/api/action/package_search?fq=(organization_type:%22Federal+Government%22+AND+(type:dataset))&rows=0
--> 151,422 as of 01/27/2017

Total Federal datasets including collections:
https://catalog.data.gov/api/action/package_search?fq=((collection_package_id:*%20OR%20*:*)+AND+(type:dataset)+AND+(organization_type:%22Federal+Government%22))&rows=0
--> 2,137,871 as of 01/27/2017
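The count queries above can be sketched as follows, assuming the usual CKAN response envelope in which the total appears at `result.count`. The helper names `count_url`, `read_count`, and `fetch_count` are hypothetical; only the endpoint and the `fq` and `rows` parameters come from the URLs listed above.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

SEARCH_URL = "https://catalog.data.gov/api/action/package_search"


def count_url(fq, base=SEARCH_URL):
    """Build a package_search URL that returns only the match count (rows=0)."""
    return f"{base}?{urlencode({'fq': fq, 'rows': 0})}"


def read_count(response):
    """Extract the total from a decoded package_search response."""
    if not response.get("success"):
        raise ValueError("CKAN reported an unsuccessful request")
    return response["result"]["count"]


def fetch_count(fq):
    """Run the count query over the network (not exercised offline)."""
    with urlopen(count_url(fq)) as resp:
        return read_count(json.load(resp))
```

For example, `fetch_count('(type:dataset)')` corresponds to the first query in the list. `urlencode` percent-encodes the parentheses and colon, which decodes to the same `fq` value as the hand-written URLs.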

rebeccawilliams · 2017-01-27T20:14:34Z

This is really helpful!

rebeccawilliams · 2017-01-27T20:28:47Z

It might be helpful to include geospatial carve outs too:

https://catalog.data.gov/api/action/package_search?fq=(organization_type:%22Federal+Government%22+AND+(type:dataset)+AND+(metadata_type:geospatial))&rows=0
--> 94,785 as of 01/27/2017
this is working/matches the UI: https://catalog.data.gov/dataset?metadata_type=geospatial&organization_type=Federal+Government
https://catalog.data.gov/api/action/package_search?fq=(organization_type:%22Federal+Government%22+AND+(type:dataset)+AND+(metadata_type=non-geospatial))&rows=0
--> 831? as of 01/27/2017
this is not? working/does not match the UI: https://catalog.data.gov/dataset?organization_type=Federal+Government&metadata_type=non-geospatial

philipashlock · 2017-01-27T21:16:06Z

@rebeccawilliams it looks like you were using = instead of : in your queries, and I think the only way to filter for non-spatial is to negate the search for metadata_type:geospatial.

For geospatial:
https://catalog.data.gov/api/action/package_search?fq=(organization_type:%22Federal+Government%22)+AND+(type:dataset)+AND+(metadata_type:geospatial)&rows=0

For non-geospatial:
https://catalog.data.gov/api/action/package_search?fq=(organization_type:%22Federal+Government%22)+AND+(type:dataset)+AND+-(metadata_type:geospatial)&rows=0
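A small sketch of how the two filter strings above could be assembled, with a leading `-` negating a clause as in the non-geospatial query. The helpers `clause` and `build_fq` are hypothetical names, and the quoting rule (wrap values containing whitespace in double quotes) is an assumption based on the `organization_type` example.

```python
def clause(field, value):
    """Format one `(field:value)` clause, quoting values that contain whitespace."""
    if any(ch.isspace() for ch in value):
        value = f'"{value}"'
    return f"({field}:{value})"


def build_fq(include, exclude=()):
    """Join included clauses with AND and prefix excluded clauses with `-`."""
    parts = [clause(field, value) for field, value in include]
    parts += ["-" + clause(field, value) for field, value in exclude]
    return " AND ".join(parts)


federal_datasets = [("organization_type", "Federal Government"), ("type", "dataset")]

# Geospatial: require metadata_type:geospatial.
geospatial_fq = build_fq(federal_datasets + [("metadata_type", "geospatial")])

# Non-geospatial: negate the same clause instead of matching a different value.
non_geospatial_fq = build_fq(federal_datasets, exclude=[("metadata_type", "geospatial")])
```

The resulting strings still need URL encoding before being placed in the `fq` parameter.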

rebeccawilliams · 2017-01-28T20:39:51Z

@philipashlock thanks! 56637, adds up ✅

DRDWIGHTSANDERS · 2017-02-13T00:56:12Z

FEDERAL STAKEHOLDER

kvuppala · 2017-03-20T06:48:20Z

@FuhuXia to set up the monthly cron job for the export process (GSA/data.gov#315 (comment))

kvuppala · 2017-09-25T15:09:57Z

Monthly JSON Lines export files will be available here: https://filestore.data.gov/gsa/catalog/jsonl/
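For anyone consuming such an export, here is a minimal sketch of reading a JSON Lines stream, where each non-empty line is one JSON document. The function name `iter_json_lines` is hypothetical, and nothing is assumed here about the file names, compression, or record fields of the actual export.

```python
import json


def iter_json_lines(lines):
    """Yield one decoded JSON value per non-empty line of `lines`.

    `lines` can be any iterable of strings, such as an open text file.
    """
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        try:
            yield json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {number} is not valid JSON") from exc
```

Reading line by line means the full export never has to fit in memory, which addresses the size concern raised earlier in this thread.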

JeanneHolm assigned gbinal Apr 3, 2014

JeanneHolm added this to the Version 2.5 milestone Apr 3, 2014

dialsunny removed this from the Version 2.5 milestone Apr 8, 2014

philipashlock added this to the Release 2.11 milestone Sep 23, 2014

rebeccawilliams added feature-request usability public feedback labels Sep 24, 2014

philipashlock modified the milestones: Release 2.12, Release 2.11 Sep 25, 2014

philipashlock assigned philipashlock and unassigned gbinal Sep 25, 2014

This was referenced Sep 29, 2014

Location of data.json #420

Closed

data.json request causes internal error #398

Closed

rebeccawilliams mentioned this issue Oct 2, 2014

Trying to request catalog.data.gov/data.json results in error #475

Closed

rebeccawilliams mentioned this issue Oct 16, 2014

Data.gov API for available datasets #478

Closed

kvuppala closed this as completed Oct 29, 2014

philipashlock reopened this Nov 11, 2014

kvuppala modified the milestones: Release 2.13, Release 2.12 Nov 11, 2014

kvuppala assigned ydave-reisys and unassigned philipashlock Dec 5, 2014

kvuppala modified the milestones: Release 2.21, Release 2.20 Sep 8, 2015

kvuppala removed this from the Release 2.20.1 milestone Sep 30, 2015

kvuppala assigned philipashlock and unassigned vasili4 Sep 30, 2015

philipashlock mentioned this issue May 18, 2016

How should I download the entire catalog? #730

Closed

kvuppala added Backlog Review and removed Backlog labels Mar 20, 2017

kvuppala assigned FuhuXia Mar 20, 2017

kvuppala added In Progress and removed Review labels May 25, 2017

JJediny mentioned this issue May 26, 2017

Implement sitemap xml for CKAN data catalog #769

Closed

kvuppala added Ready and removed In Progress labels Sep 13, 2017

kvuppala closed this as completed Sep 25, 2017

kvuppala removed the Ready label Sep 25, 2017

kvuppala mentioned this issue Sep 26, 2017

aggregate data.json from catalog.data.gov in POD schema spec #806

Open
