Single-file /data catalog not good--optional alternative suggested #105
@jeffdlb, is it fair to say that the Sitemap index standard is a good model here? |
@jeffdlb @waldoj I'd say so, and had done so before in issue #27, where I suggested pagination like that seen in OpenSearch providers, or breaking the catalog into smaller files as is done with sitemaps. I think Jeff's point about supporting the needs not just of Data.gov but also of agencies that support other initiatives deserves consideration here. |
I'd argue that the schema was written with developers in mind, not the ease of agency adoption (or Data.gov's). When the two are in conflict, we should err on the side of those we want to encourage to use the data, not those whose job it is to publish or organize the data. Going to multiple formats may make things easier for agencies, but it does so at the average developer's expense. This is a great case for practicality over purity. Practically, allowing sub-data files has two implications: […] Forcing agencies to make a single […] |
But doesn't the process that you're describing still necessitate that "some software will need to merge everything into a giant list"? It seems that complexity is being pushed out to the clients, rather than being resolved within the federal agency. |
A subtlety in this proposal, and one I opened as a suggestion on GitHub two months ago, is that we allow the data.json file to contain entries for existing standards-based catalogs or APIs. Each such collection - and we manage many in the geospatial domain - could be simply marked up, one entry per collection, within the agency's json file, with the breadcrumbs to perform the query and/or indexing. CKAN already has built-in harvesting capability on these established protocols, as well as json, so the integration challenge would be minor. It would produce the same, if not better, results as a serialized subset of metadata in feeds, since it all ends up in the searchable index. The benefits of this approach are many: […]
I propose the following text changes to the implementation document, implementation-guide.md, along with a modification of the harvest routine to recognize the catalog resource type within the feed:

A) Minimum Required for Compliance

Produce a single catalog or list of data managed in a single table, workspace, or other relevant location. Describe each dataset or existing metadata catalog according to the common core metadata. This listing can be maintained in a Data Management System (DMS) such as the open-source CKAN platform; a single spreadsheet, with each metadata field as its own column; or a DMS of your choosing. A description of each agency metadata catalog, such as CKAN, can be placed in the agency json file as a single entry. This entry will describe the resource type of "catalog" and the access URL to be used in harvest by data.gov.

Metadata for geographic or geospatial information is often collected using the FGDC Content Standard for Digital Geospatial Metadata or ISO 19115/19139 and represented as XML, providing content that maps to the common core metadata. These collections are exposed using the Open Geospatial Consortium Catalog Service for the Web interface (CSW 2.0.2) or as a read-enabled HTTP directory known as a Web Accessible Folder (WAF). In lieu of posting individual entries for each geospatial dataset in the json file, a single json entry should be prepared for each geospatial metadata collection (WAF) or service (CSW) as a "Harvest Source", enabling harvest of the collections by catalog.data.gov. Individual geospatial metadata entries for datasets, applications, or services should not be duplicated in the agency json feed. |
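For illustration, a minimal sketch of what such a catalog-reference entry might look like in an agency data.json, using the common core field names; the "catalog" format value is the convention proposed above (not part of the published schema), and the URL is hypothetical:

```json
{
  "title": "Agency Geospatial Metadata Catalog",
  "description": "CSW harvest source for the agency's ISO 19139 metadata collection.",
  "format": "catalog",
  "accessURL": "https://geodata.example.gov/csw?service=CSW&version=2.0.2&request=GetCapabilities",
  "accessLevel": "public",
  "modified": "2013-08-01"
}
```

A harvester that recognized the "catalog" format would hand the accessURL to its CSW (or WAF) harvest routine rather than treating it as an ordinary dataset entry. |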
Thanks @ddnebert for pointing out that agencies have FGDC/ISO metadata. I had posted a mapping of those specs to DCAT (the mapping is not trivial, given the absence of 1:N elements in DCAT, different date notations, and different interpretations of fields, to name a few issues) and submitted that as pull request #74. I would like to see your thoughts on that mapping. A second point is the focus on getting things to work for CKAN. I understand that Data.gov uses CKAN, but would it not be better if a solution is designed that works across the government regardless of technology? That is what the geospatial domain has been working on for many years and what was done to promote an open ecosystem of suppliers and consumers of data. To me that also relates to @jeffdlb's point regarding programs like GEOSS (which you play a key role in) and GCMD, not to mention Eye on Earth, UNEP Live, and various other global initiatives focused on open data sharing. |
@waldoj - I don't think breaking up the single list into a linked set of lists pushes complexity to the users. The end goal of the inventory is not just to have a list, it is to populate data.gov or other portals or commercial search engines. In all cases, entries in each list (whether there is one list per agency or many) will be going into a database of some type. That database will be updated one entry at a time by reading through the lists. |
@benbalter - If every bureau's sub-list is linked from the master list, then I believe the data would be less "siloed" than currently. At present any bureau-level inventories are not standardized, whereas this effort would standardize and link them. |
@ddnebert I like this suggestion. In fact, some geospatial energy data has already been published this way. The Department of Energy is helping create a datastore for geothermal data called the National Geothermal Data Store. One of the nodes of this system is the State Geothermal Database. This data store conforms to the ISO 19139 geospatial metadata standard discussed above, and uses the Catalog Service for the Web (CSW) standard to provide interoperability. Here's an example CSW endpoint to the same data store linked above. Because the CKAN harvester knows how to interoperate with CSW, all of the datasets (including the geospatial data) were able to be added to the catalog.

For agencies that have existing data portals that can be easily harvested by CKAN, pointing the harvester at the data store itself could make a lot of sense. I should note that not ALL of the Department's data will be easily harvested in this way. For legacy or custom data portals that did not use standards, or are otherwise not up to date, creating the data.json file is a good forcing function. But for systems that have this capability, why not use it? |
CKAN's CSW syndication functionality is great, but this initiative is not about expanding CKAN support within the federal government. The goal is to "produce a single catalog or list of data managed in a single table, workspace, or other relevant location". Putting some data behind a different protocol runs counter to this goal (the data then ceases to be "in a single table"), and erects a significant hurdle to anybody who wants to syndicate that data (requiring parsing 2+ catalog formats). Perhaps some tools could be built to solve this problem via both design patterns? That is, the need to syndicate data can be addressed via harvesting from CSW (for example), precisely as proposed here. Simultaneously, some tools can be built to extract that catalog data and convert it into the data.json format. |
Very good discussion. If I might add my couple of pennies...

History

The original DataGov metadata template created back in 2009 was done rather […] DCAT evolved as a means of connecting catalogs together, and in doing so, […] When we started tying DataGov with Geospatial One-Stop (GOS), we used an […] We kept this rich catalog of metadata along with its harvesting […] There are experts already in this discussion who know the details […]

So

We are asking the geospatial community to abandon their long-fought process […] They just can't do that. Of course they can develop tools to spit out […]

Today

Today we are preparing to harvest agency metadata using the new CORE schema […]

Reality (from my perspective)

The new CORE metadata schema is a temporary solution, not a permanent one. However, it may be what we need just now: a simple way for agencies to make […] Instead, we should seek a longer-term solution (maybe not permanent, but […]). I think that the long-term solution that we should be working toward is […]

So What (in my opinion)

The proposal that this simple (long) JSON file may contain pointers to more […]
Marion A. Royal PMP |
Regarding the statement: "CKAN's CSW syndication functionality is great, but this initiative is not about expanding CKAN support within the federal government. The goal is to "produce a single catalog or list of data managed in a single table, workspace, or other relevant location". " it is my understanding that the intention of the .json feeds was to feed the single government data search engine (powered by CKAN) to create a comprehensive view of governmental metadata. The creation of a feed (or catalog service) is only a means to an end since the metadata cache at CKAN provides the necessary interface to access all governmental metadata, or selected metadata via query. BTW, CKAN has the ability to expose its results in json and RDFa, albeit a DCAT flavor, as the result of a query. So, if it is json format that you need, it is also accessible there in addition to several other APIs. By allowing multiple (already supported) protocols and formats, we have now created a seamless virtual catalog of government metadata. We require the option to reference the existence of metadata collections as WAF or CSW individual entries within the agency json feed. This is an operational solution that fulfills the requirements of both the geospatial and 'raw' data communities - allowing agency and federated views, exposing actionable APIs, enabling counts and tracking, and most importantly enabling access to the data, services, and applications of our federal, state, local, tribal, and academic partners. |
The stated purpose of providing the specified data.json file, as described in Slash Data Catalog Requirements, is "to make [the catalog] more easily discoverable to private-sector developers and entrepreneurs." Data.gov is just one client. |
The proposed hybrid solution that feeds a common search and retrieval API (that can return json and other formats) supports that requirement even better. With /data in every agency, developers would need to locate such directories and then visit each one individually. With a common search interface built on the json syndication, developers will have an easier time interacting with the metadata through a single entrypoint. |
I follow, but M-13-13 says: […]

That's a per-agency mandate. There are lots of details about implementation that can be altered, but fundamentally M-13-13 requires a complete inventory, on the agency website, within /data, as human- and machine-readable data. |
One could interpret the requirement as satisfied by including in the json an entry for each collection description. You're right that it is a 'mandate' rather than an explanation of the objective or outcome. If the desired outcome is to produce links to all government metadata, the hybrid solution satisfies it. We should recommend modification of M-13-13 to include the implemented capabilities that already provide access to over half a million government data asset descriptions. I'm also thinking that json does not strictly satisfy the M-13-13 desire for a 'human readable' format; the query results from CKAN can be formatted and styled in many ways. |
Just to echo part of @MarionRoyal's point about the pragmatic push currently going on, I think it's worth noting that we all know NOAA and USGS are the two 900-lb. gorillas when it comes to number of entries. They account for 5/6 of the datasets in Data.gov currently (32k and 18k entries respectively). I agree with @waldoj that the move is to create a solution for them while keeping intact the simple and clear agency.gov/data.json requirement as it currently exists for the other 168 agencies currently reporting data in Data.gov. I'm working with the 98% of agencies that should be able to handle this fine for the short and medium term, but I agree that we need to figure out something that can scale for NOAA and USGS. I agree that each level of complexity we introduce to the structure of data.json files increases the burden for third-party adoption and costs us more in the long run. |
+1 to @waldoj's comment, and -1 to @ddnebert's response (sorry, Doug, I have to disagree with you on this one). Is Data.gov the "one ring to rule them all"? I think the world has moved well beyond the one-stop-shop paradigm, whether we're talking about data assets or shoes. As far as I'm concerned, the big driver that the giant comprehensive catalog addresses is the management itch (that all of us should share and appreciate) of determining whether or not we've really done right by the taxpayer in releasing all our wares in a complete, discoverable, and accessible way. If we do that job right, then we should be able to drive all manner of "stop and shop where it makes the most sense" apps across government and the private and commercial sectors.

If we follow the data, information, and knowledge idea, data in context is information, and information leads to knowledge and action. Context is really important. Seismologists, ecologists, and other scientists are interested in different things than resource managers, energy developers, environmental and social activists, and policy analysts (and every other class of data consumer we can think about). All of us probably should have more data at our fingertips when doing whatever it is we're doing, so that we can develop a more robust characterization of whatever it is we're examining. But it's not a one-size-fits-all world. Why not go about this in a way that better enables the unanticipated good uses of our resource "listings"?

The most important thing about this (in my mind) is that we not conduct this as yet another data call. This process has to get baked into the different agencies at a level that is sustainable and evolvable over time with changing requirements, backend processes, and increased data holdings. The implementation needs to balance the need for some level of standardization (so that downstream consumers like Data.gov have a somewhat predictable playing field) while allowing for some reasonable variability in processes and methods, such that the data providers can figure out how to make it last. Some of us (gov agencies) have wonderfully mature catalog systems of formal metadata already in place. Others of us have dozens or hundreds of potential catalogs that might not all meet the same level of maturity. Still others have piles of "metadata" in every conceivable format and state of completeness. As with all technology, there are 50 different ways of doing anything. Perhaps we can do a little more work defining the use cases associated with the machine-readable aspect of this deal and then let the agencies come up with the creative ways of getting there.

From what I've been hearing and reading, providing a way for Data.gov to go from a "push me" to a "pull you" way of aggregating is one of those. I'd like to see that use case spelled out a little more in terms of what might be changing for catalog.data.gov. It would also be nice to understand if there is some difference in approach anticipated between catalog.data.gov, next.data.gov, and the various other x.data.gov things that seem to be going on. Another use case I eventually want to pursue is specifically with the major earth science agency partners (USGS, NOAA, NASA, USDA, EPA). Being a USGS guy, I know that there are USGS data assets and derivative data products that live in the holdings of other agencies. I can search NASA's ECHO catalog or NOAA's GeoPortal and find some of them.
If we had reached a level of maturity in uniquely identifying everything released with a registered DataCite DOI and referenced those everywhere, the problem of negotiating between different derivations of the same data and understanding authoritativeness might already be solved. But we ain't there yet. So, I might want to write some software to go looking for potential interconnections between things that I know about and things that NOAA knows about, based on the raw inventories we are each listing publicly. Sure, I might be able to do that using a Data.gov API once everything is all aggregated there nicely, but then again, maybe I'd rather develop a whole new algorithm based on creating a linked data asset from selective crawls of source material that's not supported by how Data.gov has gone about its aggregation or the form of data provided by its API. Having hopefully established some interconnections with things known from the USGS context, I want to exploit those in different ways: through data management practices to make the field cleaner, through recommender systems for end users, and through other methods. |
@waldoj said... ...fundamentally M-13-13 requires a complete inventory, on the agency website, within /data, as human- and machine-readable data... It was my understanding that /data/ was fundamentally for the public data listing part of this goal we're shooting for, with a data.json (or whatever we end up coming to through this discussion) and some type of human-readable interface (browse, search, etc.). But I understood the complete data inventory (both public and nonpublic data) to be another matter, potentially driven off of various data management systems, agency catalogs, etc. On a teleconference (last week, I think) there was discussion of some uses OMB might make of the inventory that would make it desirable to also have those available in the same type of JSON format, or in some way that facilitates cross-agency analysis. I don't know who here is "in the know", but could you elaborate on your thinking if you are? (Perhaps this issue ought to go off to a new thread.) |
So, circling back to the original concept posed by @jeffdlb and then built on by @ddnebert, it seems like we have the following two proposals:

1. Make the single-list approach optional by allowing the top-level data.json file to reference child inventory nodes (sub-files) that a harvester traverses to assemble the full list.
2. Allow the data.json file to contain entries pointing to existing standards-based catalogs (CSW services or WAFs) that harvesters such as Data.gov's CKAN instance already know how to index.
Those seem to be two widely different proposals (if I've got them right), and I wonder if they don't deserve to be restated to start separate threads for debate. The comment from @ddnebert above seems to point to Data.gov's CKAN implementation as the solution for OMB scrutiny and any other uses of a simple JSON output of discovery-level metadata, allowing agencies with formal metadata holdings to simply provide those as their public data listing without "dumbing down" the catalog to the simpler POD attributes. |
Well stated. I would also add that the CKAN implementation can already support harvest of .json and CSW/WAF. It is a minor tweak to identify catalog references within the agency .json file. We can experiment with exposing the federated catalog (CKAN) as filtered .json for developer access with all entries looking the same yet provide access and indexing of the robust metadata where it exists. |
@ddnebert what you describe looks something like this: http://gptogc.esri.com/geoportal/rest/repositories. That response is a very specific list of repositories registered in a catalog. A client could take this list for harvesting, syndication, brokering, synchronization, indexing (or whatever the term of the day is). I'm curious about the ways people expect to use Data.gov. Yes, there is a CKAN interface, and I've integrated with that to perform some basic searching, say for water quality. Currently the CKAN API doesn't seem to return a total count of items. On providing DCAT: you can also get the [same search results as DCAT](http://gptogc.esri.com/geoportal/rest/find/document?rid=CKAN&searchText=water quality&start=1&max=10&f=dcat) from the same site. All of this does not prevent or mandate full verbose ISO/FGDC metadata; it would just expose to a user what is relevant given the user's search request. What I think we all want to avoid is creating a one-of-a-kind process like that used in Recovery.gov for transparency reporting or in the EPA CDX environmental reporting. Those are good use cases for a specific process, because specific information is exchanged at set frequencies with one user. In the field of open data (geo or non-geo), we all want to find that one awesome dataset, regardless of where it is registered and regardless of what website/client we use... #different |
If I understand the details of this thread correctly, I'd like to offer libraries and archives as another use case in support of this proposal:
My understanding is that under this proposal there would be a /data.json that references pre-existing standard repositories that are in use by established communities of data publishers and developers. Archives and libraries are early adopters of open data, with MARC and EAD standards already published and in active use by various communities and partners. Notable is the Digital Public Library of America (relevant background here: http://dp.la/info/get-involved/partnerships/), which already aggregates a huge amount of public data. Many Smithsonian archives and libraries already have public data repositories that are contributed to DPLA. Like the scientific and technical government agencies, the Smithsonian Institution has a huge amount of open data, already in use by scientists and humanities researchers. I believe it would be a huge win to publicize these existing resources and reference well-established standards, drawing new developers and industry into existing communities of experts. |
I think this conversation boils down to two points that everyone can probably agree on: 1) we want it to be easy for people to find government data, and 2) we want it to be easy for government to make data available to the public. With those two core objectives in mind, I'd like to highlight what I perceive to be the primary ways to make that happen: […]
As @gbinal noted, we all know the geo agencies are way ahead of everyone else in terms of publishing and sharing data. However, the more complexity we add to the schema, the fewer people at other agencies will understand it. I think this is a great example of a time when it is best to make things simple. For the agencies which are already publishing metadata, it is relatively easy to convert that to a single data.json file (using either one of a number of tools listed on this site or a quick parser they can write themselves). While that data.json file may not contain the rich information they are used to publishing, it does mean that they are now on the same playing field as the other agencies publishing data.json files. Any other relevant information they want to provide can be listed as expanded fields, and any other linked data they want to publish can be listed under something like […]

In the interest of keeping things simple, my vote would be to focus, for now, on creating this single file. How agencies get there is up to them - if they want every internal organization to publish their own data.json file which is then aggregated into a single, top-level file, that's fine. But to introduce what would be fairly substantial changes to the schema this close to November would, in my opinion, add unneeded complexity and ultimately make it more difficult for third-party services (only one of which is data.gov) to accomplish the indexing we are trying to achieve. 👾 |
The JSON file is a means to an end, supporting syndication of content to be harvested into the index (search engine). How will programmers know where these agency files are? Programmers are not likely to attach to every agency to download, parse, and index the files themselves in order to find data. It is the search engine and API that make these feeds and the things they point to (like catalogs or data or services) most valuable. Our proposal is already simple and supports the common schema and indexing tools. The json file can include one or more references to metadata catalogs that contain more detail, in addition to raw metadata with its simpler descriptions. All the hundreds of thousands of records get indexed into the search engine in support of the two points you identify: 1) we want it to be easy for people to find government data, and 2) we want it to be easy for government to make data available to the public. The end user is given search capabilities in catalog.data.gov that do not exist on the json file - that is not its intention. Users will also see the same type of common record on initial delivery but are given the option to dive further, if interested, into the full metadata. The geospatial publishers are already up and running with CKAN support, and it does not impose any burden on the 'raw' data publishers. |
@ddnebert, it may be useful if you could develop a proposed schema change and submit it as a pull request. I'm not sure how your proposal is better than the status quo in regard to enabling programmers to know where the agency files are. With the status quo, we centralized this so that agency.gov/data.json is the known place to grab the data, or you can latch on to a service like Data.gov and query via API. Your proposal seems to complicate this with additional crawling needs and more schema information. However, if you could provide an example implementation, maybe it would help me understand more accurately. |
I can't see why not. It'd be trivial to loop through a list of every federal government website, grab the data.json file from each, and combine the results.

Your alternate proposal (as I understand it; there's no pull request to evaluate) requires that, instead, developers query a series of different types of catalog files, varying between agencies, nested within an existing dataset, just to find out which datasets are available. I can't see why we wouldn't expose the entire list of datasets at a single, predictable data.json file.

As I've said before, the stated goal of this endeavor is to "produce a single catalog or list of data managed in a single table, workspace, or other relevant location". Putting some data behind a different protocol runs counter to this goal (the data then ceases to be "in a single table"), and erects a significant hurdle to anybody who wants to syndicate that data (requiring parsing n catalog formats, instead of just 1). |
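A minimal sketch of the loop described above, assuming each agency publishes a v1.0 data.json whose top level is a JSON array of dataset entries; the domain list is illustrative, not a real registry:

```python
import json
import urllib.request

# Hypothetical list of agency domains; a real crawler would read a registry.
AGENCY_SITES = ["energy.example.gov", "doi.example.gov", "noaa.example.gov"]

def fetch_catalog(domain):
    """Fetch one agency's data.json (assumed to be a JSON array of datasets)."""
    url = "https://%s/data.json" % domain
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)

all_datasets = []
for domain in AGENCY_SITES:
    try:
        all_datasets.extend(fetch_catalog(domain))
    except Exception as err:  # skip agencies whose file is missing or malformed
        print("skipping %s: %s" % (domain, err))

print("aggregated %d dataset entries" % len(all_datasets))
```
|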
I made a pull request in May to amend the implementation document. There is no schema to change, only practice to codify: #4

My point is that there is already a robust open API on CKAN that enables query (search facets) using Lucene/Solr on all government data in catalog.data.gov. We always thought that the proposal for having JSON files was primarily to syndicate records for ingest into the catalog and search API. Compared to locating all the federal JSON files, indexing them yourself, and immediately being out of date, doesn't it make sense to use the existing search facilities and API? The CKAN harvester already knows how to parse all the json records and all the robust geospatial metadata. The result is a single searchable index - and actually, you could attach to and request a single colossal JSON file from CKAN if you wanted to, with a common identical schema, or a subset for an agency. Or, you could use the open query API to do much more advanced things. This has no other purpose than the stated goal: "to produce a single catalog or list of data" for the entirety of government. It is now a single, cached catalog with an API, not just a series of files at agencies. JSON and references to catalog services supply this index very nicely. |
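For comparison, a sketch of querying the aggregated catalog through the standard CKAN v3 action API (the interface catalog.data.gov exposes); the search term and paging values are arbitrary:

```python
import json
import urllib.parse
import urllib.request

# package_search is CKAN's standard dataset search action; q is the query,
# and rows/start are its usual paging controls.
params = urllib.parse.urlencode({"q": "water quality", "rows": 10, "start": 0})
url = "https://catalog.data.gov/api/3/action/package_search?" + params

with urllib.request.urlopen(url, timeout=30) as resp:
    result = json.load(resp)["result"]

print("total matches:", result["count"])
for pkg in result["results"]:
    print(pkg["title"])
```
|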
I don't want to beat a dead horse (I'd just be repeating my previous comments here), so I'll just say again that your proposal would leave thousands of datasets omitted from agencies' /data.json files, and that remains my objection. |
Perhaps the discussion is moot - it was clarified today on the POC call that the mandate applies to Departments and independent Agencies for execution. This means that, at least in our case, DOI will be collating, preparing, and feeding MAX. How individual bureaus work with the Departments to create this posting can be subject to other arrangements. The result will be a www.doi.gov/data.json file emanating from a CKAN instance at DOI with all our geospatial and raw metadata in it. Meanwhile, in catalog.data.gov, all the ingested metadata will be available for live search via the query API based on OpenSearch. |
Wouldn't developers prefer to query data.gov's CKAN API rather than track down each agency's data.json file individually? |
@bsweezy Yes, but data.gov needs to get that data from somewhere. Data.gov will pull from each data.json file, and external orgs that want to make a competitor can do the same. |
@seanherron with 'competitor' you surely meant 'an additional channel to open data goodness', right? ;-) |
@mhogeweg yes - my hope is that other groups continually put a little heat on the data.gov team to keep improving ;) |
I'm pretty psyched for the possibility of a competitor, which […] |
Perhaps we can maintain and publish on DataGov a list of all harvest sources.

Marion A. Royal |
This conversation seems to have devolved into a bit of philosophy and diverged from the original request. I'm interested if a practical decision has prevailed. Pagination is a simple capability that every developer and tool understands well. Anyone reading this thread has probably paginated over twitter/github/email/RSS feeds/etc. Catalogs are growing in size, and as @jeffdlb points out, a good, simple spec can grow adoption across multiple platforms and internally. We're already seeing catalogs in the 10k-100k+ range. For simple practicality, OpenSearch-Atom has helpful pagination link relations:

```xml
<link rel="self" href="http://example.com/New+York+History?pw=3&format=atom" type="application/atom+xml"/>
<link rel="first" href="http://example.com/New+York+History?pw=1&format=atom" type="application/atom+xml"/>
<link rel="previous" href="http://example.com/New+York+History?pw=2&format=atom" type="application/atom+xml"/>
<link rel="next" href="http://example.com/New+York+History?pw=4&format=atom" type="application/atom+xml"/>
<link rel="last" href="http://example.com/New+York+History?pw=42299&format=atom" type="application/atom+xml"/>
```
|
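A hedged sketch of how those same link relations might be carried in a paginated data.json; this envelope is hypothetical, since the published schema defines no pagination wrapper, and the URLs are placeholders:

```json
{
  "page": 3,
  "perPage": 1000,
  "links": {
    "first": "https://agency.example.gov/data.json?page=1",
    "previous": "https://agency.example.gov/data.json?page=2",
    "next": "https://agency.example.gov/data.json?page=4",
    "last": "https://agency.example.gov/data.json?page=42"
  },
  "dataset": []
}
```
|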
I've been discussing the initial issue raised here (single file vs. federated files) with others, and there's still conflict over the right balance. Agencies feel the need to be able to federate, but there's still a compelling interest in having the simple requirement that all data from an agency be accessible in a straightforward, direct way. |
FYI - This also overlaps with Issue #308. |
@jeffdlb, I am curious if these tools satisfy your original concerns: […] |
@rebeccawilliams - Thanks for your follow-up note. The short answer is No, they unfortunately do not satisfy the original concern.
The fundamental concerns raised in my original post are that a single monolithic list poses problems of inventory creation, maintenance, and usability; a distributed organization addresses all three. Again, thanks for following up. Regards, |
I agree with @jeffdlb's view. In Esri Geoportal Server we can harvest other data.json files and provide a single file, as well as provide pagination using data.json as one of the output formats of our OpenSearch endpoint. This would allow a harvester to page through the contents of a catalog quite easily, fetching chunks of the catalog instead of a single large file. To overcome the updating issue raised, we generate a 'cached' version of the catalog's content in data.json on a regular basis (hourly, daily, or weekly, depending on the desired frequency). |
😂😂😂😂😂😂 |
What? |
Current guidance is that each agency's "/data" inventory must be a single list in a file containing one JavaScript Object Notation (JSON) summary metadata record per dataset, even if our agency has tens of thousands of datasets distributed across multiple facilities and servers. I believe the single list will pose problems of inventory creation, maintenance, and usability. I enumerate my concerns below, but first I propose a specific solution.
PROPOSAL:
I recommend the single-list approach be made optional. Specifically, I suggest that the top-level JSON file be permitted to include either a list of datasets or a list of child nodes. Each node would at minimum have 'title' and 'accessURL' elements from your JSON schema (http://project-open-data.github.io/schema/), an agreed-upon value of 'format' such as "inventory_node" to indicate the destination is not a data file, and optionally some useful elements (e.g., person, mbox, modified, accessLevel, etc) describing that node. Each node could likewise include either a list of datasets or a list of children.
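Under this proposal, a top-level file of child nodes might look like the following sketch; the field names come from the schema and the description above, while the node titles and URLs are hypothetical:

```json
[
  {
    "title": "Observing Systems Inventory Node",
    "accessURL": "https://observations.example.gov/data.json",
    "format": "inventory_node",
    "modified": "2013-08-01",
    "accessLevel": "public"
  },
  {
    "title": "Business Data Inventory Node",
    "accessURL": "https://www.example.gov/data/business/data.json",
    "format": "inventory_node",
    "modified": "2013-07-15",
    "accessLevel": "public"
  }
]
```

A harvester encountering the "inventory_node" format would recurse into the accessURL; any other entry is an ordinary dataset record.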
CONCERNS REGARDING THE SINGLE-LIST APPROACH:
(1) We should not build these inventories only to support data.gov. We want to leverage this for other efforts internal to our agencies, for PARR, and to support other external portals such as the Global Earth Observing System of Systems (GEOSS) or the Global Change Master Directory (GCMD). A distributed organization will be more useful for them (even if data.gov itself could handle a single long unsorted list).
(2) The inventory will need to be compiled from many different sources, including multiple web-accessible folders (WAFs) of geospatial metadata, existing catalog servers, or other databases or lists. Each type of input will need some code to produce the required subset of the list, and then some software will need to merge everything into a giant list. Failure at any step in the process may cause the inventory to be out of date, incomplete, or broken entirely. A distributed organization will more easily allow most of the inventory to be correct or complete even if one sub-part is not, and will require much less code for fault-handling.
(3) Some of our data changes very frequently, on timescales of minutes or hours, while other data are only modified yearly or less frequently. A distributed organization will more easily allow partial updates and the addition (or removal) of new collections of data without having to regenerate the entire list.
(4) The inventory is supposed to include both our scientific observations and "business" data, and both public and non-public data. That alone suggests a top-level division into (for example) /data/science, /data/business, and /data/internal. The latter may need to be on a separate machine with different access control.
(5) It would be easier to create usable parallel versions of the inventory in formats other than JSON (e.g., HTML with schema.org tags) if the organization were distributed.
(6) I understand that the data.gov harvester has successfully parsed very long JSON files. However, recursive traversal of a web-based, directory-tree-like structure would be trivial for data.gov to implement, would be more scalable, and would solve many problems for the agencies and the users. Data.gov's own harvesting could even be helped: if the last-modified date on each node is checked, unchanged nodes can be skipped entirely, as in the sketch below.
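A minimal sketch of the recursive harvest described in point (6), using the hypothetical "inventory_node" convention from the proposal; last_seen stands in for whatever state the harvester kept from its previous crawl:

```python
import json
import urllib.request

def harvest(url, last_seen, datasets):
    """Walk a tree of inventory nodes, collecting ordinary dataset entries.

    Skips any node whose 'modified' date (ISO 8601, so plain string
    comparison works) has not advanced since the previous crawl.
    """
    with urllib.request.urlopen(url, timeout=30) as resp:
        entries = json.load(resp)  # assumed: a JSON array of entries
    for entry in entries:
        if entry.get("format") == "inventory_node":
            seen = last_seen.get(entry["accessURL"])
            if seen and entry.get("modified", "") <= seen:
                continue  # subtree unchanged since last harvest; skip it
            harvest(entry["accessURL"], last_seen, datasets)
        else:
            datasets.append(entry)  # ordinary dataset record
    return datasets

catalog = harvest("https://agency.example.gov/data.json", {}, [])
print(len(catalog), "datasets harvested")
```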