Single-file /data catalog not good--optional alternative suggested #105
@jeffdlb, is it fair to say that the Sitemap index standard is a good model here? |
@jeffdlb @waldoj I'd say so, and had done so before in issue #27, where I suggested pagination like that seen in OpenSearch providers, or breaking the catalog into smaller files as is done with sitemaps. I think Jeff's point about supporting the needs not just of Data.gov but also of agencies that support other initiatives deserves consideration here. |
I'd argue that the schema was written with developers in mind, not the ease of agency adoption (or Data.gov's). When the two are in conflict, we should err on the side of those we want to encourage to use the data, not those whose job it is to publish or organize the data. Going to multiple formats may make things easier for agencies, but it does so at the average developer's expense. This is a great case for practicality over purity. Practically, allowing sub-data files has two implications: […] Forcing agencies to make a single […] |
But doesn't the process that you're describing still necessitate that "some software will need to merge everything into a giant list"? It seems that complexity is being pushed out to the clients, rather than being resolved within the federal agency. |
A subtlety in this proposal, and one I opened as a suggestion on GitHub two months ago, is that we allow the data.json file to contain entries for existing standards-based catalogs or APIs. Each such collection - and we manage many in the geospatial domain - could be simply marked up, one entry per collection, within the agency's json file, with the breadcrumbs to perform the query and/or indexing. CKAN already has built-in harvesting capability on these established protocols, as well as json, so the integration challenge would be minor. It would produce the same, if not better, results as a serialized subset of metadata in feeds, since it all ends up in the searchable index. The benefits of this approach are many: […]
I propose the following text changes to the implementation document, implementation-guide.md, along with a modification of the harvest routine to recognize the catalog resource type within the feed:

A) Minimum Required for Compliance

Produce a single catalog or list of data managed in a single table, workspace, or other relevant location. Describe each dataset or existing metadata catalog according to the common core metadata. This listing can be maintained in a Data Management System (DMS) such as the open-source CKAN platform; a single spreadsheet, with each metadata field as its own column; or a DMS of your choosing. A description of each agency metadata catalog, such as CKAN, can be placed in the agency json file as a single entry. This entry will describe the resource type of "catalog" and the access URL to be used in harvest by data.gov.

Metadata for geographic or geospatial information is often collected using the FGDC Content Standard for Digital Geospatial Metadata or ISO 19115/19139 and represented as XML, providing content that maps to the common core metadata. These collections are exposed using the Open Geospatial Consortium Catalog Service for the Web interface (CSW 2.0.2) or as a read-enabled HTTP directory known as a Web Accessible Folder (WAF). In lieu of posting individual entries for each geospatial dataset in the json file, a single json entry should be prepared for each geospatial metadata collection (WAF) or service (CSW) as a "Harvest Source", enabling harvest of the collections by catalog.data.gov. Individual geospatial metadata entries for datasets, applications, or services should not be duplicated in the agency json feed. |
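For illustration, a minimal sketch of what such a catalog-reference entry might look like in an agency data.json, using the common core field names; the "catalog" format value is the convention proposed above (not part of the published schema), and the URL is hypothetical:

```json
{
  "title": "Agency Geospatial Metadata Catalog",
  "description": "CSW harvest source for the agency's ISO 19139 metadata collection.",
  "format": "catalog",
  "accessURL": "https://geodata.example.gov/csw?service=CSW&version=2.0.2&request=GetCapabilities",
  "accessLevel": "public",
  "modified": "2013-08-01"
}
```

A harvester that recognized the "catalog" format would hand the accessURL to its CSW (or WAF) harvest routine rather than treating it as an ordinary dataset entry. |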
Thanks @ddnebert for pointing out that agencies have FGDC/ISO metadata. I had posted a mapping of those specs to DCAT (the mapping is not trivial, given the absence of 1:N elements in DCAT, different date notations, and different interpretations of fields, to name a few issues) and submitted that as pull request #74. I would like to see your thoughts on that mapping. A second point is the focus on getting things to work for CKAN. I understand that Data.gov uses CKAN, but would it not be better if a solution is designed that works across the government regardless of technology? That is what the geospatial domain has been working on for many years and what was done to promote an open ecosystem of suppliers and consumers of data. To me that also relates to @jeffdlb's point regarding programs like GEOSS (which you play a key role in) and GCMD, not to mention Eye on Earth, UNEP Live, and various other global initiatives focused on open data sharing. |
@waldoj - I don't think breaking up the single list into a linked set of lists pushes complexity to the users. The end goal of the inventory is not just to have a list, it is to populate data.gov or other portals or commercial search engines. In all cases, entries in each list (whether there is one list per agency or many) will be going into a database of some type. That database will be updated one entry at a time by reading through the lists. |
@benbalter - If every bureau's sub-list is linked from the master list, then I believe the data would be less "siloed" than currently. At present any bureau-level inventories are not standardized, whereas this effort would standardize and link them. |
@ddnebert I like this suggestion. In fact, some geospatial energy data has already been published this way. The Department of Energy is helping create a datastore for geothermal data called the National Geothermal Data Store. One of the nodes of this system is the State Geothermal Database. This data store conforms to the ISO 19139 geospatial metadata standard discussed above, and uses the Catalog Service for the Web (CSW) standard to provide interoperability. Here's an example CSW endpoint to the same data store linked above. Because the CKAN harvester knows how to interoperate with CSW, all of the datasets (including the geospatial data) were able to be added to the catalog.

For agencies that have existing data portals that can be easily harvested by CKAN, pointing the harvester at the data store itself could make a lot of sense. I should note that not ALL of the Department's data will be easily harvested in this way. For legacy or custom data portals that did not use standards, or are otherwise not up to date, creating the data.json file is a good forcing function. But for systems that have this capability, why not use it? |
CKAN's CSW syndication functionality is great, but this initiative is not about expanding CKAN support within the federal government. The goal is to "produce a single catalog or list of data managed in a single table, workspace, or other relevant location". Putting some data behind a different protocol runs counter to this goal (the data then ceases to be "in a single table"), and erects a significant hurdle to anybody who wants to syndicate that data (requiring parsing 2+ catalog formats). Perhaps some tools could be built to solve this problem via both design patterns? That is, the need to syndicate data can be addressed via harvesting from CSW (for example), precisely as proposed here. Simultaneously, some tools can be built to extract that catalog data and convert it into the data.json format. |
Very good discussion. If I might add my couple of pennies...

History

The original DataGov metadata template created back in 2009 was done rather […] DCAT evolved as a means of connecting catalogs together, and in doing so, […] When we started tying DataGov with Geospatial One-Stop (GOS), we used an […] We kept this rich catalog of metadata along with its harvesting […] There are experts already in this discussion who know the details […]

So

We are asking the geospatial community to abandon their long-fought process […] They just can't do that. Of course they can develop tools to spit out […]

Today

Today we are preparing to harvest agency metadata using the new CORE schema […]

Reality (from my perspective)

The new CORE metadata schema is a temporary solution, not a permanent one. However, it may be what we need just now: a simple way for agencies to make […] Instead, we should seek a longer-term solution (maybe not permanent, but […]). I think that the long-term solution that we should be working toward is […]

So What (in my opinion)

The proposal that this simple (long) JSON file may contain pointers to more […]
Marion A. Royal PMP |
Regarding the statement: "CKAN's CSW syndication functionality is great, but this initiative is not about expanding CKAN support within the federal government. The goal is to "produce a single catalog or list of data managed in a single table, workspace, or other relevant location". " it is my understanding that the intention of the .json feeds was to feed the single government data search engine (powered by CKAN) to create a comprehensive view of governmental metadata. The creation of a feed (or catalog service) is only a means to an end since the metadata cache at CKAN provides the necessary interface to access all governmental metadata, or selected metadata via query. BTW, CKAN has the ability to expose its results in json and RDFa, albeit a DCAT flavor, as the result of a query. So, if it is json format that you need, it is also accessible there in addition to several other APIs. By allowing multiple (already supported) protocols and formats, we have now created a seamless virtual catalog of government metadata. We require the option to reference the existence of metadata collections as WAF or CSW individual entries within the agency json feed. This is an operational solution that fulfills the requirements of both the geospatial and 'raw' data communities - allowing agency and federated views, exposing actionable APIs, enabling counts and tracking, and most importantly enabling access to the data, services, and applications of our federal, state, local, tribal, and academic partners. |
The stated purpose of providing the specified data.json file, as described in Slash Data Catalog Requirements, is "to make [the catalog] more easily discoverable to private-sector developers and entrepreneurs." Data.gov is just one client. |
The proposed hybrid solution that feeds a common search and retrieval API (that can return json and other formats) supports that requirement even better. With /data in every agency, developers would need to locate such directories and then visit each one individually. With a common search interface built on the json syndication, developers will have an easier time interacting with the metadata through a single entrypoint. |
I follow, but M-13-13 says: […]

That's a per-agency mandate. There are lots of details about implementation that can be altered, but fundamentally M-13-13 requires a complete inventory, on the agency website, within /data, as human- and machine-readable data. |
One could interpret the requirement as satisfied by including in the json an entry for each collection description. You're right that it is a 'mandate' rather than an explanation of the objective or outcome. If the desired outcome is to produce links to all government metadata, the hybrid solution satisfies it. We should recommend modification of M-13-13 to include the implemented capabilities that already provide access to over half a million government data asset descriptions. I'm also thinking that json does not strictly satisfy the M-13-13 desire for a 'human readable' format; the query results from CKAN can be formatted and styled in many ways. |
Just to echo part of @MarionRoyal's point about the pragmatic push currently going on, I think it's worth noting that we all know NOAA and USGS are the two 900-lb. gorillas when it comes to number of entries. They account for 5/6 of the datasets in Data.gov currently (32k and 18k entries respectively). I agree with @waldoj that the move is to create a solution for them while keeping intact the simple and clear agency.gov/data.json requirement as it currently exists for the other 168 agencies currently reporting data in Data.gov. I'm working with the 98% of agencies that should be able to handle this fine for the short and medium term, but I agree that we need to figure out something that can scale for NOAA and USGS. I agree that each level of complexity we introduce to the structure of data.json files increases the burden for third-party adoption and costs us more in the long run. |
+1 to @waldoj's comment, and -1 to @ddnebert's response (sorry, Doug, I have to disagree with you on this one). Is Data.gov the "one ring to rule them all"? I think the world has moved well beyond the one-stop-shop paradigm, whether we're talking about data assets or shoes. As far as I'm concerned, the big driver that the giant comprehensive catalog addresses is the management itch (that all of us should share and appreciate) of determining whether or not we've really done right by the taxpayer in releasing all our wares in a complete, discoverable, and accessible way. If we do that job right, then we should be able to drive all manner of "stop and shop where it makes the most sense" apps across government and the private and commercial sectors.

If we follow the data, information, and knowledge idea, data in context is information, and information leads to knowledge and action. Context is really important. Seismologists, ecologists, and other scientists are interested in different things than resource managers, energy developers, environmental and social activists, and policy analysts (and every other class of data consumer we can think about). All of us probably should have more data at our fingertips when doing whatever it is we're doing, so that we can develop a more robust characterization of whatever it is we're examining. But it's not a one-size-fits-all world. Why not go about this in a way that better enables the unanticipated good uses of our resource "listings"?

The most important thing about this (in my mind) is that we not conduct this as yet another data call. This process has to get baked into the different agencies at a level that is sustainable and evolvable over time with changing requirements, backend processes, and increased data holdings. The implementation needs to balance the need for some level of standardization (so that downstream consumers like Data.gov have a somewhat predictable playing field) while allowing for some reasonable variability in processes and methods, such that the data providers can figure out how to make it last. Some of us (gov agencies) have wonderfully mature catalog systems of formal metadata already in place. Others of us have dozens or hundreds of potential catalogs that might not all meet the same level of maturity. Still others have piles of "metadata" in every conceivable format and state of completeness. As with all technology, there are 50 different ways of doing anything. Perhaps we can do a little more work defining the use cases associated with the machine-readable aspect of this deal and then let the agencies come up with the creative ways of getting there.

From what I've been hearing and reading, providing a way for Data.gov to go from a "push me" to a "pull you" way of aggregating is one of those. I'd like to see that use case spelled out a little more in terms of what might be changing for catalog.data.gov. It would also be nice to understand if there is some difference in approach anticipated between catalog.data.gov, next.data.gov, and the various other x.data.gov things that seem to be going on. Another use case I eventually want to pursue is specifically with the major earth science agency partners (USGS, NOAA, NASA, USDA, EPA). Being a USGS guy, I know that there are USGS data assets and derivative data products that live in the holdings of other agencies. I can search NASA's ECHO catalog or NOAA's GeoPortal and find some of them.
If we had reached a level of maturity in uniquely identifying everything released with a registered DataCite DOI and referenced those everywhere, the problem of negotiating between different derivations of the same data and understanding authoritativeness might already be solved. But we ain't there yet. So, I might want to write some software to go looking for potential interconnections between things that I know about and things that NOAA knows about, based on the raw inventories we are each listing publicly. Sure, I might be able to do that using a Data.gov API once everything is all aggregated there nicely, but then again, maybe I'd rather develop a whole new algorithm based on creating a linked data asset from selective crawls of source material that's not supported by how Data.gov has gone about its aggregation or the form of data provided by its API. Having hopefully established some interconnections with things known from the USGS context, I want to exploit those in different ways: through data management practices to make the field cleaner, through recommender systems for end users, and through other methods. |
@waldoj said... ...fundamentally M-13-13 requires a complete inventory, on the agency website, within /data, as human- and machine-readable data... It was my understanding that /data/ was fundamentally for the public data listing part of this goal we're shooting for, with a data.json (or whatever we end up coming to through this discussion) and some type of human-readable interface (browse, search, etc.). But I understood the complete data inventory (both public and nonpublic data) to be another matter, potentially driven off of various data management systems, agency catalogs, etc. On a teleconference (last week, I think) there was discussion of some uses OMB might make of the inventory that would make it desirable to also have those available in the same type of JSON format, or in some way that facilitates cross-agency analysis. I don't know who here is "in the know", but could you elaborate on your thinking if you are? (Perhaps this issue ought to go off to a new thread.) |
So, circling back to the original concept posed by @jeffdlb and then built on by @ddnebert, it seems like we have the following two proposals:

1. Make the single-list approach optional by allowing the top-level data.json file to reference child inventory nodes (sub-files) that a harvester traverses to assemble the full list.
2. Allow the data.json file to contain entries pointing to existing standards-based catalogs (CSW services or WAFs) that harvesters such as Data.gov's CKAN instance already know how to index.
Those seem to be two widely different proposals (if I've got them right), and I wonder if they don't deserve to be restated to start separate threads for debate. The comment from @ddnebert above seems to point to Data.gov's CKAN implementation as the solution for OMB scrutiny and any other uses of a simple JSON output of discovery-level metadata, allowing agencies with formal metadata holdings to simply provide those as their public data listing without "dumbing down" the catalog to the simpler POD attributes. |
Well stated. I would also add that the CKAN implementation can already support harvest of .json and CSW/WAF. It is a minor tweak to identify catalog references within the agency .json file. We can experiment with exposing the federated catalog (CKAN) as filtered .json for developer access with all entries looking the same yet provide access and indexing of the robust metadata where it exists. |
@ddnebert what you describe looks something like this: http://gptogc.esri.com/geoportal/rest/repositories. That response is a very specific list of repositories registered in a catalog. A client could take this list for harvesting, syndication, brokering, synchronization, indexing (or whatever the term of the day is). I'm curious about the ways people expect to use Data.gov. Yes, there is a CKAN interface, and I've integrated with that to perform some basic searching, say for water quality. Currently the CKAN API doesn't seem to return a total count of items. On providing DCAT: you can also get the [same search results as DCAT](http://gptogc.esri.com/geoportal/rest/find/document?rid=CKAN&searchText=water quality&start=1&max=10&f=dcat) from the same site. All of this does not prevent or mandate full verbose ISO/FGDC metadata; it would just expose to a user what is relevant given the user's search request. What I think we all want to avoid is creating a one-of-a-kind process like that used in Recovery.gov for transparency reporting or in the EPA CDX environmental reporting. Those are good use cases for a specific process, because specific information is exchanged at set frequencies with one user. In the field of open data (geo or non-geo), we all want to find that one awesome dataset, regardless of where it is registered and regardless of what website/client we use... #different |
If I understand the details of this thread correctly, I'd like to offer libraries and archives as another use case in support of this proposal:
My understanding is that under this proposal there would be a /data.json that references pre-existing standard repositories that are in use by established communities of data publishers and developers. Archives and libraries are early adopters of open data, with MARC and EAD standards already published and in active use by various communities and partners. Notable is the Digital Public Library of America (relevant background here: http://dp.la/info/get-involved/partnerships/), which already aggregates a huge amount of public data. Many Smithsonian archives and libraries already have public data repositories that are contributed to DPLA. Like the scientific and technical government agencies, the Smithsonian Institution has a huge amount of open data, already in use by scientists and humanities researchers. I believe it would be a huge win to publicize these existing resources and reference well-established standards, drawing new developers and industry into existing communities of experts. |
I think this conversation boils down to two points that everyone can probably agree on: 1) we want it to be easy for people to find government data, and 2) we want it to be easy for government to make data available to the public. With those two core objectives in mind, I'd like to highlight what I perceive to be the primary ways to make that happen: […]
As @gbinal noted, we all know the geo agencies are way ahead of everyone else in terms of publishing and sharing data. However, the more complexity we add to the schema, the fewer people at other agencies will understand it. I think this is a great example of a time when it is best to make things simple. For the agencies which are already publishing metadata, it is relatively easy to convert that to a single data.json file (using either one of a number of tools listed on this site or a quick parser they can write themselves). While that data.json file may not contain the rich information they are used to publishing, it does mean that they are now on the same playing field as the other agencies publishing data.json files. Any other relevant information they want to provide can be listed as expanded fields, and any other linked data they want to publish can be listed under something like […]

In the interest of keeping things simple, my vote would be to focus, for now, on creating this single file. How agencies get there is up to them - if they want every internal organization to publish their own data.json file which is then aggregated into a single, top-level file, that's fine. But to introduce what would be fairly substantial changes to the schema this close to November would, in my opinion, add unneeded complexity and ultimately make it more difficult for third-party services (only one of which is data.gov) to accomplish the indexing we are trying to achieve. 👾 |
The JSON file is a means to an end, supporting syndication of content to be harvested into the index (search engine). How will programmers know where these agency files are? Programmers are not likely to attach to every agency to download, parse, and index the files themselves in order to find data. It is the search engine and API that make these feeds and the things they point to (like catalogs or data or services) most valuable. Our proposal is already simple and supports the common schema and indexing tools. The json file can include one or more references to metadata catalogs that contain more detail, in addition to raw metadata with its simpler descriptions. All the hundreds of thousands of records get indexed into the search engine in support of the two points you identify: 1) we want it to be easy for people to find government data, and 2) we want it to be easy for government to make data available to the public. The end user is given search capabilities in catalog.data.gov that do not exist on the json file - that is not its intention. Users will also see the same type of common record on initial delivery but are given the option to dive further, if interested, into the full metadata. The geospatial publishers are already up and running with CKAN support, and it does not impose any burden on the 'raw' data publishers. |
@ddnebert, it may be useful if you could develop a proposed schema change and submit it as a pull request. I'm not sure how your proposal is better than the status quo in regard to enabling programmers to know where the agency files are. With the status quo, we centralized this so that agency.gov/data.json is the known place to grab the data, or you can latch on to a service like Data.gov and query via API. Your proposal seems to complicate this with additional crawling needs and more schema information. However, if you could provide an example implementation, maybe it would help me understand more accurately. |
I can't see why not. It'd be trivial to loop through a list of every federal government website, grab the data.json file from each, and combine the results.

Your alternate proposal (as I understand it; there's no pull request to evaluate) requires that, instead, developers query a series of different types of catalog files, varying between agencies, nested within an existing dataset, just to find out which datasets are available. I can't see why we wouldn't expose the entire list of datasets at a single, predictable data.json file.

As I've said before, the stated goal of this endeavor is to "produce a single catalog or list of data managed in a single table, workspace, or other relevant location". Putting some data behind a different protocol runs counter to this goal (the data then ceases to be "in a single table"), and erects a significant hurdle to anybody who wants to syndicate that data (requiring parsing n catalog formats, instead of just 1). |
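A minimal sketch of the loop described above, assuming each agency publishes a v1.0 data.json whose top level is a JSON array of dataset entries; the domain list is illustrative, not a real registry:

```python
import json
import urllib.request

# Hypothetical list of agency domains; a real crawler would read a registry.
AGENCY_SITES = ["energy.example.gov", "doi.example.gov", "noaa.example.gov"]

def fetch_catalog(domain):
    """Fetch one agency's data.json (assumed to be a JSON array of datasets)."""
    url = "https://%s/data.json" % domain
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)

all_datasets = []
for domain in AGENCY_SITES:
    try:
        all_datasets.extend(fetch_catalog(domain))
    except Exception as err:  # skip agencies whose file is missing or malformed
        print("skipping %s: %s" % (domain, err))

print("aggregated %d dataset entries" % len(all_datasets))
```
|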
I made a pull request in May to amend the implementation document. There is no schema to change, only practice to codify: #4

My point is that there is already a robust open API on CKAN that enables query (search facets) using Lucene/Solr on all government data in catalog.data.gov. We always thought that the proposal for having JSON files was primarily to syndicate records for ingest into the catalog and search API. Compared to locating all the federal JSON files, indexing them yourself, and immediately being out of date, doesn't it make sense to use the existing search facilities and API? The CKAN harvester already knows how to parse all the json records and all the robust geospatial metadata. The result is a single searchable index - and actually, you could attach to and request a single colossal JSON file from CKAN if you wanted to, with a common identical schema, or a subset for an agency. Or, you could use the open query API to do much more advanced things. This has no other purpose than the stated goal: "to produce a single catalog or list of data" for the entirety of government. It is now a single, cached catalog with an API, not just a series of files at agencies. JSON and references to catalog services supply this index very nicely. |
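For comparison, a sketch of querying the aggregated catalog through the standard CKAN v3 action API (the interface catalog.data.gov exposes); the search term and paging values are arbitrary:

```python
import json
import urllib.parse
import urllib.request

# package_search is CKAN's standard dataset search action; q is the query,
# and rows/start are its usual paging controls.
params = urllib.parse.urlencode({"q": "water quality", "rows": 10, "start": 0})
url = "https://catalog.data.gov/api/3/action/package_search?" + params

with urllib.request.urlopen(url, timeout=30) as resp:
    result = json.load(resp)["result"]

print("total matches:", result["count"])
for pkg in result["results"]:
    print(pkg["title"])
```
|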
I don't want to beat a dead horse (I'd just be repeating my previous comments here), so I'll just say again that your proposal would leave thousands of datasets omitted from agencies' /data.json files, and that remains my objection. |
Perhaps the discussion is moot - it was clarified today on the POC call that the mandate applies to Departments and independent Agencies for execution. This means that, at least in our case, DOI will be collating, preparing, and feeding MAX. How individual bureaus work with the Departments to create this posting can be subject to other arrangements. The result will be a www.doi.gov/data.json file emanating from a CKAN instance at DOI with all our geospatial and raw metadata in it. Meanwhile, in catalog.data.gov, all the ingested metadata will be available for live search via the query API based on OpenSearch. |
Wouldn't developers prefer to query data.gov's CKAN API rather than track down each agency's data.json file individually? |
@bsweezy Yes, but data.gov needs to get that data from somewhere. Data.gov will pull from each data.json file, and external orgs that want to make a competitor can do the same. |
@seanherron with 'competitor' you surely meant 'an additional channel to open data goodness', right? ;-) |
@mhogeweg yes - my hope is that other groups continually put a little heat on the data.gov team to keep improving ;) |
I'm pretty psyched for the possibility of a competitor, which […] |
Perhaps we can maintain and publish on DataGov a list of all harvest sources.

Marion A. Royal |
This conversation seems to have devolved into a bit of philosophy and diverged from the original request. I'm interested if a practical decision has prevailed. Pagination is a simple capability that every developer and tool understands well. Anyone reading this thread has probably paginated over twitter/github/email/RSS feeds/etc. Catalogs are growing in size, and as @jeffdlb points out, a good, simple spec can grow adoption across multiple platforms and internally. We're already seeing catalogs in the 10k-100k+ range. For simple practicality, OpenSearch-Atom has helpful pagination link relations:

```xml
<link rel="self" href="http://example.com/New+York+History?pw=3&format=atom" type="application/atom+xml"/>
<link rel="first" href="http://example.com/New+York+History?pw=1&format=atom" type="application/atom+xml"/>
<link rel="previous" href="http://example.com/New+York+History?pw=2&format=atom" type="application/atom+xml"/>
<link rel="next" href="http://example.com/New+York+History?pw=4&format=atom" type="application/atom+xml"/>
<link rel="last" href="http://example.com/New+York+History?pw=42299&format=atom" type="application/atom+xml"/>
```
|
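A hedged sketch of how those same link relations might be carried in a paginated data.json; this envelope is hypothetical, since the published schema defines no pagination wrapper, and the URLs are placeholders:

```json
{
  "page": 3,
  "perPage": 1000,
  "links": {
    "first": "https://agency.example.gov/data.json?page=1",
    "previous": "https://agency.example.gov/data.json?page=2",
    "next": "https://agency.example.gov/data.json?page=4",
    "last": "https://agency.example.gov/data.json?page=42"
  },
  "dataset": []
}
```
|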
I've been discussing the initial issue raised here (single file vs. federated files) with others, and there's still conflict over the right balance. Agencies feel the need to be able to federate, but there's still a compelling interest in having the simple requirement that all data from an agency be accessible in a straightforward, direct way. |
FYI - This also overlaps with Issue #308. |
@jeffdlb, I am curious if these tools satisfy your original concerns: […] |
@rebeccawilliams - Thanks for your follow-up note. The short answer is No, they unfortunately do not satisfy the original concern.
The fundamental concerns raised in my original post are that a single monolithic list poses problems of inventory creation, maintenance, and usability; a distributed organization addresses all three. Again, thanks for following up. Regards, |
I agree with @jeffdlb's view. In Esri Geoportal Server we can harvest other data.json files and provide a single file, as well as provide pagination using data.json as one of the output formats of our OpenSearch endpoint. This would allow a harvester to page through the contents of a catalog quite easily, fetching chunks of the catalog instead of a single large file. To overcome the updating issue raised, we generate a 'cached' version of the catalog's content in data.json on a regular basis (hourly, daily, or weekly, depending on the desired frequency). |
😂😂😂😂😂😂 |
What? |
Current guidance is that each agency's "/data" inventory must be a single list in a file containing one JavaScript Object Notation (JSON) summary metadata record per dataset, even if our agency has tens of thousands of datasets distributed across multiple facilities and servers. I believe the single list will pose problems of inventory creation, maintenance, and usability. I enumerate my concerns below, but first I propose a specific solution.
PROPOSAL:
I recommend the single-list approach be made optional. Specifically, I suggest that the top-level JSON file be permitted to include either a list of datasets or a list of child nodes. Each node would at minimum have 'title' and 'accessURL' elements from your JSON schema (http://project-open-data.github.io/schema/), an agreed-upon value of 'format' such as "inventory_node" to indicate the destination is not a data file, and optionally some useful elements (e.g., person, mbox, modified, accessLevel, etc) describing that node. Each node could likewise include either a list of datasets or a list of children.
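Under this proposal, a top-level file of child nodes might look like the following sketch; the field names come from the schema and the description above, while the node titles and URLs are hypothetical:

```json
[
  {
    "title": "Observing Systems Inventory Node",
    "accessURL": "https://observations.example.gov/data.json",
    "format": "inventory_node",
    "modified": "2013-08-01",
    "accessLevel": "public"
  },
  {
    "title": "Business Data Inventory Node",
    "accessURL": "https://www.example.gov/data/business/data.json",
    "format": "inventory_node",
    "modified": "2013-07-15",
    "accessLevel": "public"
  }
]
```

A harvester encountering the "inventory_node" format would recurse into the accessURL; any other entry is an ordinary dataset record.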
CONCERNS REGARDING THE SINGLE-LIST APPROACH:
(1) We should not build these inventories only to support data.gov. We want to leverage this for other efforts internal to our agencies, for PARR, and to support other external portals such as the Global Earth Observing System of Systems (GEOSS) or the Global Change Master Directory (GCMD). A distributed organization will be more useful for them (even if data.gov itself could handle a single long unsorted list).
(2) The inventory will need to be compiled from many different sources, including multiple web-accessible folders (WAFs) of geospatial metadata, existing catalog servers, or other databases or lists. Each type of input will need some code to produce the required subset of the list, and then some software will need to merge everything into a giant list. Failure at any step in the process may cause the inventory to be out of date, incomplete, or broken entirely. A distributed organization will more easily allow most of the inventory to be correct or complete even if one sub-part is not, and will require much less code for fault-handling.
(3) Some of our data changes very frequently, on timescales of minutes or hours, while other data are only modified yearly or less frequently. A distributed organization will more easily allow partial updates and the addition (or removal) of new collections of data without having to regenerate the entire list.
(4) The inventory is supposed to include both our scientific observations and "business" data, and both public and non-public data. That alone suggests a top-level division into (for example) /data/science, /data/business, and /data/internal. The latter may need to be on a separate machine with different access control.
(5) It would be easier to create usable parallel versions of the inventory in formats other than JSON (e.g., HTML with schema.org tags) if the organization were distributed.
(6) I understand that the data.gov harvester has successfully parsed very long JSON files. However, recursive traversal of a web-based, directory-tree-like structure would be trivial for data.gov to implement, would be more scalable, and would solve many problems for the agencies and the users. Data.gov's own harvesting could even be helped: if the last-modified date on each node is checked, unchanged nodes can be skipped entirely, as in the sketch below.
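A minimal sketch of the recursive harvest described in point (6), using the hypothetical "inventory_node" convention from the proposal; last_seen stands in for whatever state the harvester kept from its previous crawl:

```python
import json
import urllib.request

def harvest(url, last_seen, datasets):
    """Walk a tree of inventory nodes, collecting ordinary dataset entries.

    Skips any node whose 'modified' date (ISO 8601, so plain string
    comparison works) has not advanced since the previous crawl.
    """
    with urllib.request.urlopen(url, timeout=30) as resp:
        entries = json.load(resp)  # assumed: a JSON array of entries
    for entry in entries:
        if entry.get("format") == "inventory_node":
            seen = last_seen.get(entry["accessURL"])
            if seen and entry.get("modified", "") <= seen:
                continue  # subtree unchanged since last harvest; skip it
            harvest(entry["accessURL"], last_seen, datasets)
        else:
            datasets.append(entry)  # ordinary dataset record
    return datasets

catalog = harvest("https://agency.example.gov/data.json", {}, [])
print(len(catalog), "datasets harvested")
```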