This repository has been archived by the owner on Apr 16, 2020. It is now read-only.

Download all of data.gov #113

Open
flyingzumwalt opened this issue Jan 17, 2017 · 12 comments

@flyingzumwalt
Contributor

flyingzumwalt commented Jan 17, 2017

For more info about this task, what we will do with the data, and how it relates to other archival efforts, see Issue #87

Story

Jack downloads all of the datasets from data.gov (~350TB) to storage devices on Stanford's network.

What will be Downloaded

The data.gov website is a portal that allows you to find all the "open data" datasets published by US federal agencies. It currently lists over 190,000 datasets.

The goal is to download those datasets, back them up, and use IPFS to replicate the data across a network of participating/collaborating nodes.

@mejackreed has posted all of the metadata from data.gov, which contains pointers to the datasets and basic metadata about them. The metadata are in ckan.json files, which you can view at https://github.com/OpenGeoMetadata/gov.data. That will be the main starting point for running all of the scripts that download the datasets.
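To make that starting point concrete, here is a minimal sketch (Python) of walking a local clone of gov.data and pulling resource URLs out of the ckan.json files. The field names (`resources`, `url`) assume the standard CKAN package layout; adjust them if the files in the repository differ.

```python
import json
from pathlib import Path

def iter_resource_urls(repo_root="gov.data"):
    """Yield every resource URL found in ckan.json files under repo_root."""
    for ckan_file in Path(repo_root).rglob("ckan.json"):
        try:
            package = json.loads(ckan_file.read_text())
        except (OSError, json.JSONDecodeError):
            continue  # skip unreadable or malformed metadata files
        for resource in package.get("resources", []):
            url = resource.get("url")
            if url:
                yield url

if __name__ == "__main__":
    for url in iter_resource_urls():
        print(url)
```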

@jonnycrunch

Does this really need to be >300TB? After looking at the data, there is a lot of redundancy: the same data appears in CSV, HTML, and JSON. Does only one organization have to load the entire 300TB? Most of the data can be broken up into categories like 'health', 'environment', and 'agriculture', and is composed of heterogeneous files (typically a few hundred MB per file). The metadata describing the data would be most important (Publisher, Identifier, modified date, etc.).

@mejackreed

We have the CKAN metadata already. And yes, I agree some of the data is redundant, based on how ArcGIS OpenData allows for different types of exports. A smarter heuristic for this would be nice, but may take some more analysis time.

@flyingzumwalt
Contributor Author

@mejackreed do you think you will need help writing the download scripts or running them? We can probably find people to help you.

@mejackreed

Sure thing. Help definitely wanted! I already have a naive downloader here: https://github.com/mejackreed/GovScooper/blob/master/README.md#usage

@flyingzumwalt
Contributor Author

cc @jbenet @gsf @b5

@b5

b5 commented Jan 18, 2017

Happy to help!

I think it makes sense to first decide whether or not to download in passes, using metadata to cut down on data redundancy (as per @jonnycrunch's suggestion), or to just grab the whole thing in one go. I'd personally vote for the "passes" approach, but only after first checking that the data is truly redundant.

@mejackreed

mejackreed commented Jan 18, 2017

Yep, I have an idea on how to evaluate whether or not the data is redundant. Resources that come from a server whose URL contains arcgis.com and that have .geojson + .csv + .kml exports are usually just transformations of the same data. We need a way to recognize these types of datasets/resources and codify the heuristics.

An example: https://github.com/OpenGeoMetadata/gov.data/blob/8f440134f13e7559086e7a07b8081098198c9a18/ad/01/6d/50/3d/38/4b/50/bc/b9/e5/62/2f/d7/c0/1b/ad016d503d384b50bcb9e5622fd7c01b/ckan.json
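Something like the following sketch could codify it. This is only a hypothetical illustration: it keys on the CKAN `format` field and an assumed format priority, while the example above keys on file extensions, so the check may need to move to URL suffixes instead.

```python
from urllib.parse import urlparse

# Assumed priority when the same layer is exported in several formats.
PREFERRED_FORMATS = ["geojson", "csv", "kml"]

def deduplicate_arcgis_resources(resources):
    """Given a CKAN package's resources, drop redundant ArcGIS export formats."""
    arcgis = [r for r in resources
              if "arcgis.com" in urlparse(r.get("url", "")).netloc]
    formats = {r.get("format", "").lower() for r in arcgis}
    if not {"geojson", "csv", "kml"} <= formats:
        return resources  # not the redundant-export pattern; keep everything
    # Keep all non-ArcGIS resources plus a single preferred-format export.
    keep = next(r for fmt in PREFERRED_FORMATS for r in arcgis
                if r.get("format", "").lower() == fmt)
    return [r for r in resources if r not in arcgis] + [keep]
```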

@jonnycrunch

There are 194422 distinct entries in the catalog. The metadata is about 2GB.

https://catalog.data.gov/api/3/action/package_search?rows=1&start=0

Here is an example of one entry:
https://catalog.data.gov/api/3/action/package_show?id=1e68f387-5f1c-46c0-a0d1-46044ffef5bf
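For anyone scripting against that endpoint, a minimal paging sketch (the `rows`/`start` parameters and the `result.results` layout are the standard CKAN 3 API; the server may cap the page size):

```python
import json
import urllib.request

BASE = "https://catalog.data.gov/api/3/action/package_search"

def iter_packages(page_size=1000):
    """Yield every package record by paging through the CKAN search API."""
    start = 0
    while True:
        url = f"{BASE}?rows={page_size}&start={start}"
        with urllib.request.urlopen(url) as resp:
            result = json.load(resp)["result"]
        if not result["results"]:
            break
        yield from result["results"]
        start += page_size

if __name__ == "__main__":
    print(sum(1 for _ in iter_packages()), "packages seen")
```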

Each entry has a resource list.

A first pass could be to hit all of the URLs in each resource list and grab the 'Content-Length' headers to calculate the exact amount of space needed, while simultaneously gathering all of the necessary resource URLs.
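A rough sketch of that pass, assuming a flat list of resource URLs (e.g. from the ckan.json files) is already in hand; as noted further down in this thread, many servers return no Content-Length, so the list of misses matters as much as the total:

```python
import urllib.request

def estimate_total_size(urls, timeout=10):
    """HEAD each URL; return (sum of Content-Length bytes, URLs without one)."""
    total, missing = 0, []
    for url in urls:
        req = urllib.request.Request(url, method="HEAD")
        try:
            with urllib.request.urlopen(req, timeout=timeout) as resp:
                length = resp.headers.get("Content-Length")
        except Exception:
            length = None  # unreachable or erroring resource
        if length and length.isdigit():
            total += int(length)
        else:
            missing.append(url)  # needs a follow-up pass to size it
    return total, missing
```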

There are also some metadata schema resources referenced in the 'extras' section that would be important to grab: https://project-open-data.cio.gov/v1.1/schema/catalog.jsonld

@jonnycrunch

Mmm, earlier there were 194422 entries; now there are only 194401. Now I understand the urgency!

@b5

b5 commented Jan 18, 2017

+1 for hitting all resources for Content-Length. I'd add grabbing the filetype while we're at it. Quick browsing showed some of the resources listed were .zip archives (ugh).

@mejackreed

So in my initial tests of downloading these resources, many of them unfortunately do not return a Content-Length header. Hoping to kick off some larger runs this afternoon to get more details.

@mejackreed

@jonnycrunch 194014 entries here: https://github.com/OpenGeoMetadata/gov.data

Best to grab the archive.zip; the layers.json file is easy to parse.
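For example (a sketch only; the layout inside archive.zip and the exact shape of layers.json are assumptions to check against the actual repository):

```python
import json
import zipfile
from pathlib import Path

# Extract the repository archive, then locate layers.json wherever it sits.
with zipfile.ZipFile("archive.zip") as zf:
    zf.extractall("gov.data-archive")

layers_file = next(Path("gov.data-archive").rglob("layers.json"))
layers = json.loads(layers_file.read_text())
print(f"{len(layers)} layers listed in {layers_file}")
```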
