This repository has been archived by the owner on Apr 16, 2020. It is now read-only.

Download all of data.gov #113

Open
flyingzumwalt opened this issue Jan 17, 2017 · 12 comments

@flyingzumwalt
Contributor

flyingzumwalt commented Jan 17, 2017

For more info about this task, what we will do with the data, and how it relates to other archival efforts, see Issue #87

Story

Jack downloads all of the datasets from data.gov (~350TB) to storage devices on Stanford's network.

What will be Downloaded

The data.gov website is a portal that allows you to find all the "open data" datasets published by US federal agencies. It currently lists over 190,000 datasets.

The goal is to download those datasets, back them up, and use IPFS to replicate the data across a network of participating/collaborating nodes.

@mejackreed has posted all of the metadata from data.gov, which contains pointers to the datasets and basic metadata about them. The metadata are in ckan.json files, which you can view at https://github.com/OpenGeoMetadata/gov.data. That will be the main starting point for running all of the scripts that download the datasets.
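To make that starting point concrete, here is a minimal sketch (Python) of walking a local clone of gov.data and pulling resource URLs out of the ckan.json files. The field names (`resources`, `url`) assume the standard CKAN package layout; adjust them if the files in the repository differ.

```python
import json
from pathlib import Path

def iter_resource_urls(repo_root="gov.data"):
    """Yield every resource URL found in ckan.json files under repo_root."""
    for ckan_file in Path(repo_root).rglob("ckan.json"):
        try:
            package = json.loads(ckan_file.read_text())
        except (OSError, json.JSONDecodeError):
            continue  # skip unreadable or malformed metadata files
        for resource in package.get("resources", []):
            url = resource.get("url")
            if url:
                yield url

if __name__ == "__main__":
    for url in iter_resource_urls():
        print(url)
```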

@jonnycrunch

Does this really need to be >300TB? After looking at the data, there is a lot of redundancy: the same data appears in CSV, HTML, and JSON. Does only one organization have to load the entire 300TB? Most of the data can be broken up into categories like 'health', 'environment', and 'agriculture', and is composed of heterogeneous files (typically a few hundred MB per file). The metadata describing the data would be most important (Publisher, Identifier, modified date, etc.).

@mejackreed

We have the CKAN metadata already. And yes, I agree some of the data is redundant, based on how ArcGIS OpenData allows for different types of exports. A smarter heuristic for this would be nice, but may take some more analysis time.

@flyingzumwalt
Contributor Author

@mejackreed do you think you will need help writing the download scripts or running them? We can probably find people to help you.

@mejackreed

Sure thing. Help definitely wanted! I already have a naive downloader here: https://github.com/mejackreed/GovScooper/blob/master/README.md#usage

@flyingzumwalt
Contributor Author

cc @jbenet @gsf @b5

@b5

b5 commented Jan 18, 2017

Happy to help!

I think it makes sense to first decide whether or not to download in passes, using metadata to cut down on data redundancy (as per @jonnycrunch's suggestion), or to just grab the whole thing in one go. I'd personally vote for the "passes" approach, but only after first checking that the data is truly redundant.

@mejackreed

mejackreed commented Jan 18, 2017

Yep, I have an idea on how to evaluate whether or not the data is redundant. Resources that come from a server whose URL contains arcgis.com and that have .geojson + .csv + .kml exports are usually just transformations of the same data. We need a way to recognize these types of datasets/resources and codify the heuristics.

An example: https://github.com/OpenGeoMetadata/gov.data/blob/8f440134f13e7559086e7a07b8081098198c9a18/ad/01/6d/50/3d/38/4b/50/bc/b9/e5/62/2f/d7/c0/1b/ad016d503d384b50bcb9e5622fd7c01b/ckan.json
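Something like the following sketch could codify it. This is only a hypothetical illustration: it keys on the CKAN `format` field and an assumed format priority, while the example above keys on file extensions, so the check may need to move to URL suffixes instead.

```python
from urllib.parse import urlparse

# Assumed priority when the same layer is exported in several formats.
PREFERRED_FORMATS = ["geojson", "csv", "kml"]

def deduplicate_arcgis_resources(resources):
    """Given a CKAN package's resources, drop redundant ArcGIS export formats."""
    arcgis = [r for r in resources
              if "arcgis.com" in urlparse(r.get("url", "")).netloc]
    formats = {r.get("format", "").lower() for r in arcgis}
    if not {"geojson", "csv", "kml"} <= formats:
        return resources  # not the redundant-export pattern; keep everything
    # Keep all non-ArcGIS resources plus a single preferred-format export.
    keep = next(r for fmt in PREFERRED_FORMATS for r in arcgis
                if r.get("format", "").lower() == fmt)
    return [r for r in resources if r not in arcgis] + [keep]
```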

@jonnycrunch

There are 194422 distinct entries in the catalog. The metadata is about 2GB.

https://catalog.data.gov/api/3/action/package_search?rows=1&start=0

Here is an example of one entry:
https://catalog.data.gov/api/3/action/package_show?id=1e68f387-5f1c-46c0-a0d1-46044ffef5bf
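For anyone scripting against that endpoint, a minimal paging sketch (the `rows`/`start` parameters and the `result.results` layout are the standard CKAN 3 API; the server may cap the page size):

```python
import json
import urllib.request

BASE = "https://catalog.data.gov/api/3/action/package_search"

def iter_packages(page_size=1000):
    """Yield every package record by paging through the CKAN search API."""
    start = 0
    while True:
        url = f"{BASE}?rows={page_size}&start={start}"
        with urllib.request.urlopen(url) as resp:
            result = json.load(resp)["result"]
        if not result["results"]:
            break
        yield from result["results"]
        start += page_size

if __name__ == "__main__":
    print(sum(1 for _ in iter_packages()), "packages seen")
```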

Each entry has a resource list.

A first pass could be to hit all of the URLs in each resource list and grab the 'Content-Length' headers to calculate the exact amount of space needed, while simultaneously gathering all of the necessary resource URLs.
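A rough sketch of that pass, assuming a flat list of resource URLs (e.g. from the ckan.json files) is already in hand; as noted further down in this thread, many servers return no Content-Length, so the list of misses matters as much as the total:

```python
import urllib.request

def estimate_total_size(urls, timeout=10):
    """HEAD each URL; return (sum of Content-Length bytes, URLs without one)."""
    total, missing = 0, []
    for url in urls:
        req = urllib.request.Request(url, method="HEAD")
        try:
            with urllib.request.urlopen(req, timeout=timeout) as resp:
                length = resp.headers.get("Content-Length")
        except Exception:
            length = None  # unreachable or erroring resource
        if length and length.isdigit():
            total += int(length)
        else:
            missing.append(url)  # needs a follow-up pass to size it
    return total, missing
```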

There are also some metadata schema resources referenced in the 'extras' section that would be important to grab: https://project-open-data.cio.gov/v1.1/schema/catalog.jsonld

@jonnycrunch

Mmm, earlier there were 194422 entries; now there are only 194401. Now I understand the urgency!

@b5

b5 commented Jan 18, 2017

+1 for hitting all resources for Content-Length. I'd add grabbing the filetype while we're at it. Quick browsing showed some of the resources listed were .zip archives (ugh).

@mejackreed

So in my initial tests of downloading these resources, many of them unfortunately do not return a Content-Length header. Hoping to kick off some larger runs this afternoon to get more details.

@mejackreed

@jonnycrunch 194014 entries here: https://github.com/OpenGeoMetadata/gov.data

Best to grab the archive.zip; the layers.json file is easy to parse.
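For example (a sketch only; the layout inside archive.zip and the exact shape of layers.json are assumptions to check against the actual repository):

```python
import json
import zipfile
from pathlib import Path

# Extract the repository archive, then locate layers.json wherever it sits.
with zipfile.ZipFile("archive.zip") as zf:
    zf.extractall("gov.data-archive")

layers_file = next(Path("gov.data-archive").rglob("layers.json"))
layers = json.loads(layers_file.read_text())
print(f"{len(layers)} layers listed in {layers_file}")
```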
