
# Implement script that can populate the DB from Open Referral-compliant CSV files. #243

Closed · volkanunsal opened this issue Oct 12, 2014 · 12 comments

@volkanunsal (Contributor)

One of the main challenges I've run into when importing data into new Ohana installations is the layers of abstraction I have to jump through to get datasets from flat CSV files into the nested JSON file that Ohana expects.

What if we made this easier? Why not make the expected input a denormalized version of the database schema, so maintainers don't have to transform their data into any intermediary formats?

In addition, it's common for service providers to organize their data around services rather than organizations. So why does the import script assume that organizations will be at the top level?

My ideal import file would be a CSV where each row represents a single service, with the script taking care of the joins.

For example:

| service.name | service.organization_name | service.organization.location.name |
| --- | --- | --- |
| Home and health service | Red Cross | Shelter XYX |
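
A rough sketch of the importer I have in mind, in Ruby (the model and column names here are illustrative assumptions, not the actual Ohana schema):

```ruby
require 'csv'

# Each row is one service; the script resolves the joins itself.
CSV.foreach('services.csv', headers: true) do |row|
  org = Organization.find_or_create_by!(name: row['service.organization_name'])
  location = org.locations.find_or_create_by!(name: row['service.organization.location.name'])
  location.services.create!(name: row['service.name'])
end
```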
@monfresh (Member)

Thanks for your comment, and sorry for the confusion. The current import script is based on the way we transformed San Mateo County's data from the MARC 21 format to JSON. At the time, converting to JSON made the most sense since we were working with just a flat file and had to create the schema from that file.

Now that we are developing a Human Services Data Specification, aka Open Referral, the goal is to replace the current script with one that takes Open Referral-compliant CSV files and populates the DB from them.

The spec is only at version 0.2 at the moment. 1.0 is scheduled to be released on January 5, 2015. Overall, I don't think the schema will change dramatically between now and then, but some of the fields within each table (or CSV file) might. So, to start experimenting with this new CSV-based import script, the first step would be to obtain a demo set of CSV files, which I think @spara was working on providing.

I will rename this Issue to represent the need for a new import script based on Open Referral CSV files.

@monfresh changed the title from "[RFC] Making data import easy" to "Implement script that can populate the DB from Open Referral-compliant CSV files." on Oct 13, 2014
@monfresh (Member)

Working on this now. Hope to have it done by the end of the week.

@volkanunsal (Contributor, Author)

👍

@monfresh (Member)

Here is a set of valid CSV files for reference: https://docs.google.com/a/codeforamerica.org/spreadsheets/d/1qkysLj5MAAvaysuac_9H0hOvTAb1aJEMhUVJAECmj2c/edit#gid=0

The next step is to write a script that parses the CSV files and imports them into the DB.
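
One possible shape for that script, as a sketch only (the file names follow the spec layout, but the columns, models, and ID mapping are assumptions):

```ruby
require 'csv'

# Import in dependency order so foreign keys resolve:
# organizations, then locations, then services.
org_ids = {}
CSV.foreach('data/organizations.csv', headers: true) do |row|
  org = Organization.create!(name: row['name'])
  org_ids[row['id']] = org.id # map the CSV's IDs to the DB's IDs
end

loc_ids = {}
CSV.foreach('data/locations.csv', headers: true) do |row|
  loc = Location.create!(name: row['name'],
                         organization_id: org_ids.fetch(row['organization_id']))
  loc_ids[row['id']] = loc.id
end

CSV.foreach('data/services.csv', headers: true) do |row|
  Service.create!(name: row['name'],
                  location_id: loc_ids.fetch(row['location_id']))
end
```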

@volkanunsal (Contributor, Author)

A couple of thoughts on this.

Short feedback loop: you shouldn't have to wait until the end of the script to see errors. Maybe the way to go is to stop the import process when an error occurs, giving the user time to correct it before resuming.

API endpoint to bulk upload data: this would enable something I've heard a lot of requests for, namely bulk updates/uploads from the admin interface. It could also be very useful in the "welcome" interface.
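
Very roughly, such an endpoint might look like this (the route, controller name, and parameters are all hypothetical, and a real version would need access control):

```ruby
require 'csv'

# Hypothetical bulk-upload endpoint, e.g. mounted at
# POST /api/locations/:location_id/services/bulk
class BulkUploadsController < ApplicationController
  def create
    location = Location.find(params[:location_id])
    CSV.parse(params[:file].read, headers: true).each do |row|
      location.services.create!(name: row['name'])
    end
    head :created
  rescue ActiveRecord::RecordInvalid => e
    render json: { error: e.message }, status: :unprocessable_entity
  end
end
```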

@monfresh (Member)

> Short feedback loop: you shouldn't have to wait until the end of the script to see errors. Maybe the way to go is to stop the import process when an error occurs, giving the user time to correct it before resuming.

How would this work when you're reading from CSV files?

> API endpoint to bulk upload data: this would enable something I've heard a lot of requests for, namely bulk updates/uploads from the admin interface.

What kind of bulk updates? What kind of bulk uploads can't be performed with the import script?

> It could also be very useful in the "welcome" interface.

Why can't the "welcome" interface use the CSV-to-DB import script?

@volkanunsal (Contributor, Author)

> How would this work when you're reading from CSV files?

I started a prototype of this the other day. I can dig it up tomorrow, but the basic idea was to use a JSON stream parser to consume chunks of the file without buffering the whole thing, so you can stop at any point to provide feedback and resume again. I think my plan was then to use the Google Docs API, which can give very detailed info about where the data is coming from; if a row generates an error, you can point the user to the row number so they can edit it. You might be able to do this with an uploaded CSV file too, but the error/feedback/edit loop wouldn't be as swift.
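
For a plain CSV, something like this might approximate the idea: read row by row, checkpoint progress, and stop at the first error so a rerun resumes where it left off (the checkpoint mechanism is just an assumption):

```ruby
require 'csv'

CHECKPOINT = 'import.checkpoint'

# Resume from the last successfully imported line, if any.
start_at = File.exist?(CHECKPOINT) ? File.read(CHECKPOINT).to_i : 0

# with_index(2): line 1 of the file is the header row.
CSV.foreach('services.csv', headers: true).with_index(2) do |row, line|
  next if line <= start_at

  service = Service.new(name: row['name'])
  unless service.save
    # Stop at the first bad row so the user can fix it, then rerun.
    warn "Error on line #{line}: #{service.errors.full_messages.join(', ')}"
    exit 1
  end
  File.write(CHECKPOINT, line.to_s)
end
```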

> What kind of bulk uploads can't be performed with the import script?

Updates to data that have some sort of access control on them, like an organization bulk uploading services and locations, or an app developer updating those services on the organization's behalf.

> Why can't the "welcome" interface use the CSV-to-DB import script?

I guess it could. It doesn't need to be part of the API, because it doesn't require any sort of access control check and would only be done by super_admin users.

@volkanunsal (Contributor, Author)

The stream parser is probably overkill. I remember that's where I got stuck, but as an alternative, you might be able to save the row where the error happened and then modify the request to Google Docs using the row number as a parameter (not sure how to do that yet). That would lessen the annoyance of long wait times while the script catches up to where the error happened.

Here is a URL I had prepared for a mock spreadsheet import file:

[screenshot: mock spreadsheet import file]

@monfresh (Member)

What I'm working on right now is a simple solution that parses local CSV files, not Google Docs. If an error is encountered, it outputs the line number and the offending column to the command line.
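
Roughly along these lines (a sketch of the approach, not the actual script; it leans on ActiveModel validations to name the offending column):

```ruby
require 'csv'

# Validate each row and report the line number and offending column.
CSV.foreach('data/locations.csv', headers: true).with_index(2) do |row, line|
  location = Location.new(name: row['name'], description: row['description'])
  next if location.valid?

  location.errors.each do |column, message|
    puts "Line #{line}, column #{column}: #{message}"
  end
end
```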

@monfresh (Member)

Here is what I have so far: 925d44c

@monfresh (Member)

This is done now and the Wiki article has been updated: https://github.com/codeforamerica/ohana-api/wiki/Populating-the-Postgres-database-from-OpenReferral-compliant-CSV-files.

Anyone with OR-compliant CSV files can now import them into Ohana API.

@monfresh (Member)

Oh, and instructions for running the script are in the updated INSTALL.md: https://github.com/codeforamerica/ohana-api/blob/master/INSTALL.md#uploading-and-validating-your-own-data
