Implement script that can populate the DB from Open Referral-compliant CSV files. #243
Thanks for your comment, and sorry for the confusion. The current import script is based on the way we transformed San Mateo County's data from the MARC 21 format to JSON. At the time, converting to JSON made the most sense since we were working with just a flat file and had to create the schema from that file.

Now that we are developing a Human Services Data Specification, aka Open Referral, the goal is to replace the current script with one that takes Open Referral-compliant CSV files and populates the DB from them. The spec is only at version 0.2 at the moment. 1.0 is scheduled to be released on January 5, 2015. Overall, I don't think the schema will change dramatically between now and then, but some of the fields within each table (or CSV file) might.

So, to start experimenting with this new CSV-based import script, the first step would be to obtain a demo set of CSV files, which I think @spara was working on providing. I will rename this Issue to represent the need for a new import script based on Open Referral CSV files.
Working on this now. Hope to have it done by the end of the week.
👍
Here is a set of valid CSV files for reference: https://docs.google.com/a/codeforamerica.org/spreadsheets/d/1qkysLj5MAAvaysuac_9H0hOvTAb1aJEMhUVJAECmj2c/edit#gid=0

The next step is to write a script that parses the CSV files and imports them into the DB.
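To make that step concrete, here is a rough sketch of what a minimal parse-and-import pass could look like, assuming one ActiveRecord model per CSV file. The file path and column names below are placeholders for illustration, not the actual Open Referral field names.

```ruby
# Rough sketch only: read one Open Referral-style CSV file and create records.
# Assumes an ActiveRecord Organization model; the file path and column names
# are placeholders, not the final Open Referral fields.
require 'csv'

CSV.foreach('data/organizations.csv', headers: true) do |row|
  Organization.create!(
    name:        row['name'],
    description: row['description']
  )
end
```

The other files (locations, services, and so on) would presumably get a similar pass, in an order that satisfies their foreign-key dependencies.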
A couple of thoughts on this. Short feedback loop: you shouldn't need to wait until the end of the script to see the errors. Maybe the way to go is to stop the import process when an error occurs, giving the user time to correct it before resuming. API endpoint to bulk upload data: this would enable something I've heard a lot of requests for: bulk updates/uploads from the admin interface. It could also be very useful in the "welcome" interface.
How would this work when you're reading from CSV files?
What kind of bulk updates? What kind of bulk uploads that can't be performed with the import script?
Why can't the "welcome" interface use the CSV-to-DB import script?
I started a prototype of this the other day. I can look it up tomorrow, but the basic idea was to use a JSON stream parser to consume chunks of the file without buffering the whole thing, so you can stop at any point to provide feedback and resume again. I think my plan then was to use the Google Docs API, which can give very detailed info about where the data is coming from, and if it generates an error, you can point them to the row number so they can edit it. You might be able to use this with an uploaded CSV file, but the error/feedback/edit loop wouldn't be as swift.
Updates to data that have some sort of access control on them, like an organization bulk uploading services and locations, or an app developer updating said services for the organization.
I guess it could. It doesn't need to be part of the API because it doesn't require any access control check and can only be done by super_admin users.
The stream parser is probably overkill. I remember that's where I got stuck, but as an alternative, maybe you could save the row where the error happened and then modify the request to Google Docs using the row number as a parameter (not sure how to do that yet), which would lessen the annoyance of long wait times while the script catches up to where the error happened. Here is a URL I had prepared for a mock spreadsheet import file.
What I'm working on right now is a simple solution that parses local CSV files, not Google Docs. If an error is encountered, it outputs the line number and the column that has the error to the command line.
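As a rough illustration of that error-reporting loop (the model, file path, and attributes are placeholders, and this is not the actual script):

```ruby
# Sketch of the error-reporting idea: validate each row and report the
# offending line before moving on. Row 1 is the header, so data starts at line 2.
require 'csv'

CSV.foreach('data/locations.csv', headers: true).with_index(2) do |row, line_number|
  location = Location.new(name: row['name'], description: row['description'])
  if location.valid?
    location.save!
  else
    # Validation messages name the attribute, i.e. the column that failed.
    location.errors.full_messages.each do |message|
      puts "Line #{line_number}: #{message}"
    end
  end
end
```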
Here is what I have so far: 925d44c
This is done now and the Wiki article has been updated: https://github.com/codeforamerica/ohana-api/wiki/Populating-the-Postgres-database-from-OpenReferral-compliant-CSV-files. Anyone with OR-compliant CSV files can now import them into Ohana API.
Oh, and instructions for running the script are in the updated INSTALL.md: https://github.com/codeforamerica/ohana-api/blob/master/INSTALL.md#uploading-and-validating-your-own-data
One of the main challenges I am seeing in importing data into new Ohana installations is the layers of abstraction I have to jump through to get datasets from flat CSV files into the nested JSON file that Ohana expects.
What if we made this easier? Why not make the expected data format a denormalized version of the database schema, and not bother maintainers with transforming it into any intermediary data formats?
In addition, it's common for the service providers to organize their data around services, rather than organizations. So why does the import script assume the organizations will be at the top-level?
My ideal import file would be a CSV file where each row represents a single service, with the script taking care of the joins.
For example:
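Purely hypothetical column names and values, but something along these lines, where every row carries enough organization and location context for the script to resolve the joins:

```csv
organization_name,location_name,location_address,service_name,service_description
Example Agency,Downtown Office,"123 Main St, San Jose, CA",Food Pantry,Weekly groceries for families
Example Agency,Downtown Office,"123 Main St, San Jose, CA",Job Counseling,Resume help and interview prep
```

The script could then de-duplicate organizations and locations as it goes. A sketch, assuming the usual Ohana associations (an organization has many locations, a location has many services) and the made-up column names above:

```ruby
# Sketch of the join logic: the script, not the data publisher, resolves
# organizations and locations from each row-per-service record.
require 'csv'

CSV.foreach('data/services.csv', headers: true) do |row|
  org      = Organization.find_or_create_by!(name: row['organization_name'])
  location = org.locations.find_or_create_by!(name: row['location_name'])
  location.services.create!(
    name:        row['service_name'],
    description: row['service_description']
  )
end
```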