Tutorial or how-to on importing large CSVs #118
https://datasette.io/tutorials could grow a "how-to" or "recipes" section.
Great idea. I think there are 3 steps.

1. **Build the schema.** What I did (though there are probably better options): I imported my CSV "as is", then copied the `CREATE TABLE` statement that Datasette shows at the bottom of the table page, and edited it by hand to set the right field types. I have tried different tools to generate a good schema, but none of them worked well.
2. **Build the indexes.** Create a file with one index per line (a sketch follows the import script below).
3. **Then try to import in one command:**

```sh
time sqlite3 products_new.db <<EOS
/* Here we could add examples of PRAGMA optimisations */
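/* A hedged example of such PRAGMAs (assumed, not the commenter's exact settings) */
/* They trade crash-safety for speed: only use on a database you can rebuild */
PRAGMA journal_mode = OFF;   /* no rollback journal */
PRAGMA synchronous = OFF;    /* no fsync after each write */
PRAGMA cache_size = -64000;  /* ~64 MB page cache (negative value = KiB) */
PRAGMA temp_store = MEMORY;  /* keep temporary tables and indices in RAM */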
/* [...] */
/* Import schema */
.read create.schema
/* Options to be tuned */
/* .mode ascii or .mode csv */
.mode ascii
/* .separator is not necessary when .mode csv */
.separator "\t" "\n"
/* --skip 1 is only useful with .mode ascii, if there is a header */
.import --skip 1 en.openfoodfacts.org.products.csv all
/* At the end, create all the index */
.read index.schema
EOS
```
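The contents of `create.schema` and `index.schema` aren't shown above, but a minimal sketch of what they might look like follows (column names and types are hypothetical; in practice you'd start from the `CREATE TABLE` copied out of Datasette in step 1):

```sql
-- create.schema: hand-edited copy of Datasette's CREATE TABLE
-- (hypothetical columns; use your CSV's real headers)
CREATE TABLE "all" (
  code TEXT,
  product_name TEXT,
  brands TEXT,
  energy_100g REAL
);

-- index.schema: one CREATE INDEX statement per line
CREATE INDEX idx_all_code ON "all" (code);
CREATE INDEX idx_all_product_name ON "all" (product_name);
```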
About optimisation, the only thing that works for me for a big CSV (6.2 GB, 2.5 million rows, 185+ fields) is: It's open data, so anyone should be able to reproduce my test (needs ~18 GB):
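The filename in the import script above matches the Open Food Facts products export; a sketch of fetching it for reproduction (the download URL is an assumption based on that filename — verify on https://world.openfoodfacts.org/data):

```sh
# Fetch the ~6 GB Open Food Facts products export (URL assumed)
wget https://static.openfoodfacts.org/data/en.openfoodfacts.org.products.csv
```

Note that this export is tab-separated despite the `.csv` extension, which would explain why the script above uses `.mode ascii` with a `"\t"` separator rather than `.mode csv`.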
Using the 2.5GB CSV from this post may be good here: https://simonwillison.net/2022/Sep/29/webvid/
This came up on Discord: https://discord.com/channels/823971286308356157/823971286941302908/1017006129831739432
If you have a large (e.g. 9.5GB) CSV file it's not obvious how best to import it. `csvs-to-sqlite` tries to load the whole thing into RAM, which isn't ideal. `sqlite-utils insert` can stream it, which is better, but it's still quite slow. The best option is actually to create the table manually and then use `sqlite3 .import` to import the CSV, as described here: https://til.simonwillison.net/sqlite/import-csv - but it's not exactly obvious! Documentation could help here.
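For concreteness, a sketch of the three approaches (file, database, table and column names are placeholders):

```sh
# 1. csvs-to-sqlite: loads the whole CSV into RAM before writing,
#    so it struggles with multi-GB files
csvs-to-sqlite big.csv big.db

# 2. sqlite-utils insert: streams the file, but row-by-row inserts are slow
sqlite-utils insert big.db mytable big.csv --csv

# 3. sqlite3 .import: create the table manually, then bulk-import;
#    --skip 1 skips the header row (needs a reasonably recent sqlite3)
sqlite3 big.db <<'EOS'
.mode csv
CREATE TABLE mytable (id TEXT, name TEXT);
.import --skip 1 big.csv mytable
EOS
```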
This is actually more of a "how-to" in the https://diataxis.fr/ framework as opposed to a tutorial.