
Tutorial or how-to on importing large CSVs #118

Open · simonw opened this issue Sep 7, 2022 · 4 comments

Labels: documentation, tutorials

Comments


simonw commented Sep 7, 2022

This came up on Discord: https://discord.com/channels/823971286308356157/823971286941302908/1017006129831739432

If you have a large (e.g. 9.5GB) CSV file, it's not obvious how best to import it.

csvs-to-sqlite tries to load the whole thing into RAM, which isn't ideal. sqlite-utils insert can stream it, which is better, but it's still quite slow. The best option is actually to create the table manually and then use sqlite3 .import to import the CSV, as described here: https://til.simonwillison.net/sqlite/import-csv - but it's not exactly obvious!
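For reference, a minimal sketch of the two command-line routes mentioned above (database, table, and column names are placeholders; the TIL covers the .import route in detail):

# sqlite-utils streams the CSV, but is relatively slow:
sqlite-utils insert data.db mytable data.csv --csv

# Faster: create the table by hand, then use the sqlite3 CLI's .import:
sqlite3 data.db <<'EOS'
CREATE TABLE mytable (id TEXT, name TEXT, value TEXT);
.mode csv
.import --skip 1 data.csv mytable
EOS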

Documentation could help here.

This is actually more of a "how-to" in the https://diataxis.fr/ framework as opposed to a tutorial.

simonw added the documentation and tutorials labels on Sep 7, 2022

simonw commented Sep 7, 2022

https://datasette.io/tutorials could grow a "how-to" or "recipes" section.

simonw changed the title from "Tutorial or guide on importing large CSVs" to "Tutorial or how-to on importing large CSVs" on Sep 7, 2022
CharlesNepote commented
Great idea. I think there are three steps.

1. Build the schema.

What I did (there are probably better options): I imported my CSV "as is", used Datasette to copy the CREATE TABLE statement it shows at the end of the table page, and then edited it by hand to give each field the right type. I have tried different tools to generate a good schema, but none of them worked well. For example, our code field is often detected as an INTEGER when it isn't one (some codes begin with 0). A sketch of such a hand-edited schema appears after step 3 below.

2. Build the indexes.

Create a file with one index per line, e.g.:

CREATE INDEX ["all_countries_en"] ON [all]("countries_en");

3. Then run the whole import as one command:

time sqlite3 products_new.db <<EOS
/* Here we could add examples of PRAGMA optimisations */
/* [...] */
/* Import the schema */
.read create.schema
/* Options to be tuned */
/* .mode ascii or .mode csv */
.mode ascii
/* .separator is not necessary with .mode csv */
.separator "\t" "\n"
/* --skip 1 skips the header row; needed here because the table already exists */
.import --skip 1 en.openfoodfacts.org.products.csv all
/* At the end, create all the indexes */
.read index.schema
EOS
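To illustrate step 1, here is a hypothetical excerpt of what the hand-edited create.schema could look like (every column except code and countries_en is invented for illustration; the point is forcing code to TEXT so leading zeros survive):

-- create.schema (hypothetical excerpt)
CREATE TABLE [all] (
    [code] TEXT,          -- type-detection tools guess INTEGER, but some codes begin with 0
    [product_name] TEXT,
    [countries_en] TEXT,  -- the column indexed by index.schema above
    [energy_100g] REAL
);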

CharlesNepote commented
About optimisation: the only thing that worked for me with a big CSV (6.2GB, 2.5 million rows, 185+ fields) is:

PRAGMA page_size = 32768;

Before: CSV 6.2GB => time 3m2s => DB 9.5GB
After: CSV 6.2GB => time 2m3s => DB 6.6GB
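The article linked in the script below lists further pragmas that are commonly used to speed up bulk inserts. A sketch of the usual candidates; they trade durability for speed, so they are only sensible for a throwaway import database that can be rebuilt from the CSV:

PRAGMA journal_mode = OFF;       -- no rollback journal; a failed import means starting over
PRAGMA synchronous = 0;          -- skip fsync calls; a crash mid-import can corrupt the file
PRAGMA cache_size = 1000000;     -- much larger page cache (positive value is in pages)
PRAGMA locking_mode = EXCLUSIVE; -- acquire the file lock once instead of per transaction
PRAGMA temp_store = MEMORY;      -- build temporary structures in RAM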

It's open data, so anyone should be able to reproduce my test (it needs ~18GB of disk space):

wget https://static.openfoodfacts.org/data/en.openfoodfacts.org.products.csv
CREATE=`curl -s -L https://gist.githubusercontent.com/CharlesNepote/80fb813a416ad445fdd6e4738b4c8156/raw/032af70de631ff1c4dd09d55360f242949dcc24f/create.sql`
INDEX=`curl -s -L https://gist.githubusercontent.com/CharlesNepote/80fb813a416ad445fdd6e4738b4c8156/raw/032af70de631ff1c4dd09d55360f242949dcc24f/index.sql`
time sqlite3 products_new.db <<EOS
/* Optimisations. See: https://avi.im/blag/2021/fast-sqlite-inserts/ */
PRAGMA page_size = 32768;
$CREATE
.mode ascii
.separator "\t" "\n"
.import --skip 1 en.openfoodfacts.org.products.csv all
$INDEX
EOS


simonw commented Sep 30, 2022

Using the 2.5GB CSV from this post might be a good fit here: https://simonwillison.net/2022/Sep/29/webvid/
