
Tutorial or how-to on importing large CSVs #118

Open · simonw opened this issue Sep 7, 2022 · 4 comments

Labels: documentation, tutorials

Comments


simonw commented Sep 7, 2022

This came up on Discord: https://discord.com/channels/823971286308356157/823971286941302908/1017006129831739432

If you have a large (e.g. 9.5GB) CSV file, it's not obvious how best to import it.

csvs-to-sqlite tries to load the whole thing into RAM, which isn't ideal. sqlite-utils insert can stream it, which is better, but it's still quite slow. The best option is actually to create the table manually and then use sqlite3 .import to import the CSV, as described here: https://til.simonwillison.net/sqlite/import-csv - but it's not exactly obvious!
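For reference, a minimal sketch of the two command-line routes mentioned above (database, table, and column names are placeholders; the TIL covers the .import route in detail):

# sqlite-utils streams the CSV, but is relatively slow:
sqlite-utils insert data.db mytable data.csv --csv

# Faster: create the table by hand, then use the sqlite3 CLI's .import:
sqlite3 data.db <<'EOS'
CREATE TABLE mytable (id TEXT, name TEXT, value TEXT);
.mode csv
.import --skip 1 data.csv mytable
EOS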

Documentation could help here.

This is actually more of a "how-to" in the https://diataxis.fr/ framework as opposed to a tutorial.

simonw added the documentation and tutorials labels on Sep 7, 2022

simonw commented Sep 7, 2022

https://datasette.io/tutorials could grow a "how-to" or "recipes" section.

simonw changed the title from "Tutorial or guide on importing large CSVs" to "Tutorial or how-to on importing large CSVs" on Sep 7, 2022
CharlesNepote commented
Great idea. I think there are three steps.

1. Build the schema.

What I did (there are probably better options): I imported my CSV "as is", used Datasette to copy the CREATE TABLE statement it shows at the end of the table page, and then edited it by hand to give each field the right type. I have tried different tools to generate a good schema, but none of them worked well. For example, our code field is often detected as an INTEGER when it isn't one (some codes begin with 0). A sketch of such a hand-edited schema appears after step 3 below.

2. Build the indexes.

Create a file with one index per line, e.g.:

CREATE INDEX ["all_countries_en"] ON [all]("countries_en");

3. Then run the whole import as one command:

time sqlite3 products_new.db <<EOS
/* Here we could add examples of PRAGMA optimisations */
/* [...] */
/* Import the schema */
.read create.schema
/* Options to be tuned */
/* .mode ascii or .mode csv */
.mode ascii
/* .separator is not necessary with .mode csv */
.separator "\t" "\n"
/* --skip 1 skips the header row; needed here because the table already exists */
.import --skip 1 en.openfoodfacts.org.products.csv all
/* At the end, create all the indexes */
.read index.schema
EOS
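To illustrate step 1, here is a hypothetical excerpt of what the hand-edited create.schema could look like (every column except code and countries_en is invented for illustration; the point is forcing code to TEXT so leading zeros survive):

-- create.schema (hypothetical excerpt)
CREATE TABLE [all] (
    [code] TEXT,          -- type-detection tools guess INTEGER, but some codes begin with 0
    [product_name] TEXT,
    [countries_en] TEXT,  -- the column indexed by index.schema above
    [energy_100g] REAL
);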

CharlesNepote commented
About optimisation: the only thing that worked for me with a big CSV (6.2GB, 2.5 million rows, 185+ fields) is:

PRAGMA page_size = 32768;

Before: CSV 6.2GB => time 3m2s => DB 9.5GB
After: CSV 6.2GB => time 2m3s => DB 6.6GB
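The article linked in the script below lists further pragmas that are commonly used to speed up bulk inserts. A sketch of the usual candidates; they trade durability for speed, so they are only sensible for a throwaway import database that can be rebuilt from the CSV:

PRAGMA journal_mode = OFF;       -- no rollback journal; a failed import means starting over
PRAGMA synchronous = 0;          -- skip fsync calls; a crash mid-import can corrupt the file
PRAGMA cache_size = 1000000;     -- much larger page cache (positive value is in pages)
PRAGMA locking_mode = EXCLUSIVE; -- acquire the file lock once instead of per transaction
PRAGMA temp_store = MEMORY;      -- build temporary structures in RAM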

It's open data, so anyone should be able to reproduce my test (it needs ~18GB of disk space):

wget https://static.openfoodfacts.org/data/en.openfoodfacts.org.products.csv
CREATE=`curl -s -L https://gist.githubusercontent.com/CharlesNepote/80fb813a416ad445fdd6e4738b4c8156/raw/032af70de631ff1c4dd09d55360f242949dcc24f/create.sql`
INDEX=`curl -s -L https://gist.githubusercontent.com/CharlesNepote/80fb813a416ad445fdd6e4738b4c8156/raw/032af70de631ff1c4dd09d55360f242949dcc24f/index.sql`
time sqlite3 products_new.db <<EOS
/* Optimisations. See: https://avi.im/blag/2021/fast-sqlite-inserts/ */
PRAGMA page_size = 32768;
$CREATE
.mode ascii
.separator "\t" "\n"
.import --skip 1 en.openfoodfacts.org.products.csv all
$INDEX
EOS


simonw commented Sep 30, 2022

Using the 2.5GB CSV from this post might be a good fit here: https://simonwillison.net/2022/Sep/29/webvid/
