Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance seems way below expectations #251

Closed
Themanwithoutaplan opened this issue Jun 26, 2015 · 35 comments
Closed

Performance seems way below expectations #251

Themanwithoutaplan opened this issue Jun 26, 2015 · 35 comments
Labels

Comments

@Themanwithoutaplan
Copy link

I have a relatively simple import from CSV to Postgres that seems to be running quite slow. The import is of about 500000 rows with around 70 columns, mainly integers and four indices. The import currently takes around 5 hours. My hard disk will happily do over 10 MB/s. Running pgbench, at which I'm not an expert, suggests TPS between 70 and 200. Even with the lower number I'd expect the important to take about half-an-hour.

When running the import it does, indeed seem as a lot of time is spent in pgloader rather than in Postgres.

What should I be looking at to improve performance?

@dimitri
Copy link
Owner

dimitri commented Jun 26, 2015

How many rows where rejected during the load? if none, then try using COPY directly and report its timing for comparison purpose. If more than zero, then try to COPY out the loaded data in a clean CSV file then COPY IN again from that clean file, and report the timing.

I'm interested into making pgloader as fast as possible of course, but your case will need quite some more information before anything useful can be attempted...

@Themanwithoutaplan
Copy link
Author

No rows were rejected. Running copy seems to give similar results with disk speed rarely getting above 100 kB/s so I guess the problem is related to the nature of the data and/or server configuration. Any idea what I should be looking at?

@Themanwithoutaplan
Copy link
Author

From further experimentation it looks like it the indices that are throttling performance. Dropping them and the load is done in less than a minute.

@Themanwithoutaplan
Copy link
Author

Looks like a text index is the real bottleneck here. My other indices and single trigger hardly seem to matter.

For the docs it might be worth noting that if performance varies widely from what pg_bench suggests then indices could be a bottleneck.

table name read imported errors time
fetch 0 0 0 0.013
before load 1 1 0 0.045s
pages 482966 482966 0 1m37.037s
after load 1 1 0 2m16.884s
Total import time 482966 482966 0 3m53.979s

@dimitri
Copy link
Owner

dimitri commented Jun 27, 2015

Oh, yeah, never bulk load data with indexes present, remove them before loading, add them again at the end, which is what pgloader does for the database-like sources and when targeting an empty table.

I should maybe add an option like disable triggers that would drop then create the indexes again at the end of the run.

@Themanwithoutaplan
Copy link
Author

I'm not sure if all triggers should be disabled – I happen to use one for this particular case to convert stupid dates as strings to real dates – and the real hit is only around the text index which I need to help normalise a stupidly denormalised source in the after-load clause.

But consistent performance across modes would definitely make sense.

@dimitri
Copy link
Owner

dimitri commented Jun 27, 2015

You could normalize your input right within pgloader, several examples of date mungling are given already as transformation functions. See #245 (comment) for a full example of that.

So maybe just a warning about indexes and triggers being present on the target table with potential impact on loading performances would be in order...

@Themanwithoutaplan
Copy link
Author

Good to know that this can be done in the loading script. The normalisation, however, can't. It would involve upserting some data from the source, getting the relevant foreign key and substituting it… Going from 5 hours to a minute + a few minutes to recreate the index is the big win. I've tried, and failed, to get the source normalised.

dimitri added a commit that referenced this issue Jul 16, 2015
Pre-existing indexes will reduce data loading performances and it's
generally better to DROP the index prior to the load and CREATE them
again once the load is done. See #251 for an example of that.

In that patch we just add a WARNING against the situation, the next
patch will also add support for a new WITH clause option allowing to
have pgloader take care of the DROP/CREATE dance around the data
loading.
@dimitri
Copy link
Owner

dimitri commented Jul 16, 2015

You may try the drop indexes option to have the dance being operated by pgloader all automatically now.

@Themanwithoutaplan
Copy link
Author

Thanks, but it looks like it may need some work:
2015-07-17T12:18:20.398000+02:00 ERROR Database error 2BP01: cannot drop index pages_date_url because constraint pages_date_url on table pages requires it HINT: You can drop constraint pages_date_url on table pages instead. QUERY: DROP INDEX pages_date_url; 275MiB 0:22:51 [ 206kiB/s] [ <=> ] 2015-07-17T13:02:06.615000+02:00 ERROR Database error 42P07: relation "pages_date_url" already exists

@dimitri
Copy link
Owner

dimitri commented Jul 17, 2015

Thanks for the feedback, I only cared that way about primary key indexes, I didn't do the general constraint case (UNIQUE, EXCLUDE). Will add that as soon as possible, sorry about that.

@dimitri dimitri reopened this Jul 17, 2015
@dimitri
Copy link
Owner

dimitri commented Jul 17, 2015

Should be good now.

@Themanwithoutaplan
Copy link
Author

Seems to be working much better, thank you. If only building on MacOS without Brew was easier! Even just support for make install would be great.

@dimitri
Copy link
Owner

dimitri commented Jul 17, 2015

If you're in a position to tell me what's missing and how to make it simpler, please open an issue about that and we'll see if we can improve the situation here.

@Themanwithoutaplan
Copy link
Author

Reopening #161 would probably be best.

@dimitri
Copy link
Owner

dimitri commented Jul 18, 2015

So you're saying it's a problem with finding the shared objects (.so) files?

@Themanwithoutaplan
Copy link
Author

Are the indices being created twice? FWIW manual index creation varies significantly between the two indices here. The index of the date field is quick to recreate and also doesn't impose much of an overhead if kept when importing. The big bottleneck here is the text index.

table name read imported errors time
fetch 0 0 0 0.012s
drop indexes 1 1 0 0.073s
pages 483669 483669 0 1m50.449s
Index Build Completion 0 0 0 17m8.347s
Create Indexes 1 1 0 17m7.864s
Primary Keys 0 0 0 0.000s
after load 1 1 0 1m23.428s
Total import time 483669 483669 0 37m30.173s

@dimitri
Copy link
Owner

dimitri commented Jul 18, 2015

Seems like I've been a tad too lazy and the Index Building time isn't properly accounted for as a parallel background task. It's doubly counted but should appear only once, will fix later.

@Themanwithoutaplan
Copy link
Author

I don't think you can be called lazy! It's just not really your itch to scratch. Your work on this is much appreciated. Being able to work with this data in Postgres is so much nicer than MySQL.

dimitri added a commit that referenced this issue Jul 18, 2015
The new option 'drop indexes' reuses the existing code to build all the
indexes in parallel but failed to properly account for that fact in the
summary report with timings.

While fixing this, also fix the SQL used to re-establish the indexes and
associated constraints to allow for parallel execution, the ALTER TABLE
statements would block in ACCESS EXCLUSIVE MODE otherwise and make our
efforts vain.
@dimitri
Copy link
Owner

dimitri commented Jul 18, 2015

Looks better now, was kind of a worm hole really, because respecting the pg_dump way of doing things was too naive to allow for the kind of parallelism that was expected. Add some MySQL compatibility issues and the quick hack now takes a couple hours. Ah well, it should be all ok now!

Thanks for your continued reports, that helps make a better software.

@Themanwithoutaplan
Copy link
Author

I'm still seeing similar times for Index Build Completion and Create Indexes. Is this to be expected?

@dimitri
Copy link
Owner

dimitri commented Jul 20, 2015

Well yes. pgloader starts as many CREATE INDEX process as you have indexes to build against a single table in parallel, and then waits for all the thread to be done. The Create Indexes section counts the time it took to create the indexes in total while the Build Completion section counts how much time we had to still wait when all the other things to do were already finished.

This double accounting of sorts is more relevant in the loading from a database scenario where often enough we don't have to actually wait much for the indexes, because most of them have been already created in parallel during the other tables loading.

Maybe I should review using the same time categories in the report for single-table loading here, it looks like I shared too much code...

@Themanwithoutaplan
Copy link
Author

Looks like there are still some gremlins in this. Running again with a new dataset and it looks like the script is confused by constraints which it created last time.

2015-07-24T17:04:53.874000+02:00 ERROR Database error 42704: constraint "pages_urlshort_labeldate_key1" of relation "pages" does not exist
QUERY: ALTER TABLE pages DROP CONSTRAINT pages_urlShort_labelDate_key1;
2015-07-24T17:04:53.874000+02:00 ERROR Database error 42704: constraint "pages_urlshort_labeldate_key" of relation "pages" does not exist
QUERY: ALTER TABLE pages DROP CONSTRAINT pages_urlShort_labelDate_key;
 276MiB 1:27:54 [53,7kiB/s] [          <=>                                                                                           ]
table name read imported errors time
fetch 0 0 0
drop indexes 2 0 2 0.176s
pages 483551 483551 0 3h19m19.264s
Index Build Completion 0 0 0 39m38.657s
Create Indexes 2 2 0 56m38.589s
Primary Keys 0 0 0 0.000s
after load 1 1 0 6m54.066s
Total import time 483551 483551 0 5h2m30.766s

@dimitri
Copy link
Owner

dimitri commented Jul 25, 2015

Can you please run the following query, it might be that some indexes have been only partly deleted and that we should then not worry about them here...

select indrelid::regclass, indisvalid, indcheckxmin, indisready, indislive,
       pg_get_indexdef(indexrelid)
  from pg_index
 where indrelid = 'pages'::regclass;

The other situation where I would expect your error messages is a concurrency issue where two pgloader are working in parallel against the table, thus one of those just deleted the indexes and constraint in the time between the other process having listed the indexes and wanting to delete them...

Is a concurrency issue possible in your use case?

@Themanwithoutaplan
Copy link
Author

indrelid indisvalid indcheckxmin indisready indislive pg_get_indexdef
pages t f t t CREATE UNIQUE INDEX "pages_urlShort_labelDate_key" ON pages USING btree ("urlShort", "labelDate")
pages t f t t CREATE UNIQUE INDEX "pages_urlShort_labelDate_key1" ON pages USING btree ("urlShort", "labelDate")
pages t f t t CREATE UNIQUE INDEX "pages_urlShort_labelDate_key2" ON pages USING btree ("urlShort", "labelDate")
pages t f t t CREATE UNIQUE INDEX "pages_urlShort_labelDate_key3" ON pages USING btree ("urlShort", "labelDate")

Not sure what you mean by not worrying about them. Because they're not being managed properly the load time goes up from a minute to over 3 hours and the indexes then take longer to recreate.

@dimitri
Copy link
Owner

dimitri commented Jul 25, 2015

Here's the query that pgloader uses to list constraints and indexes that need to be handled, can you run it for me and paste its output here?

select i.relname,
       indrelid::regclass,
       indrelid,
       indisprimary,
       indisunique,
       pg_get_indexdef(indexrelid),
       c.conname,
       pg_get_constraintdef(c.oid)
  from pg_index x
       join pg_class i ON i.oid = x.indexrelid
       left join pg_constraint c ON c.conindid = i.oid
 where indrelid = 'pages'::regclass

I though before that maybe the constraint definition that pgloader wanted to take care of where actually invalid or stray definitions, hence the errors, but it seems not to be that.

Are you running several pgloader commands at once?

@Themanwithoutaplan
Copy link
Author

No, only running a single command. Here's the result.

relname indrelid indrelid indisprimary indisunique pg_get_indexdef conname pg_get_constraintdef
pages_urlShort_labelDate_key1 pages 16945 f t CREATE UNIQUE INDEX "pages_urlShort_labelDate_key1" ON pages USING btree ("urlShort", "labelDate") pages_urlShort_labelDate_key1 UNIQUE ("urlShort", "labelDate")
pages_urlShort_labelDate_key pages 16945 f t CREATE UNIQUE INDEX "pages_urlShort_labelDate_key" ON pages USING btree ("urlShort", "labelDate") pages_urlShort_labelDate_key UNIQUE ("urlShort", "labelDate")
pages_urlShort_labelDate_key2 pages 16945 f t CREATE UNIQUE INDEX "pages_urlShort_labelDate_key2" ON pages USING btree ("urlShort", "labelDate") pages_urlShort_labelDate_key2 UNIQUE ("urlShort", "labelDate")
pages_urlShort_labelDate_key3 pages 16945 f t CREATE UNIQUE INDEX "pages_urlShort_labelDate_key3" ON pages USING btree ("urlShort", "labelDate") pages_urlShort_labelDate_key3 UNIQUE ("urlShort", "labelDate")

@dimitri
Copy link
Owner

dimitri commented Jul 25, 2015

I still don't understand the error messages on the constraint that doesn't exists, because the constraint is listed here. Now, why do you have 4 times the same index? 2 load attempts with the same error on DROP I presume?

@Themanwithoutaplan
Copy link
Author

pgloader is doing all the work so presumably it's getting something wrong when it tries to drop them and therefore goes on to create duplicates.

@dimitri
Copy link
Owner

dimitri commented Jul 25, 2015

Can you give me a reproducible test-case so that I can then fix this bug? An example is https://github.com/dimitri/pgloader/blob/master/test/csv-districts.load which still needs a data file, or if you can prepare one all-included take https://github.com/dimitri/pgloader/blob/master/test/csv-before-after.load as a base example.

@dimitri dimitri reopened this Jul 25, 2015
@Themanwithoutaplan
Copy link
Author

You can use the import script at https://bitbucket.org/charlie_x/python-httparchive/src/7f7d8a3cae1652a789096d3432e2eacbba65e05e/db/httparchive.load?at=default

The relevant Postgres schema is at https://bitbucket.org/charlie_x/python-httparchive/src/7f7d8a3cae1652a789096d3432e2eacbba65e05e/db/pages.sql?at=default

Data can be imported from http://httparchive.org/downloads.php (any CSV dump for pages after 2014-06-01).

Import gzip -d -c httparchive_Jul_15_2015_pages.csv.gz | pgloader db/httparchive.load

@dimitri
Copy link
Owner

dimitri commented Jul 26, 2015

Thanks for the complete use case, I could fix the issue at hand, namely pgloader trying to second guess the spelling of the constraint and indexes (down casing and normalizing them as if they came from a MySQL or SQLite database). The latest patch stops this madness by having the drop indexes related code path force quoting the names just as we got them.

Should be good now, in as much as I could reproduce your problem then fix it!

@Themanwithoutaplan
Copy link
Author

Second guessing is almost always the wrong way to go but sometimes there's no choice. FWIW you might appreciate some historical context behind this import: I tried and failed to get the original MySQL improved. It would have obviated the need to work around this bottleneck index: https://code.google.com/p/httparchive/issues/detail?id=65

Any progress on my other ticket about building on OS X with MacPorts?

@dimitri
Copy link
Owner

dimitri commented Jul 26, 2015

Thanks for the interesting context! It's also nice to see those .load files in another Open Source project ;-)

About #261 let's say that all this shared object dependency hell is some over my head. I also have #159 and #160 on my plate, more generally see Build System for a listing.

I need to find a proper way to make pgloader easier to install for everyone. What normally happens is that packagers show up and do the work for each distro, like I did for debian. It's yet to happen for other OSes apparently.

@Themanwithoutaplan
Copy link
Author

Well, in case I didn't make it clear enough: I failed in my attempt to get the schema cleaned up so I forked the site from PHP to Pyramid. Then I kept hitting MySQL's limitations so started to port to Postgres for my own reporting. I haven't got all the way to properly cloning the crawler and stats part…

What is interesting, however, is how fast the MySQL import is, even with indexes on. Of course, this is done at the cost of a table lock and schema changes are very painful: table has to be dumped, altered and imported. This often leads to the disk running full. You can see how this encourages the persistence of bad design decisions: schema changes are expensive; you won't be punished for not normalising the data.

Wish I could be more help with the build instructions but I'm afraid it's something I've got little experience with myself.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants