Performance seems way below expectations #251
Comments
How many rows were rejected during the load? If none, then try using COPY directly and report its timing for comparison. If more than zero, then try to COPY the loaded data out to a clean CSV file, then COPY it in again from that clean file, and report the timing. I'm interested in making pgloader as fast as possible of course, but your case will need quite a bit more information before anything useful can be attempted... |
No rows were rejected. Running copy seems to give similar results with disk speed rarely getting above 100 kB/s so I guess the problem is related to the nature of the data and/or server configuration. Any idea what I should be looking at? |
From further experimentation it looks like it is the indices that are throttling performance. With them dropped, the load is done in less than a minute. |
Looks like a text index is the real bottleneck here. My other indices and single trigger hardly seem to matter. For the docs it might be worth noting that if performance varies widely from what pgbench suggests then indices could be a bottleneck.
|
Oh, yeah, never bulk load data with indexes present: remove them before loading and add them again at the end, which is what pgloader does for the database-like sources and when targeting an empty table. I should maybe add an option like |
I'm not sure if all triggers should be disabled – I happen to use one for this particular case to convert stupid dates as strings to real dates – and the real hit is only around the text index which I need to help normalise a stupidly denormalised source in the after-load clause. But consistent performance across modes would definitely make sense. |
You could normalize your input right within pgloader; several examples of date munging are given already as transformation functions. See #245 (comment) for a full example of that. So maybe just a warning about indexes and triggers being present on the target table, with their potential impact on loading performance, would be in order... |
Good to know that this can be done in the loading script. The normalisation, however, can't. It would involve upserting some data from the source, getting the relevant foreign key and substituting it… Going from 5 hours to a minute + a few minutes to recreate the index is the big win. I've tried, and failed, to get the source normalised. |
Pre-existing indexes will reduce data loading performance and it's generally better to DROP the indexes prior to the load and CREATE them again once the load is done. See #251 for an example of that. In that patch we just add a WARNING against the situation; the next patch will also add support for a new WITH clause option that lets pgloader take care of the DROP/CREATE dance around the data loading.
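The DROP/CREATE dance described above can be sketched as a small statement planner. This is only an illustration, not pgloader's actual implementation: `IndexInfo`, `plan_load` and the column names in the usage note are hypothetical, and the definitions are assumed to have been saved beforehand (for instance from `pg_get_indexdef()` and `pg_get_constraintdef()`). The one PostgreSQL rule it relies on is that an index backing a constraint cannot be removed with DROP INDEX; the constraint has to be dropped instead.

```python
from typing import List, NamedTuple, Optional


class IndexInfo(NamedTuple):
    """One index on the target table (hypothetical shape, for illustration only)."""
    name: str                              # index name
    definition: str                        # full CREATE INDEX statement, saved before the drop
    constraint: Optional[str] = None       # constraint backed by this index, if any
    constraint_def: Optional[str] = None   # e.g. "PRIMARY KEY (pageid)"


def plan_load(table: str, indexes: List[IndexInfo], load_sql: str) -> List[str]:
    """Return the statements in execution order: drops, the load, then re-creates."""
    drops, creates = [], []
    for ix in indexes:
        if ix.constraint:
            # Dropping the constraint also removes the index that backs it.
            drops.append(f"ALTER TABLE {table} DROP CONSTRAINT {ix.constraint}")
            creates.append(
                f"ALTER TABLE {table} ADD CONSTRAINT {ix.constraint} {ix.constraint_def}"
            )
        else:
            drops.append(f"DROP INDEX {ix.name}")
            creates.append(ix.definition)
    return drops + [load_sql] + creates
```

Called with a primary key on a made-up `pageid` column and a plain index on a made-up `url` column, the plan comes out as: drop the constraint, drop the index, run the load, re-add the constraint, re-create the index. Identifiers are interpolated verbatim here, so mixed-case names would need quoting first.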
You may try the |
Thanks, but it looks like it may need some work: |
Thanks for the feedback. I only handled primary key indexes that way; I didn't cover the general constraint case (UNIQUE, EXCLUDE). Will add that as soon as possible, sorry about that. |
Should be good now. |
Seems to be working much better, thank you. If only building on MacOS without Brew was easier! Even just support for |
If you're in a position to tell me what's missing and how to make it simpler, please open an issue about that and we'll see if we can improve the situation here. |
Reopening #161 would probably be best. |
So you're saying it's a problem with finding the shared object (.so) files? |
Are the indices being created twice? FWIW manual index creation varies significantly between the two indices here. The index of the date field is quick to recreate and also doesn't impose much of an overhead if kept when importing. The big bottleneck here is the text index.
|
Seems like I've been a tad too lazy and the Index Building time isn't properly accounted for as a parallel background task. It's doubly counted but should appear only once, will fix later. |
I don't think you can be called lazy! It's just not really your itch to scratch. Your work on this is much appreciated. Being able to work with this data in Postgres is so much nicer than MySQL. |
The new option 'drop indexes' reuses the existing code to build all the indexes in parallel but failed to properly account for that fact in the summary report with timings. While fixing this, also fix the SQL used to re-establish the indexes and associated constraints to allow for parallel execution: the ALTER TABLE statements would otherwise block in ACCESS EXCLUSIVE MODE and make our efforts vain.
Looks better now. It was kind of a wormhole really, because respecting the pg_dump way of doing things was too naive to allow for the kind of parallelism that was expected. Add some MySQL compatibility issues and the quick hack now takes a couple of hours. Ah well, it should be all ok now! Thanks for your continued reports, they help make better software. |
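One way to keep constraint re-creation parallel-friendly, sketched here as an assumption rather than a description of what the patch actually emits, is to split the work in two. The slow part is a plain CREATE UNIQUE INDEX, which takes a SHARE lock and can therefore run alongside other index builds on the same table. The fast part is `ALTER TABLE ... ADD CONSTRAINT ... USING INDEX`, which PostgreSQL supports for PRIMARY KEY and UNIQUE constraints and which only needs its exclusive lock briefly. The function name and arguments below are hypothetical.

```python
from typing import Tuple


def split_constraint(table: str, constraint: str, kind: str,
                     index_def: str, index_name: str) -> Tuple[str, str]:
    """Split re-creating a constraint into (slow shareable build, quick exclusive attach)."""
    if kind not in ("PRIMARY KEY", "UNIQUE"):
        # EXCLUDE and other constraint kinds cannot be attached to an existing index.
        raise ValueError("USING INDEX only applies to PRIMARY KEY and UNIQUE constraints")
    build = index_def  # e.g. a CREATE UNIQUE INDEX statement, run in parallel with the others
    attach = (
        f"ALTER TABLE {table} ADD CONSTRAINT {constraint} {kind} "
        f"USING INDEX {index_name}"
    )
    return build, attach
```

All `build` statements for a table can then be issued concurrently, with the `attach` statements run once the builds have finished.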
I'm still seeing similar times for Index Build Completion and Create Indexes. Is this to be expected? |
Well yes. pgloader starts as many CREATE INDEX processes in parallel as you have indexes to build against a single table, and then waits for all the threads to be done. The Create Indexes section counts the time it took to create the indexes in total, while the Build Completion section counts how much time we still had to wait when all the other things to do were already finished. This double accounting of sorts is more relevant in the loading from a database scenario, where often enough we don't have to actually wait much for the indexes, because most of them have already been created in parallel while the other tables were loading. Maybe I should review using the same time categories in the report for single-table loading here, it looks like I shared too much code... |
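The two report lines can be illustrated with a little interval arithmetic. This is a sketch under a stated assumption, not pgloader's reporting code: "Create Indexes" is taken to be the wall-clock span from the first index build starting to the last one finishing, and "Build Completion" the time still spent waiting after all other work is done. The function name and the timings in the usage note are made up.

```python
from typing import List, Tuple


def index_report(builds: List[Tuple[float, float]],
                 other_work_done: float) -> Tuple[float, float]:
    """builds holds (start, end) times in seconds for each parallel CREATE INDEX."""
    first_start = min(start for start, _ in builds)
    last_end = max(end for _, end in builds)
    create_indexes = last_end - first_start                  # assumed: wall-clock span of all builds
    build_completion = max(0.0, last_end - other_work_done)  # wait left after everything else
    return create_indexes, build_completion
```

In a single-table load the builds only start once the data is in, so with a load finishing at 60 s and builds running from 60 s to at most 300 s, both figures are 240 s, which matches the similar times reported above. In a database load where builds run from 10 s to 100 s while other tables keep loading until 95 s, the figures are 90 s and 5 s.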
Looks like there are still some gremlins in this. Running again with a new dataset and it looks like the script is confused by constraints which it created last time.
|
Can you please run the following query? It might be that some indexes have been only partly deleted and that we should then not worry about them here...
select indrelid::regclass, indisvalid, indcheckxmin, indisready, indislive,
       pg_get_indexdef(indexrelid)
  from pg_index
 where indrelid = 'pages'::regclass;
The other situation where I would expect your error messages is a concurrency issue where two pgloader processes are working in parallel against the table, so that one of them deleted the indexes and constraints in the time between the other process listing the indexes and trying to delete them... Is a concurrency issue possible in your use case? |
Not sure what you mean by not worrying about them. Because they're not being managed properly the load time goes up from a minute to over 3 hours and the indexes then take longer to recreate. |
Here's the query that pgloader uses to list constraints and indexes that need to be handled. Can you run it for me and paste its output here?
select i.relname,
       indrelid::regclass,
       indrelid,
       indisprimary,
       indisunique,
       pg_get_indexdef(indexrelid),
       c.conname,
       pg_get_constraintdef(c.oid)
  from pg_index x
       join pg_class i ON i.oid = x.indexrelid
       left join pg_constraint c ON c.conindid = i.oid
 where indrelid = 'pages'::regclass;
I thought before that maybe the constraint definitions that pgloader wanted to take care of were actually invalid or stray definitions, hence the errors, but it seems not to be that. Are you running several pgloader commands at once? |
No, only running a single command. Here's the result.
|
I still don't understand the error messages about the constraint that doesn't exist, because the constraint is listed here. Now, why do you have the same index 4 times? 2 load attempts with the same error on DROP, I presume? |
pgloader is doing all the work so presumably it's getting something wrong when it tries to drop them and therefore goes on to create duplicates. |
Can you give me a reproducible test-case so that I can then fix this bug? An example is https://github.com/dimitri/pgloader/blob/master/test/csv-districts.load which still needs a data file, or if you can prepare one all-included take https://github.com/dimitri/pgloader/blob/master/test/csv-before-after.load as a base example. |
You can use the import script at https://bitbucket.org/charlie_x/python-httparchive/src/7f7d8a3cae1652a789096d3432e2eacbba65e05e/db/httparchive.load?at=default The relevant Postgres schema is at https://bitbucket.org/charlie_x/python-httparchive/src/7f7d8a3cae1652a789096d3432e2eacbba65e05e/db/pages.sql?at=default Data can be imported from http://httparchive.org/downloads.php (any CSV dump for pages after 2014-06-01). Import |
Thanks for the complete use case, I could fix the issue at hand, namely pgloader trying to second guess the spelling of the constraint and indexes (down casing and normalizing them as if they came from a MySQL or SQLite database). The latest patch stops this madness by having the Should be good now, in as much as I could reproduce your problem then fix it! |
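Why down-casing a name taken from the catalogs breaks DROP statements can be shown with a short sketch. PostgreSQL folds unquoted identifiers to lower case, so a mixed-case name read from the catalogs only round-trips if it is double-quoted. The helper names and the index name `Pages_URL_idx` are hypothetical, and this is not the code from the patch.

```python
def fold_unquoted(name: str) -> str:
    """Approximate what an unquoted identifier resolves to (ASCII names only)."""
    return name.lower()


def quote_ident(name: str) -> str:
    """Double-quote an identifier so it is used exactly as spelled in the catalogs."""
    return '"' + name.replace('"', '""') + '"'


catalog_name = "Pages_URL_idx"  # hypothetical mixed-case index name

# Unquoted (or down-cased) spelling: the server looks for pages_url_idx and fails.
broken = f"DROP INDEX {fold_unquoted(catalog_name)}"

# Quoted spelling: refers to the index exactly as stored.
working = f"DROP INDEX {quote_ident(catalog_name)}"
```

A failed DROP followed by a successful CREATE is also one plausible way to end up with several copies of the same index, as seen earlier in the thread.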
Second guessing is almost always the wrong way to go but sometimes there's no choice. FWIW you might appreciate some historical context behind this import: I tried and failed to get the original MySQL improved. It would have obviated the need to work around this bottleneck index: https://code.google.com/p/httparchive/issues/detail?id=65 Any progress on my other ticket about building on OS X with MacPorts? |
Thanks for the interesting context! It's also nice to see those .load files in another Open Source project ;-) About #261, let's say that all this shared object dependency hell is somewhat over my head. I also have #159 and #160 on my plate; more generally, see Build System for a listing. I need to find a proper way to make pgloader easier to install for everyone. What normally happens is that packagers show up and do the work for each distro, like I did for Debian. It's yet to happen for other OSes, apparently. |
Well, in case I didn't make it clear enough: I failed in my attempt to get the schema cleaned up so I forked the site from PHP to Pyramid. Then I kept hitting MySQL's limitations so started to port to Postgres for my own reporting. I haven't got all the way to properly cloning the crawler and stats part… What is interesting, however, is how fast the MySQL import is, even with indexes on. Of course, this is done at the cost of a table lock and schema changes are very painful: table has to be dumped, altered and imported. This often leads to the disk running full. You can see how this encourages the persistence of bad design decisions: schema changes are expensive; you won't be punished for not normalising the data. Wish I could be more help with the build instructions but I'm afraid it's something I've got little experience with myself. |
I have a relatively simple import from CSV to Postgres that seems to be running quite slowly. The import is of about 500000 rows with around 70 columns, mainly integers, and four indices. The import currently takes around 5 hours. My hard disk will happily do over 10 MB/s. Running pgbench, at which I'm not an expert, suggests TPS between 70 and 200. Even with the lower number I'd expect the import to take about half an hour.
When running the import it does indeed seem as if a lot of time is spent in pgloader rather than in Postgres.
What should I be looking at to improve performance?