Fix performance issues with delayed_jobs by extending index #3324

FloThinksPi · 2023-06-27T14:02:01Z

We found that when the delayed_jobs table contains many rows, the PostgreSQL query planner performs a sequential scan of the table. The cause is the query that cc_workers use to fetch new jobs:

```
SELECT *
FROM "delayed_jobs"
WHERE (((("run_at" <= '2023-06-27 07:18:28.061781+0000')
AND ("locked_at" IS NULL))
(.....)
ORDER BY "priority" ASC, "run_at" ASC
LIMIT 1 FOR UPDATE;
```

Because the query orders by `priority`, the existing index over the columns `queue, locked_at, locked_by, failed_at, run_at` is not used.

This is all the more severe because the query becomes more expensive as the row count grows, and it runs very often: every worker issues this SELECT every few seconds. This can quickly become a load problem for the database.
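One way to check which plan PostgreSQL picks is to run the query through `EXPLAIN`. The following is only a sketch: the WHERE clause is reduced to the two conditions visible in the query above (the elided conditions are left out), and `now()` stands in for the timestamp the worker passes. Because `ANALYZE` actually executes the statement and `FOR UPDATE` locks the selected row, the sketch wraps everything in a transaction that is rolled back.

```sql
-- Sketch under the assumptions named above; not the exact worker query.
BEGIN;

EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM delayed_jobs
WHERE run_at <= now()          -- stand-in for the worker's timestamp
  AND locked_at IS NULL        -- further conditions are elided in the PR text
ORDER BY priority ASC, run_at ASC
LIMIT 1
FOR UPDATE;

-- Release the row lock taken by FOR UPDATE.
ROLLBACK;
```

A plan node reading `Seq Scan on delayed_jobs` corresponds to the behaviour described here, while an `Index Scan` node naming an index shows that the planner chose that index instead.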

This change adds the `priority` column to the existing index `delayed_jobs_reserve`. In our tests this significantly improved query times and reduced database load.
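In plain SQL, the index change could look roughly like the sketch below. It is not the migration from this PR: the position of `priority` within the column list is an assumption (it is appended last here), and dropping and recreating the index inside one transaction is just one possible approach.

```sql
-- Sketch only; column order and the drop/recreate approach are assumptions.
BEGIN;

-- Remove the existing index over the WHERE-clause columns.
DROP INDEX IF EXISTS delayed_jobs_reserve;

-- Recreate it with the ORDER BY column "priority" included.
CREATE INDEX delayed_jobs_reserve
  ON delayed_jobs (queue, locked_at, locked_by, failed_at, run_at, priority);

COMMIT;
```

A plain `CREATE INDEX` blocks writes to the table while the index is built. On a large, busy table `CREATE INDEX CONCURRENTLY` may be preferable, but it cannot run inside a transaction block.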

I have reviewed the contributing guide
I have viewed, signed, and submitted the Contributor License Agreement
I have made this pull request to the main branch
I have run all the unit tests using bundle exec rake
I have run CF Acceptance Tests

With PR #3324 [1] the 'delayed_jobs_reserve' index was changed by adding the field 'priority', which is used in the ORDER BY clause. This change re-adds the previous index, which contains only the fields used in the WHERE clause of the query. Although PostgreSQL could always use the new index, there seem to be situations where the query planner decides on a sequential table scan (theory: when the number of entries is rather low).

[1] #3324

Co-authored-by: Dimitar Velinov

* More aggressive cleanup of failed delayed jobs

Currently, failed delayed_jobs are deleted after 14 days (configurable) to keep some information about failed jobs that helps with debugging: https://github.com/cloudfoundry/cloud_controller_ng/blob/main/app/jobs/runtime/failed_jobs_cleanup.rb

This can still lead to a very large number of delayed_jobs records, which slows down DB queries on delayed_jobs (also addressed by an index, ccng #3324). The idea is to add an absolute limit on the number of failed jobs so that they get deleted even before the 14-day failed_jobs.cutoff_age_in_days is reached. The limit should be configurable.

Change the start_frequent_jobs method to always use all configured parameters except frequency_in_seconds, and change expiration_in_seconds from a positional to a keyword parameter.

* remove optional parameter from configs

cf-gitbot added the unscheduled label Jun 27, 2023

FloThinksPi force-pushed the fix-expensive-delayed_jobs-selects branch from e4c29fc to 840a6e1 Compare June 28, 2023 06:33

philippthun approved these changes Jun 28, 2023

FloThinksPi force-pushed the fix-expensive-delayed_jobs-selects branch from 840a6e1 to 03dc9de Compare June 28, 2023 09:49

FloThinksPi merged commit 395a177 into main Jul 6, 2023

FloThinksPi deleted the fix-expensive-delayed_jobs-selects branch July 6, 2023 13:57

kathap mentioned this pull request Jul 11, 2023

More aggressive cleanup of failed delayed jobs #3346

Merged

5 tasks

philippthun mentioned this pull request Jul 26, 2023

Add delayed_jobs_reserve_where index #3358

Merged

5 tasks
