Fix performance issues with delayed_jobs by extending index #3324

Merged (1 commit) on Jul 6, 2023

FloThinksPi
Member

We found that when the delayed_jobs table holds many rows, the PostgreSQL query planner falls back to a sequential scan of the table. The cause is the query that cc_workers use to fetch new jobs:

SELECT *
FROM "delayed_jobs"
WHERE (((("run_at" <= '2023-06-27 07:18:28.061781+0000')
AND ("locked_at" IS NULL))
(.....)
ORDER BY "priority" ASC, "run_at" ASC
LIMIT 1 FOR UPDATE;

Because the query orders by priority, the index over the columns queue, locked_at, locked_by, failed_at, run_at is not used.

This is all the more severe because the query gets more expensive as the row count grows and it runs very often: every worker issues this select every few seconds. This can quickly become a load problem for the database.

This change adds the priority column to the existing index delayed_jobs_reserve. In our tests this significantly improved query times and reduced database load.
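
To illustrate why the sort column matters for a `LIMIT 1` query, here is a minimal in-memory sketch in Ruby. It is only an analogy, not how PostgreSQL works internally, and it does not reflect the actual column order of the index. The `Job` struct and both method names are hypothetical, and only the two WHERE conditions visible in the query above are modelled; the elided ones are left out.

```ruby
# Hypothetical stand-in for a delayed_jobs row (only the columns used here).
Job = Struct.new(:id, :priority, :run_at, :locked_at, keyword_init: true)

# The two conditions visible in the query; the elided ones are not modelled.
def runnable?(job, now)
  job.run_at <= now && job.locked_at.nil?
end

# Without an ordering that matches ORDER BY: every row is inspected and
# every match is compared before the single winner is known.
def next_job_by_scanning(jobs, now)
  jobs.select { |job| runnable?(job, now) }
      .min_by { |job| [job.priority, job.run_at] }
end

# With rows already kept in (priority, run_at) order: the first runnable
# row is the answer, so the walk can stop early.
def next_job_from_ordered(ordered_jobs, now)
  ordered_jobs.find { |job| runnable?(job, now) }
end

now = Time.utc(2023, 6, 27, 7, 18, 28)
jobs = [
  Job.new(id: 1, priority: 5, run_at: now - 60, locked_at: nil),
  Job.new(id: 2, priority: 0, run_at: now - 10, locked_at: now - 5),
  Job.new(id: 3, priority: 1, run_at: now - 30, locked_at: nil),
  Job.new(id: 4, priority: 1, run_at: now - 90, locked_at: nil)
]
ordered = jobs.sort_by { |job| [job.priority, job.run_at] }

puts next_job_by_scanning(jobs, now).id
puts next_job_from_ordered(ordered, now).id
```

Both calls pick the same job; the difference is how many rows have to be touched to find it, which is the cost that grows with the table.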

  • I have reviewed the contributing guide

  • I have viewed, signed, and submitted the Contributor License Agreement

  • I have made this pull request to the main branch

  • I have run all the unit tests using bundle exec rake

  • I have run CF Acceptance Tests

@FloThinksPi FloThinksPi force-pushed the fix-expensive-delayed_jobs-selects branch from e4c29fc to 840a6e1 Compare June 28, 2023 06:33
@FloThinksPi FloThinksPi force-pushed the fix-expensive-delayed_jobs-selects branch from 840a6e1 to 03dc9de Compare June 28, 2023 09:49
@FloThinksPi FloThinksPi merged commit 395a177 into main Jul 6, 2023
@FloThinksPi FloThinksPi deleted the fix-expensive-delayed_jobs-selects branch July 6, 2023 13:57
kathap added a commit to sap-contributions/cloud_controller_ng that referenced this pull request Jul 11, 2023
Currently, failed delayed_jobs are deleted after 14 days (configurable), which keeps some information about failed jobs around for debugging: https://github.com/cloudfoundry/cloud_controller_ng/blob/main/app/jobs/runtime/failed_jobs_cleanup.rb

This can still lead to a very large number of delayed_jobs records, which slows down DB queries on delayed_jobs (also addressed by an index, ccng cloudfoundry#3324). The idea is to add an absolute limit on the number of failed jobs so that they are deleted even before the 14-day failed_jobs.cutoff_age_in_days is reached. The limit should be configurable.
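
The combination of an age cutoff and an absolute limit described above could look roughly like the following Ruby sketch. It is not the code from the referenced commit: the `FailedJob` struct, the method name, and the `max_failed_jobs` parameter are hypothetical, and deleting the oldest failures first when over the limit is an assumption. Only `cutoff_age_in_days` is taken from the text.

```ruby
# Hypothetical stand-in for a failed delayed_jobs row.
FailedJob = Struct.new(:id, :failed_at, keyword_init: true)

SECONDS_PER_DAY = 24 * 60 * 60

# Returns the failed jobs a cleanup run would delete.
# cutoff_age_in_days mirrors failed_jobs.cutoff_age_in_days from the text;
# max_failed_jobs is a hypothetical name for the absolute limit (nil = no limit).
def failed_jobs_to_delete(failed_jobs, now:, cutoff_age_in_days:, max_failed_jobs: nil)
  cutoff = now - (cutoff_age_in_days * SECONDS_PER_DAY)
  expired, recent = failed_jobs.partition { |job| job.failed_at < cutoff }
  return expired if max_failed_jobs.nil? || recent.size <= max_failed_jobs

  # Over the limit: also drop the oldest remaining failures (assumed policy)
  # so that only the newest max_failed_jobs are kept.
  overflow = recent.sort_by(&:failed_at).first(recent.size - max_failed_jobs)
  expired + overflow
end

now = Time.utc(2023, 7, 11)
failed = [20, 10, 5, 1].each_with_index.map do |days_ago, index|
  FailedJob.new(id: index + 1, failed_at: now - (days_ago * SECONDS_PER_DAY))
end

doomed = failed_jobs_to_delete(failed, now: now, cutoff_age_in_days: 14, max_failed_jobs: 2)
puts doomed.map(&:id).inspect
```

With the limit left at `nil`, the sketch falls back to the age-only behaviour described in the first paragraph.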
philippthun added a commit that referenced this pull request Jul 26, 2023
With PR #3324 [1], the 'delayed_jobs_reserve' index was changed by adding the field 'priority', which is used in the ORDER BY clause. This change re-adds the previous index, which contains only the fields used in the WHERE clause of the query. Although PostgreSQL could always use the new index, there seem to be situations where the query planner chooses a sequential table scan instead (theory: when the number of entries is rather low).

[1] #3324

Co-authored-by: Dimitar Velinov <[email protected]>
kathap added a commit to sap-contributions/cloud_controller_ng that referenced this pull request Jul 27, 2023
Change the start_frequent_jobs method to always use all configured parameters except frequency_in_seconds, and change expiration_in_seconds from a positional to a keyword parameter.
johha pushed a commit that referenced this pull request Jul 31, 2023
* More aggressive cleanup of failed delayed jobs

Change the start_frequent_jobs method to always use all configured parameters except frequency_in_seconds, and change expiration_in_seconds from a positional to a keyword parameter.

* remove optional parameter from configs