Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Improve the reliability of alerts & reports #25239

Conversation

jfrag1
Copy link
Member

@jfrag1 jfrag1 commented Sep 8, 2023

SUMMARY

The goal of this PR is to make the alerts & reports scheduler job more reliable. See #25058 for a great description of one current reliability issue.

This PR includes two main changes to improve reliability

Adjusts the center of the window used to check cron schedules.

Currently, the following logic defines the window:

start_at = time_now - timedelta(seconds=1)
stop_at = time_now + timedelta(seconds=window_size) # window_size = 59

The consequence is that if more than a second had elapsed since the scheduler job was originally triggered, any alerts & reports supposed to be triggeredwould not be. This PR moves the center of the window 30 seconds towards the past, giving more leeway for delayed jobs

Reads current time based on when celery beat triggered the scheduler job, rather than having the scheduler job generate the current time.

As I was making the above change, I realized that if the celery queue was backed up, or there was any kind of delay between when celery beat placed the job into the queue and when it was executed, alerts & reports could be missed. It would be preferable if we could simply use the time at which celery beat triggered the job. I found that this was not currently supported natively by celery, but there is a workaround using the expires option. As noted on the celery repo, this is not a perfect solution, but I see it as much more reliable than the current approach.

For example, if a report is supposed to be sent at 12:00, but the celery queue is backed up such that the scheduler job triggered by beat isn't executed until 12:02, this change will make it so that now the report will be triggered, whereas currently it would not be.

Fixes #25058

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

TESTING INSTRUCTIONS

ADDITIONAL INFORMATION

  • Has associated issue:
  • Required feature flags:
  • Changes UI
  • Includes DB Migration (follow approval process in SIP-59)
    • Migration is atomic, supports rollback & is backwards-compatible
    • Confirm DB migration upgrade and downgrade tested
    • Runtime estimates and downtime expectations provided
  • Introduces new feature or API
  • Removes existing feature or API

@john-bodley john-bodley self-requested a review September 11, 2023 16:56
@jfrag1 jfrag1 marked this pull request as ready for review September 13, 2023 00:56
@jfrag1 jfrag1 changed the title fix: Adjust the cron schedule window for alerts & reports fix: Improve the reliability of alerts & reports Sep 13, 2023
superset/config.py Outdated Show resolved Hide resolved


@pytest.mark.parametrize(
"current_dttm, cron, expected",
[
("2020-01-01T00:59:01Z", "0 1 * * *", []),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we removing the tests which use the zone designator for the zero UTC offset?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They're not removed, I just changed the timestamp format to ISO (zero UTC offset represented by +00:00). The format used is inconsequential to the cron_schedule_window function being tested, since it's just passed a datetime object.


import pytest
import pytz
from dateutil import parser
from freezegun import freeze_time
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this used elsewhere? If not we should remove the freezegun library.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's still used elsewhere

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I love freezegun, hands off @john-bodley! 😄

@jfrag1 jfrag1 closed this Sep 13, 2023
@jfrag1 jfrag1 reopened this Sep 13, 2023
stop_at = time_now + timedelta(seconds=window_size)
time_now = triggered_at.astimezone(tz)
start_at = time_now - timedelta(seconds=window_size / 2)
stop_at = time_now + timedelta(seconds=window_size / 2)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If using triggered_time which could be 2 mins ago, this stop_at(30 seconds) will be preventing the schedule from being executed right? see line 47.

I think we just need to base start_at 1 seconds before the triggered_at. Reasons being cron expression can support second level granularity. If cron job is every second and start at is 30 seconds ago, we will actually execute 30 tasks before the actual start at.

And base stop_at window_size after current_time, although I am not sure why we need that stop_at at all.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the stop_at will prevent anything we want to run from running. Say the job is triggered at 12:00, but run at 12:02, and the cron says the job should run at 12:00. start_at will be ~11:59:30, and stop_at will be ~12:00:30, so it will correctly find the 12:00 schedule since it's in the window.

Also I don't believe cron supports second level granularity, at least not that I've seen in Superset. IMO for Superset's use cases, getting the reliability right is much more important than supporting that level of granularity

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you're right. Cron is at minute level. and stop_at compares to triggered_at 😂

Copy link

@zephyring zephyring left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Comment on lines +43 to +44
start_at = time_now - timedelta(seconds=window_size / 2)
stop_at = time_now + timedelta(seconds=window_size / 2)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The window approach seems fundamentally brittle to me. Why can't we set start_at to the last time the task was scheduled to run instead, and keep iterating until we hit the current time?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this could lead to a situation where an alert/report is triggered more times that desired. For example, if a report runs at 12:00:00, and the celery queue is backed up from 11:55:00-12:00:01, all 5-6 queued scheduler jobs could run one after another, and each would trigger the report.

With the approach on this PR, the windows in this situation would be 11:54:30-11:55:30, 11:55:30-11:56:30, etc., so it should ensure we don't miss any scheduled alerts & reports or run any more times than desired.

Let me know if I'm missing something/there's something I'm not considering

@villebro
Copy link
Member

As I was making the above change, I realized that if the celery queue was backed up, or there was any kind of delay between when celery beat placed the job into the queue and when it was executed, alerts & reports could be missed.

@jfrag1 I actually ran into this exact issue, and after investigating it further, I found out that unfortunately Celery only supports passing static variables from the scheduler to the worker 🙁 So I wasn't able to come up with a clean way of solving this. However, I think one alternative solution could be as follows:

  1. the scheduler only triggers reports to be started at some interval. Let's say once per minute
  2. this task is picked up by a worker, and it would acquire a distributed lock.
  3. Instead of using the current time, it would check from the key value store when alerts were last executed, and then execute any reports that are in the interval between last execution time and now.
  4. After this the last run would be updated, and the lock would be returned.

If another worker tries to start reports at the same time, they would silently go away, as the lock would already be taken. This would ensure the following:

  1. No reports would be missed - if the queue would be clogged up, at some point a report scheduling task would get through, and then it would be able to clear the backlog of reports
  2. duplicates would not happen, as the distributed lock would ensure that only one worker does report scheduling at a time

Thoughts @jfrag1 @zephyring ? I have code that I use internally for distributed locks on the Superset key_value store (we use it in another context, but it's been rock solid for the last 1 year or so), so I can collab on this if needed.

@jfrag1
Copy link
Member Author

jfrag1 commented Sep 15, 2023

Thanks for the input @villebro! What you outlined does sound like the cleanest solution, but I'm not sure how to implement it myself, so if you're willing to collab that'd be great.

I found out that unfortunately Celery only supports passing static variables from the scheduler to the worker

I discovered the same while looking into this, but the expires solution in this PR is kind of a workaround way to have beat pass the time it placed the job into the queue to the worker. Of course it's not perfect, but do you feel it's "good enough" for this situation?

Copy link
Member

@villebro villebro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for this clever fix to a long standing problem!

Comment on lines +52 to +53
datetime.fromisoformat(scheduler.request.expires)
- app.config["CELERY_BEAT_SCHEDULER_EXPIRES"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not your fault, but having to do it like this just makes me cringe.. 😄 I'm hoping someone can add request.scheduled_at to Celery 😉


import pytest
import pytz
from dateutil import parser
from freezegun import freeze_time
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I love freezegun, hands off @john-bodley! 😄

@eschutho eschutho merged commit f672d5d into apache:master Sep 19, 2023
@eschutho eschutho deleted the jack/adjust-alerts-and-reports-cron-schedule-window branch September 19, 2023 16:40
@michael-s-molina michael-s-molina added the v3.0 Label added by the release manager to track PRs to be included in the 3.0 branch label Sep 19, 2023
eschutho pushed a commit to Superset-Community-Partners/superset that referenced this pull request Sep 21, 2023
michael-s-molina pushed a commit that referenced this pull request Sep 25, 2023
saghatelian added a commit to 10webio/superset that referenced this pull request Oct 23, 2023
* fix: is_select with UNION (apache#25290)

(cherry picked from commit bb002d6)

* fix: Add explicit ON DELETE CASCADE for dashboard_roles (apache#25320)

(cherry picked from commit d54e827)

* fix(chart): Supporting custom SQL as temporal x-axis column with filter (apache#25126)

Co-authored-by: Kamil Gabryjelski <[email protected]>

* fix: Use RLS clause instead of ID for cache key (apache#25229)

(cherry picked from commit fba66c6)

* fix: Improve the reliability of alerts & reports (apache#25239)

(cherry picked from commit f672d5d)

* fix: DashboardRoles cascade operation (apache#25349)

(cherry picked from commit a971a28)

* fix: datetime with timezone excel export (apache#25318)

Co-authored-by: Michael S. Molina <[email protected]>
(cherry picked from commit 5ebcd2a)

* fix: Workaround for Cypress ECONNRESET error (apache#25399)

(cherry picked from commit d76ff39)

* fix(sqllab): invalid persisted tab state (apache#25308) (apache#25398)

* fix: Rename on_delete parameter to ondelete (apache#25424)

(cherry picked from commit 893b45f)

* fix: preventing save button from flickering in SQL Lab (apache#25106)

(cherry picked from commit 296ff17)

* fix: chart import (apache#25425)

(cherry picked from commit a4d8f36)

* fix: swagger UI CSP error (apache#25368)

(cherry picked from commit 1716b9f)

* fix: smarter date formatter (apache#25404)

(cherry picked from commit f0080f9)

* fix(sqllab): invalid start date (apache#25437)

* fix(nativeFilters): Speed up native filters by removing unnecessary rerenders (apache#25282)

Co-authored-by: JUST.in DO IT <[email protected]>
(cherry picked from commit a0eeb4d)

* fix(SqlLab): make icon placement even (apache#25372)

(cherry picked from commit 11b49a6)

* fix: Duplicate items when pasting into Select (apache#25447)

(cherry picked from commit 7cf96cd)

* fix: update the SQLAlchemy model definition at json column for Log table (apache#25445)

(cherry picked from commit e83a76a)

* fix(helm chart): set chart appVersion to 3.0.0 (apache#25373)

* fix(mysql): handle string typed decimal results (apache#24241)

(cherry picked from commit 7eab59a)

* fix: Styles not loading because of faulty CSP setting (apache#25468)

(cherry picked from commit 0cebffd)

* fix(sqllab): error with lazy_gettext for tab titles (apache#25469)

(cherry picked from commit ddde178)

* fix: Address Mypy issue which is causing CI to fail (apache#25494)

(cherry picked from commit 36ed617)

* chore: Adds 3.0.1 CHANGELOG

* fix: Unable to sync columns when database or dataset name contains `+` (apache#25390)

(cherry picked from commit dbe0838)

* fix(sqllab): Broken query containing 'children' (apache#25490)

(cherry picked from commit b92957e)

* chore: Expand error detail on screencapture (apache#25519)

(cherry picked from commit ba541e8)

* fix: tags permissions error message (apache#25516)

(cherry picked from commit 50b0816)

* fix: Apply normalization to all dttm columns (apache#25147)

(cherry picked from commit 58fcd29)

* fix: REST API CSRF exempt list (apache#25590)

(cherry picked from commit 549abb5)

* fix(RLS): Fix Info Tooltip + Button Alignment on RLS Modal (apache#25400)

(cherry picked from commit a6d0e6f)

* fix: thubmnails loading - Talisman default config (apache#25486)

(cherry picked from commit 52f631a)

* fix(Presto): catch DatabaseError when testing Presto views (apache#25559)

Co-authored-by: Rui Zhao <[email protected]>
(cherry picked from commit be3714e)

* fix(Charts): Set max row limit + removed the option to use an empty row limit value (apache#25579)

(cherry picked from commit f556ef5)

* fix(window): unavailable localStorage and sessionStorage (apache#25599)

* fix: finestTemporalGrainFormatter (apache#25618)

(cherry picked from commit 62bffaf)

* fix: revert fix(sqllab): Force trino client async execution (apache#24859) (apache#25541)

(cherry picked from commit e56e0de)

* chore: Updates 3.0.1 CHANGELOG

* fix(sqllab): Mistitled for new tab after rename (apache#25523)

(cherry picked from commit a520124)

* fix(sqllab): template validation error within comments (apache#25626)

(cherry picked from commit b370c66)

* fix: avoid 500 errors with SQLLAB_BACKEND_PERSISTENCE (apache#25553)

(cherry picked from commit 99f79f5)

* fix(import): Make sure query context is overwritten for overwriting imports (apache#25493)

(cherry picked from commit a0a0d80)

* fix: permalink save/overwrites in explore (apache#25112)

Co-authored-by: Elizabeth Thompson <[email protected]>
(cherry picked from commit e58a3ab)

* fix(header navlinks): link navlinks to path prefix (apache#25495)

(cherry picked from commit 51c56dd)

* fix: improve upload ZIP file validation (apache#25658)

* fix: warning of nth-child (apache#23638)

(cherry picked from commit 16cc089)

* fix(dremio): Fixes issue with Dremio SQL generation for Charts with Series Limit (apache#25657)

(cherry picked from commit be82657)

---------

Co-authored-by: Beto Dealmeida <[email protected]>
Co-authored-by: John Bodley <[email protected]>
Co-authored-by: Zef Lin <[email protected]>
Co-authored-by: Kamil Gabryjelski <[email protected]>
Co-authored-by: Jack Fragassi <[email protected]>
Co-authored-by: Michael S. Molina <[email protected]>
Co-authored-by: JUST.in DO IT <[email protected]>
Co-authored-by: Jack <[email protected]>
Co-authored-by: Daniel Vaz Gaspar <[email protected]>
Co-authored-by: Stepan <[email protected]>
Co-authored-by: Corbin Bullard <[email protected]>
Co-authored-by: Gyuil Han <[email protected]>
Co-authored-by: Celalettin Calis <[email protected]>
Co-authored-by: Ville Brofeldt <[email protected]>
Co-authored-by: ʈᵃᵢ <[email protected]>
Co-authored-by: Michael S. Molina <[email protected]>
Co-authored-by: mapledan <[email protected]>
Co-authored-by: Igor Khrol <[email protected]>
Co-authored-by: Rui Zhao <[email protected]>
Co-authored-by: Fabien <[email protected]>
Co-authored-by: Hugh A. Miles II <[email protected]>
Co-authored-by: OskarNS <[email protected]>
Comment on lines +52 to +53
datetime.fromisoformat(scheduler.request.expires)
- app.config["CELERY_BEAT_SCHEDULER_EXPIRES"]

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi I've been checking this code, how is this solving the issue, it select datetime.utcnow() because datetime.fromisoformat(scheduler.request.expires) - app.config["CELERY_BEAT_SCHEDULER_EXPIRES"] = 0
Can you help me understand @jfrag1 ??

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@piyushdatazip I'm not sure I fully understand what you're asking, but datetime.fromisoformat(scheduler.request.expires) is a datetime, and app.config["CELERY_BEAT_SCHEDULER_EXPIRES"] is a timedelta, so the result of this subtraction is a datetime, not 0.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jfrag1 for replying, my bad I've understood it wrongly at first instance

cccs-rc pushed a commit to CybercentreCanada/superset that referenced this pull request Mar 6, 2024
@mistercrunch mistercrunch added 🍒 3.0.1 🍒 3.0.2 🍒 3.0.3 🍒 3.0.4 🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels 🚢 3.1.0 labels Mar 8, 2024
vinothkumar66 pushed a commit to vinothkumar66/superset that referenced this pull request Nov 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels size/L v3.0 Label added by the release manager to track PRs to be included in the 3.0 branch 🍒 3.0.1 🍒 3.0.2 🍒 3.0.3 🍒 3.0.4 🚢 3.1.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Scheduled emails are randomly missed
9 participants