What problem would you like to solve? Please describe:
The Escalation Policy (EP) Manager currently processes all escalation policies and all active alerts in a single transaction (every 5 seconds). This includes:

- Updating `ep_step_on_call_users` by pulling in data from rotations (`rotation_state`) and schedules (`schedule_on_call_users`).
- Clearing the `maintenance_expires_at` field of services whose maintenance has expired.
- Updating `escalation_policy_state` for every non-closed alert (tracking its current step, next escalation time, etc.).
This “all-at-once” approach can cause issues if any single update fails, as it blocks every escalation policy from updating. As deployments grow in size and complexity, having a single massive transaction for all EP updates becomes inefficient and prone to failure.
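For illustration, a minimal Go sketch of the shape of such an all-at-once pass (the `EPManager` type and `updateAll` name are hypothetical stand-ins, not GoAlert's actual engine code):

```go
package engine

import (
	"context"
	"database/sql"
)

// EPManager is a hypothetical stand-in for the escalation-policy manager; this
// sketch only illustrates the current pattern described above.
type EPManager struct{ db *sql.DB }

// updateAll shows the shape of the "all-at-once" pass: a single transaction
// that touches every escalation policy, every open alert, and every service.
func (m *EPManager) updateAll(ctx context.Context) error {
	tx, err := m.db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer tx.Rollback()

	// 1. Refresh ep_step_on_call_users from rotation_state and schedule_on_call_users.
	// 2. Clear maintenance_expires_at on services whose maintenance window has passed.
	// 3. Advance escalation_policy_state for every non-closed alert.
	//
	// A failure in any one of these rolls back all of them, stalling every policy.

	return tx.Commit()
}
```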
Describe the solution you’d like:
- Job Queue Integration: Migrate EP updates to a job queue (the River Queue), following the same pattern used in other modules (sketched below).
- Fine-Grained Updates: Instead of updating all EPs and alerts together, schedule a job for each relevant EP or alert when it needs to be updated.
- Event-Driven & Future Scheduling:
  - When a rotation or schedule changes, or an alert progresses to a new step, enqueue a job to update the corresponding EP step data.
  - Each job can schedule a future update if a step is set to escalate at a specific time.
- Fallback for Missed Updates: Include a mechanism to detect and recover from missed or untracked changes (e.g., DB restores, crashes, older GoAlert versions), preventing alerts from getting “stuck” in the wrong escalation state.
- Isolated Transactions: Each EP update should run in its own transaction, so one failing update does not hold up all escalations.
- Scalability & Resilience: Breaking the work into discrete tasks lets us handle a larger number of policies and alerts without the risk of a single transaction blocking the entire escalation process.
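A minimal sketch of what this could look like with River's Go client (`github.com/riverqueue/river`); the job kind, `UpdateEPStateArgs`, `UpdateEPStateWorker`, and `enqueueEPUpdate` are hypothetical names, and the per-policy update logic is left as comments:

```go
package engine

import (
	"context"
	"time"

	"github.com/jackc/pgx/v5"
	"github.com/riverqueue/river"
)

// UpdateEPStateArgs is a hypothetical payload: one job per escalation policy
// instead of one transaction for everything.
type UpdateEPStateArgs struct {
	EscalationPolicyID string `json:"escalation_policy_id"`
}

func (UpdateEPStateArgs) Kind() string { return "update-ep-state" }

// UpdateEPStateWorker would own the per-policy update logic.
type UpdateEPStateWorker struct {
	river.WorkerDefaults[UpdateEPStateArgs]
}

// Work handles a single policy in its own transaction scope, so a failure
// here is retried for this policy only and never blocks other escalations.
func (w *UpdateEPStateWorker) Work(ctx context.Context, job *river.Job[UpdateEPStateArgs]) error {
	// 1. Recompute ep_step_on_call_users for this policy from rotation_state
	//    and schedule_on_call_users.
	// 2. Advance escalation_policy_state for this policy's open alerts.
	// 3. If a step escalates at a known future time, enqueue a follow-up job
	//    scheduled for that time (see enqueueEPUpdate below).
	return nil
}

// enqueueEPUpdate would be called when a rotation, schedule, or alert step
// changes; runAt lets a step schedule its own future escalation. UniqueOpts
// keeps at most one pending update per policy in the queue.
func enqueueEPUpdate(ctx context.Context, c *river.Client[pgx.Tx], epID string, runAt time.Time) error {
	_, err := c.Insert(ctx, UpdateEPStateArgs{EscalationPolicyID: epID}, &river.InsertOpts{
		ScheduledAt: runAt,
		UniqueOpts:  river.UniqueOpts{ByArgs: true},
	})
	return err
}
```

A periodic sweep job could be registered the same way to cover the missed-update fallback, re-enqueuing updates for any policy whose state looks stale.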
Describe alternatives you’ve considered:
- All-at-Once with Interval Tuning: Continue the current approach but adjust intervals or add partial batching. This still requires scanning all EPs and active alerts regularly, and can still cause blocking failures in large deployments.
- Batched Updates: Process subsets of EPs/alerts each cycle. While this reduces transaction size, it retains the shared-loop architecture and doesn’t eliminate the risk of a single batch failure affecting multiple policies.
- Hybrid Event Polling: Combine an event-driven approach with intermittent sweeps for missed updates. This adds complexity and can be less reliable than a unified, queue-driven design.
Additional context:
- Escalation Policy Manager Responsibilities:
  - On-Call Users: Mirrors how the schedules manager and rotations manager function, but specifically for escalation steps (`ep_step_on_call_users`).
  - Alert Tracking: Updates `escalation_policy_state` for alerts that are still open, moving them between steps and handling escalation timings.
  - Service Maintenance: Clears `maintenance_expires_at` for any service whose maintenance window has passed.
- Relation to Other Modules:
  - Rotations (`rotation_state`) and schedules (`schedule_on_call_users`) are being migrated to the queue first.
  - The EP Manager will build on that foundation, reducing overall complexity in the on-call workflow.
- Multiple Engine Instances: The job queue ensures only one instance processes a given job at a time, preventing the duplicate work and transaction conflicts seen in the existing “all modules on one loop” approach.
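A rough sketch of the client setup across engine instances, reusing the hypothetical `UpdateEPStateWorker` above (`startEPQueue` and the worker count are placeholders):

```go
package engine

import (
	"context"

	"github.com/jackc/pgx/v5"
	"github.com/jackc/pgx/v5/pgxpool"
	"github.com/riverqueue/river"
	"github.com/riverqueue/river/riverdriver/riverpgxv5"
)

// startEPQueue is a hypothetical setup: every engine instance runs an identical
// client against the same database, and River's job locking ensures any given
// EP-update job is worked by exactly one instance at a time.
func startEPQueue(ctx context.Context, pool *pgxpool.Pool) (*river.Client[pgx.Tx], error) {
	workers := river.NewWorkers()
	river.AddWorker(workers, &UpdateEPStateWorker{})

	client, err := river.NewClient(riverpgxv5.New(pool), &river.Config{
		Queues:  map[string]river.QueueConfig{river.QueueDefault: {MaxWorkers: 10}},
		Workers: workers,
	})
	if err != nil {
		return nil, err
	}
	return client, client.Start(ctx)
}
```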