Migrate Escalation Policy Manager to Use the New Job Queue System #4246

Open
mastercactapus opened this issue Jan 15, 2025 · 0 comments
Labels: enhancement (New feature or request), River

What problem would you like to solve? Please describe:
The Escalation Policy (EP) Manager currently processes all escalation policies and all active alerts in a single transaction (every 5 seconds). This includes:

  • Updating ep_step_on_call_users by pulling in data from rotations (rotation_state) and schedules (schedule_on_call_users).
  • Clearing the maintenance_expires_at field of services that have expired maintenance.
  • Updating escalation_policy_state for every non-closed alert (tracking its current step, next escalation time, etc.).

With this “all-at-once” approach, if any single update fails, the whole transaction fails and no escalation policy is updated in that cycle. As deployments grow in size and complexity, a single massive transaction for all EP updates becomes inefficient and prone to failure.
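
To make the failure mode concrete, here is a minimal in-memory sketch (not GoAlert code) that contrasts one shared all-or-nothing pass with one isolated unit of work per policy. The names `updateAllAtOnce`, `updateIsolated`, and `failOn` are hypothetical, and real code would wrap each unit in a database transaction instead of a function call.

```go
package main

import "errors"

// updateFn stands in for the work done for one escalation policy.
// It is a hypothetical placeholder, not a GoAlert function.
type updateFn func(policyID string) error

// updateAllAtOnce mimics a single shared transaction: the first error
// aborts the whole pass, so nothing is committed for any policy.
func updateAllAtOnce(ids []string, update updateFn) ([]string, error) {
	var committed []string
	for _, id := range ids {
		if err := update(id); err != nil {
			return nil, err // "roll back" everything done so far
		}
		committed = append(committed, id)
	}
	return committed, nil
}

// updateIsolated mimics one transaction per policy: a failure is recorded
// for that policy only, and the remaining policies are still updated.
func updateIsolated(ids []string, update updateFn) ([]string, map[string]error) {
	var committed []string
	failed := make(map[string]error)
	for _, id := range ids {
		if err := update(id); err != nil {
			failed[id] = err
			continue
		}
		committed = append(committed, id)
	}
	return committed, failed
}

var errBadPolicy = errors.New("update failed")

// failOn returns an updateFn that fails only for the given policy ID.
func failOn(bad string) updateFn {
	return func(id string) error {
		if id == bad {
			return errBadPolicy
		}
		return nil
	}
}
```

Calling both functions with `failOn("ep-2")` over three policy IDs shows the difference: the all-at-once pass commits nothing, while the isolated pass commits the two healthy policies and reports the one failure.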

Describe the solution you’d like:

  • Job Queue Integration: Migrate EP updates to a job queue (the River Queue), following the same pattern used in other modules.
    • Fine-Grained Updates: Instead of updating all EPs and alerts together, schedule a job for each relevant EP or alert when it needs to be updated.
    • Event-Driven & Future Scheduling:
      • When a rotation or schedule changes, or an alert progresses to a new step, enqueue a job to update the corresponding EP step data.
      • Each job can schedule future updates if a step is set to escalate at a specific time.
    • Fallback for Missed Updates: Include a mechanism to detect and recover from missed or untracked changes (e.g., DB restores, crashes, older GoAlert versions), preventing alerts from getting “stuck” in the wrong escalation state.
  • Isolated Transactions: Each EP update should run in its own transaction, ensuring one failing update does not hold up all escalations.
  • Scalability & Resilience: By breaking work into discrete tasks, we can handle a larger number of policies and alerts without the risk of a single transaction blocking the entire escalation process.

Describe alternatives you’ve considered:

  1. All-at-Once with Interval Tuning: Keep the current approach but tune the interval or add partial batching. This still requires regularly scanning all EPs and active alerts, and a failure can still block updates in large deployments.
  2. Batched Updates: Process subsets of EPs/alerts each cycle. While this reduces transaction size, it retains the shared loop architecture and doesn’t eliminate the risk of a single batch failure affecting multiple policies.
  3. Hybrid Event Polling: Combine an event-driven approach with intermittent sweeps for missed updates. This adds complexity and can be less reliable than a unified, queue-driven design.

Additional context:

  • Escalation Policy Manager Responsibilities:
    • On-Call Users: Mirrors how the schedules manager and rotations manager function, but specifically for escalation steps (ep_step_on_call_users).
    • Alert Tracking: Updates escalation_policy_state for alerts that are still open, moving them between steps and handling escalation timings.
    • Service Maintenance: Clears maintenance_expires_at for any service whose maintenance window has passed.
  • Relation to Other Modules:
    • Rotations (rotation_state) and schedules (schedule_on_call_users) are being migrated to the queue first.
    • The EP Manager will build on that foundation, reducing overall complexity in the on-call workflow.
  • Multiple Engine Instances: The job queue ensures only one instance processes a given job at a time, preventing duplicate work and transaction conflicts seen in the existing “all modules on one loop” approach.
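
The "only one instance processes a given job" guarantee can be illustrated with an in-memory claim table. This is a sketch under stated assumptions, not a description of how the job queue implements it: the `claims` type and its methods are hypothetical, and a Postgres-backed queue would typically enforce the claim in the database (a common technique is row locking with `SELECT ... FOR UPDATE SKIP LOCKED`) instead of using a process-local mutex.

```go
package main

import "sync"

// claims records which engine instance currently owns each job.
// It is a hypothetical in-memory stand-in for database-level locking.
type claims struct {
	mu    sync.Mutex
	owner map[string]string // job ID -> owning instance
}

func newClaims() *claims { return &claims{owner: make(map[string]string)} }

// tryClaim reports whether instance became the owner of jobID.
// It fails if any instance already holds the job.
func (c *claims) tryClaim(jobID, instance string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if _, taken := c.owner[jobID]; taken {
		return false
	}
	c.owner[jobID] = instance
	return true
}

// release frees jobID, but only if instance is its current owner.
func (c *claims) release(jobID, instance string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.owner[jobID] == instance {
		delete(c.owner, jobID)
	}
}

// ownerOf returns the current owner of jobID, or "" if it is unclaimed.
func (c *claims) ownerOf(jobID string) string {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.owner[jobID]
}

// processOnce simulates several engine instances racing for the same job
// and returns how many of them actually got to run it. The winner keeps
// the claim so that it holds for the duration of the race.
func processOnce(c *claims, jobID string, instances []string) int {
	var wg sync.WaitGroup
	var mu sync.Mutex
	ran := 0
	for _, inst := range instances {
		wg.Add(1)
		go func(inst string) {
			defer wg.Done()
			if !c.tryClaim(jobID, inst) {
				return // another instance owns the job; skip it
			}
			mu.Lock()
			ran++
			mu.Unlock()
		}(inst)
	}
	wg.Wait()
	return ran
}
```

However many instances race in `processOnce`, exactly one of them wins the claim, which is the property that prevents duplicate work on the same policy or alert.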
mastercactapus added the enhancement and River labels on Jan 15, 2025