Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add MaintenanceTasks.stuck_task_duration configuration #949

Merged
merged 1 commit into from
Jan 15, 2024

Conversation

nvasilevski
Copy link
Contributor

Having STUCK_TASK_TIMEOUT as a constant is not helpful since the actual value will be highly dependent on the job infrastructure of the applications. Simple setups may consider the task being stuck after no updates for ~ 1 minute while more complex and distributed systems with job retries and backoffs may legitimately prefer higher threshold to account for natural delays in a distributed system

It's a breaking change for applications that relied on STUCK_TASK_TIMEOUT though the fix is going to be pretty straightforward.

What reviewers should focus on

Is MaintenanceTasks module a right place for the config to live? One common alternative would be to define it on the Task class which will allow redefining it on per-task basis. But I don't think this would make sense and the behavior is job-infra specific rather than task-specific.

@nvasilevski nvasilevski requested a review from a team January 12, 2024 18:15
Copy link
Contributor

@adrianna-chang-shopify adrianna-chang-shopify left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm okay with this as a feature. I'm not a huge fan of the config name, but can't think of anything more appropriate right now. Maybe someone else on the team has ideas? threshold seems a bit verbose 😅 Maybe even stuck_task_duration or stuck_task_period to imply it's a duration?

@@ -32,7 +32,6 @@ class Run < ApplicationRecord
:cancelled,
]
COMPLETED_STATUSES = [:succeeded, :errored, :cancelled]
STUCK_TASK_TIMEOUT = 5.minutes
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was wondering about using #deprecate_constant, but MaintenanceTasks::Run is @api private, so I don't think we need to deprecate 👍 (Apps shouldn't be referencing this).

Comment on lines 95 to 98
# The threshold in seconds after which a task is considered stuck.
#
# @return [Integer] the threshold in seconds after which a task is considered stuck.
mattr_accessor :task_stuck_threshold, default: 5.minutes
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it can be an integer since we call #ago on it in #stuck? -- I think it needs to be an AS::Duration.

Suggested change
# The threshold in seconds after which a task is considered stuck.
#
# @return [Integer] the threshold in seconds after which a task is considered stuck.
mattr_accessor :task_stuck_threshold, default: 5.minutes
# The duration after which a task is considered stuck and can be force cancelled.
#
# @return [ActiveSupport::Duration, Numeric] duration after which a task is considered stuck.
mattr_accessor :task_stuck_threshold, default: 5.minutes

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, in your suggestion you are still adding Numeric, does it mean we want to support both

MaintenanceTasks.task_stuck_threshold = 60 # seconds implied 
MaintenanceTasks.task_stuck_threshold = 1.minute

I think I like the API being flexible, but perhaps instead of the return value we need to define writer so writer will accept several inputs but will always turn it into ActiveSupport::Duration. Does this sound reasonable?

Or maybe to avoid confusion we should choose one and that's it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went ahead with ActiveSupport::Duration only as I feel like its unlikely application will prefer setting this value in seconds like MaintenanceTasks.task_stuck_threshold = 300 # seconds implied

Copy link
Member

@gmcgibbon gmcgibbon Jan 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think just a duration is acceptable. Supporting multiple values isn't needed if we only use it one way.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep that SGTM 👍

@nvasilevski
Copy link
Contributor Author

@adrianna-chang-shopify

I like stuck_task_duration, let me make a change

@nvasilevski nvasilevski force-pushed the allow-stuck-time-threshold-to-be-configurable branch from afca161 to 2bdbb95 Compare January 12, 2024 20:06
Copy link
Contributor

@adrianna-chang-shopify adrianna-chang-shopify left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Can we add a section to the README under https://github.com/Shopify/maintenance_tasks?tab=readme-ov-file#configuring-the-gem to document this option before we merge?

@nvasilevski nvasilevski changed the title Add MaintenanceTasks.task_stuck_threshold configuration Add MaintenanceTasks.stuck_task_duration configuration Jan 15, 2024
@nvasilevski nvasilevski force-pushed the allow-stuck-time-threshold-to-be-configurable branch from 2bdbb95 to c56a2a4 Compare January 15, 2024 16:01
@nvasilevski
Copy link
Contributor Author

Added a section in the README. I'll merge and start preparing a release. But feel free to tweak the README entry anytime

@nvasilevski nvasilevski merged commit a150fbc into main Jan 15, 2024
41 checks passed
@nvasilevski nvasilevski deleted the allow-stuck-time-threshold-to-be-configurable branch January 15, 2024 16:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants