Add acking to RESUME_AFTER_DEPENDENCY message to the coordinator #1313

matt-aitken · 2024-09-17T16:25:58Z

We were just broadcasting RESUME_AFTER_DEPENDENCY to the coordinator before. If the coordinator crashed at the wrong time.

I've added a new message called RESUME_AFTER_DEPENDENCY_WITH_ACK which requires an ack to be sent back. If it's not sent back then we nack the message in the queue with a 5 second delay.

I've verified locally that this continues runs in normal situations, and if the coordinator is down.

Summary by CodeRabbit

New Features
- Introduced a new event handler for resuming tasks after dependencies are resolved, enhancing task coordination.
- Added acknowledgment mechanism for message handling, improving error handling.
- Added a new message type for resuming tasks with acknowledgment in the coordinator messages.
Improvements
- Enhanced control flow for task resumption, ensuring stricter management of dependencies and checkpoints.
- Centralized callback structure for improved code reusability and maintainability.

…e encoded)

changeset-bot · 2024-09-17T16:26:02Z

⚠️ No Changeset found

Latest commit: 9dadee2

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

coderabbitai · 2024-09-17T16:26:08Z

Walkthrough

The pull request introduces a new event handler, RESUME_AFTER_DEPENDENCY_WITH_ACK, to the TaskCoordinator class, which manages task resumption after resolving dependencies with an acknowledgment mechanism. Additionally, a new callback structure, ackCallbackResult, is defined to enhance maintainability, and existing handlers are updated to enforce stricter control over task resumption through the requiresCheckpointResumeWithMessage property.

Changes

File Path	Change Summary
`apps/coordinator/src/index.ts`	Added `RESUME_AFTER_DEPENDENCY_WITH_ACK` event handler for task resumption with acknowledgment; modified `WAIT_FOR_TASK` and `WAIT_FOR_BATCH` handlers to set `requiresCheckpointResumeWithMessage`.
`packages/core/src/v3/schemas/messages.ts`	Introduced `ackCallbackResult` for callback results, updated `PlatformToProviderMessages` to reference this new structure, and added `RESUME_AFTER_DEPENDENCY_WITH_ACK` message type to `PlatformToCoordinatorMessages`.

Sequence Diagram(s)

sequenceDiagram
    participant A as TaskCoordinator
    participant B as TaskSocket

    A->>B: Receive RESUME_AFTER_DEPENDENCY_WITH_ACK
    B->>A: Retrieve task socket
    A->>A: Check requiresCheckpointResumeWithMessage
    A->>A: Cancel checkpoint if needed
    A->>B: Emit RESUME_AFTER_DEPENDENCY
    A->>B: Send success/error response

🐰 "In the meadow, tasks resume,
With hugs of code, they find their room.
Acknowledge the flow, let errors flee,
In a dance of logs, we hop with glee!
Dependencies fade, like shadows at noon,
Together we code, a bright, joyful tune!" 🌼

Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 0efed14 and 9dadee2.

Files selected for processing (1)

apps/coordinator/src/index.ts (5 hunks)

Additional comments not posted (2)

apps/coordinator/src/index.ts (2)

165-207: LGTM!

The implementation of the new event handler RESUME_AFTER_DEPENDENCY_WITH_ACK looks good. It follows a clear and logical flow, handling different scenarios appropriately:

It correctly retrieves the task socket and handles the case when the socket is not found by returning a failure response.

The check for requiresCheckpointResumeWithMessage property ensures that the process is terminated if a checkpoint resume is necessary, preventing further execution.

If no checkpoint is required, the function correctly cancels any ongoing checkpoint, emits the RESUME_AFTER_DEPENDENCY event, and returns a success response.

The introduction of this new message type and acknowledgment mechanism enhances the reliability of message handling between the sender and the coordinator. The function integrates well with the existing codebase, utilizing the chaosMonkey.call() and #cancelCheckpoint methods appropriately.

838-848: Looks good!

The changes related to setting and unsetting the requiresCheckpointResumeWithMessage property in the WAIT_FOR_TASK and WAIT_FOR_BATCH handlers are implemented correctly:

Setting the property after a checkpoint is created ensures that the task can only resume from a checkpoint, enforcing stricter control over task resumption.

The property is set with the checkpoint location and docker information, providing the necessary details for resuming from the checkpoint.

Unsetting the property when the keepRunAlive flag is received allows the run to continue without exiting, providing flexibility in handling different scenarios.

These changes align with the introduction of the RESUME_AFTER_DEPENDENCY_WITH_ACK handler, which checks the requiresCheckpointResumeWithMessage property to determine if a checkpoint resume is necessary. They work together to ensure proper handling of checkpoint resumes and task resumption.

Also applies to: 921-931

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

Share

Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
-- I pushed a fix in commit <commit_id>, please review it.
-- Generate unit testing code for this file.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
-- @coderabbitai generate unit testing code for this file.
-- @coderabbitai modularize this function.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
-- @coderabbitai generate interesting stats about this repository and render them as a table.
-- @coderabbitai read src/utils.ts and generate unit testing code.
-- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
-- @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

@coderabbitai pause to pause the reviews on a PR.
@coderabbitai resume to resume the paused reviews.
@coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
@coderabbitai full review to do a full review from scratch and review all the files again.
@coderabbitai summary to regenerate the summary of the PR.
@coderabbitai resolve resolve all the CodeRabbit review comments.
@coderabbitai configuration to show the current CodeRabbit configuration for the repository.
@coderabbitai help to get help.

Other keywords and placeholders

Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (`.coderabbit.yaml`)

You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
Please see the configuration documentation for more information.
If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

apps/webapp/app/hooks/useSearchParam.ts

coderabbitai

Actionable comments posted: 3

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between a84e601 and 7256244.

Files selected for processing (5)

apps/coordinator/src/index.ts (1 hunks)
apps/webapp/app/hooks/useSearchParam.ts (1 hunks)
apps/webapp/app/v3/marqs/sharedQueueConsumer.server.ts (1 hunks)
apps/webapp/app/v3/services/resumeBatchRun.server.ts (1 hunks)
packages/core/src/v3/schemas/messages.ts (4 hunks)

Additional comments not posted (6)

apps/webapp/app/v3/services/resumeBatchRun.server.ts (1)

135-137: Improved logging for better traceability.

The addition of dependentTaskAttempt, checkpointEventId, and hasCheckpointEvent to the logging statement enhances the traceability of the service's operations. This extra context can be valuable for debugging and monitoring the behavior of the service during the resumption of dependent runs.

The boolean flag hasCheckpointEvent provides a clear indication of the checkpoint event's existence, making the logs more readable and informative.

packages/core/src/v3/schemas/messages.ts (3)

18-31: LGTM!

The ackCallbackResult discriminated union is a good choice for representing the callback result. It allows for type narrowing based on the success property and has a well-defined error object structure.

287-287: Good refactor!

Using the ackCallbackResult type for the callback field improves code reusability and maintainability. It ensures consistency in the callback structure across different message types.

520-530: Looks good!

The new RESUME_AFTER_DEPENDENCY_WITH_ACK message type enhances reliability by adding acknowledgment to the existing RESUME_AFTER_DEPENDENCY message. The message structure is consistent with other message types.

apps/coordinator/src/index.ts (1)

165-191: Approve the addition of the RESUME_AFTER_DEPENDENCY_WITH_ACK event handler.

The new event handler enhances the task coordination logic by providing a robust mechanism for acknowledging the resumption of tasks after dependencies are resolved. Key aspects include:

Retrieving the task socket using attemptFriendlyId to ensure the task can be resumed.

Comprehensive error handling and logging to provide visibility into any issues during the resumption process.

Invoking chaosMonkey.call() to introduce controlled disruption or testing, helping identify potential issues or edge cases.

Canceling any ongoing checkpoint associated with the task to ensure a clean resumption without interference from previous state.

Emitting a RESUME_AFTER_DEPENDENCY event through the task socket to signal the task to resume execution.

Returning a success response to indicate the successful initiation of the resumption process.

Overall, these changes enhance the reliability and observability of the task resumption process.

apps/webapp/app/v3/marqs/sharedQueueConsumer.server.ts (1)

728-768: Introduce robust task resumption mechanism with coordinator acknowledgment

The code changes introduce a reliable mechanism for resuming tasks after dependency resolution. Here's how it works:

The resumeMessage object is constructed with relevant information such as runId, attemptId, attemptFriendlyId, completions, and executions.

The RESUME_AFTER_DEPENDENCY_WITH_ACK event is broadcasted to the coordinator namespace using emitWithAck, passing the resumeMessage. This notifies the coordinators about the dependency resolution and waits for their acknowledgment.

The responses from the coordinators are logged for monitoring and debugging purposes.

If no responses are received within the 10-second timeout or any response indicates failure, an error is logged, and the message is nacked with a delay of 5 seconds using #nackAndDoMoreWork. This allows for retrying the operation after a short interval.

The error handling and retry mechanism ensure that the consumer can handle scenarios where the coordinators are unresponsive or encounter failures.

Consider adding more detailed logging statements to capture the specific details of the failures or timeouts for easier debugging. For example, you could log the error messages or status codes received from the coordinators.

apps/webapp/app/hooks/useSearchParam.ts

packages/core/src/v3/schemas/messages.ts

pkg-pr-new · 2024-09-17T16:35:24Z

pnpm add https://pkg.pr.new/triggerdotdev/trigger.dev/@trigger.dev/build@1313

pnpm add https://pkg.pr.new/triggerdotdev/trigger.dev@1313

pnpm add https://pkg.pr.new/triggerdotdev/trigger.dev/@trigger.dev/core@1313

pnpm add https://pkg.pr.new/triggerdotdev/trigger.dev/@trigger.dev/sdk@1313

commit: 9dadee2

…run with RESUME_AFTER_DEPENDENCY_WITH_ACK

…tResumeWithMessage to undefined

…intResumeWithMessage

matt-aitken added 4 commits September 17, 2024 11:13

Fix for run filtering not working with some special characters (doubl…

558731a

…e encoded)

Add the full dependentTaskAttempt to a ResumeBatchRunService log

7d9d653

Added RESUME_AFTER_DEPENDENCY_WITH_ACK

ed83fbe

Set the delay to 5s

7256244

matt-aitken commented Sep 17, 2024

View reviewed changes

apps/webapp/app/hooks/useSearchParam.ts Show resolved Hide resolved

matt-aitken requested a review from ericallam September 17, 2024 16:27

coderabbitai bot reviewed Sep 17, 2024

View reviewed changes

apps/webapp/app/hooks/useSearchParam.ts Show resolved Hide resolved

apps/webapp/app/hooks/useSearchParam.ts Show resolved Hide resolved

packages/core/src/v3/schemas/messages.ts Show resolved Hide resolved

nicktrn approved these changes Sep 17, 2024

View reviewed changes

matt-aitken added 3 commits September 17, 2024 22:21

If a checkpoint has been created, the coordinator won’t continue the …

2dd5d46

…run with RESUME_AFTER_DEPENDENCY_WITH_ACK

If we’re keeping the run alive then set socket.data.requiresCheckpoin…

0efed14

…tResumeWithMessage to undefined

Log out the data before and after setting socket.data.requiresCheckpo…

9dadee2

…intResumeWithMessage

matt-aitken merged commit 56a5b58 into main Sep 18, 2024
7 checks passed

matt-aitken deleted the frozen-runs-fix branch September 18, 2024 09:30

coderabbitai bot mentioned this pull request Nov 3, 2024

Prevent stuck runs after duplicate dependency resume messages #1459

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add acking to RESUME_AFTER_DEPENDENCY message to the coordinator #1313

Add acking to RESUME_AFTER_DEPENDENCY message to the coordinator #1313

matt-aitken commented Sep 17, 2024 •

edited by coderabbitai bot

Loading

changeset-bot bot commented Sep 17, 2024 •

edited

Loading

coderabbitai bot commented Sep 17, 2024 •

edited

Loading

Chat

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (`.coderabbit.yaml`)

Documentation and Community

coderabbitai bot left a comment

pkg-pr-new bot commented Sep 17, 2024 •

edited

Loading

Add acking to RESUME_AFTER_DEPENDENCY message to the coordinator #1313

Add acking to RESUME_AFTER_DEPENDENCY message to the coordinator #1313

Conversation

matt-aitken commented Sep 17, 2024 • edited by coderabbitai bot Loading

Summary by CodeRabbit

changeset-bot bot commented Sep 17, 2024 • edited Loading

⚠️ No Changeset found

coderabbitai bot commented Sep 17, 2024 • edited Loading

Walkthrough

Changes

Sequence Diagram(s)

Chat

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (.coderabbit.yaml)

Documentation and Community

coderabbitai bot left a comment

Choose a reason for hiding this comment

pkg-pr-new bot commented Sep 17, 2024 • edited Loading

matt-aitken commented Sep 17, 2024 •

edited by coderabbitai bot

Loading

changeset-bot bot commented Sep 17, 2024 •

edited

Loading

coderabbitai bot commented Sep 17, 2024 •

edited

Loading

CodeRabbit Configuration File (`.coderabbit.yaml`)

pkg-pr-new bot commented Sep 17, 2024 •

edited

Loading