Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add acking to RESUME_AFTER_DEPENDENCY message to the coordinator #1313

Merged
merged 7 commits into from
Sep 18, 2024

Conversation

matt-aitken
Copy link
Member

@matt-aitken matt-aitken commented Sep 17, 2024

We were just broadcasting RESUME_AFTER_DEPENDENCY to the coordinator before. If the coordinator crashed at the wrong time.

I've added a new message called RESUME_AFTER_DEPENDENCY_WITH_ACK which requires an ack to be sent back. If it's not sent back then we nack the message in the queue with a 5 second delay.

I've verified locally that this continues runs in normal situations, and if the coordinator is down.

Summary by CodeRabbit

  • New Features

    • Introduced a new event handler for resuming tasks after dependencies are resolved, enhancing task coordination.
    • Added acknowledgment mechanism for message handling, improving error handling.
    • Added a new message type for resuming tasks with acknowledgment in the coordinator messages.
  • Improvements

    • Enhanced control flow for task resumption, ensuring stricter management of dependencies and checkpoints.
    • Centralized callback structure for improved code reusability and maintainability.

Copy link

changeset-bot bot commented Sep 17, 2024

⚠️ No Changeset found

Latest commit: 9dadee2

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

Copy link
Contributor

coderabbitai bot commented Sep 17, 2024

Walkthrough

The pull request introduces a new event handler, RESUME_AFTER_DEPENDENCY_WITH_ACK, to the TaskCoordinator class, which manages task resumption after resolving dependencies with an acknowledgment mechanism. Additionally, a new callback structure, ackCallbackResult, is defined to enhance maintainability, and existing handlers are updated to enforce stricter control over task resumption through the requiresCheckpointResumeWithMessage property.

Changes

File Path Change Summary
apps/coordinator/src/index.ts Added RESUME_AFTER_DEPENDENCY_WITH_ACK event handler for task resumption with acknowledgment; modified WAIT_FOR_TASK and WAIT_FOR_BATCH handlers to set requiresCheckpointResumeWithMessage.
packages/core/src/v3/schemas/messages.ts Introduced ackCallbackResult for callback results, updated PlatformToProviderMessages to reference this new structure, and added RESUME_AFTER_DEPENDENCY_WITH_ACK message type to PlatformToCoordinatorMessages.

Sequence Diagram(s)

sequenceDiagram
    participant A as TaskCoordinator
    participant B as TaskSocket

    A->>B: Receive RESUME_AFTER_DEPENDENCY_WITH_ACK
    B->>A: Retrieve task socket
    A->>A: Check requiresCheckpointResumeWithMessage
    A->>A: Cancel checkpoint if needed
    A->>B: Emit RESUME_AFTER_DEPENDENCY
    A->>B: Send success/error response
Loading

🐰 "In the meadow, tasks resume,
With hugs of code, they find their room.
Acknowledge the flow, let errors flee,
In a dance of logs, we hop with glee!
Dependencies fade, like shadows at noon,
Together we code, a bright, joyful tune!" 🌼


Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 0efed14 and 9dadee2.

Files selected for processing (1)
  • apps/coordinator/src/index.ts (5 hunks)
Additional comments not posted (2)
apps/coordinator/src/index.ts (2)

165-207: LGTM!

The implementation of the new event handler RESUME_AFTER_DEPENDENCY_WITH_ACK looks good. It follows a clear and logical flow, handling different scenarios appropriately:

  • It correctly retrieves the task socket and handles the case when the socket is not found by returning a failure response.
  • The check for requiresCheckpointResumeWithMessage property ensures that the process is terminated if a checkpoint resume is necessary, preventing further execution.
  • If no checkpoint is required, the function correctly cancels any ongoing checkpoint, emits the RESUME_AFTER_DEPENDENCY event, and returns a success response.

The introduction of this new message type and acknowledgment mechanism enhances the reliability of message handling between the sender and the coordinator. The function integrates well with the existing codebase, utilizing the chaosMonkey.call() and #cancelCheckpoint methods appropriately.


838-848: Looks good!

The changes related to setting and unsetting the requiresCheckpointResumeWithMessage property in the WAIT_FOR_TASK and WAIT_FOR_BATCH handlers are implemented correctly:

  • Setting the property after a checkpoint is created ensures that the task can only resume from a checkpoint, enforcing stricter control over task resumption.
  • The property is set with the checkpoint location and docker information, providing the necessary details for resuming from the checkpoint.
  • Unsetting the property when the keepRunAlive flag is received allows the run to continue without exiting, providing flexibility in handling different scenarios.

These changes align with the introduction of the RESUME_AFTER_DEPENDENCY_WITH_ACK handler, which checks the requiresCheckpointResumeWithMessage property to determine if a checkpoint resume is necessary. They work together to ensure proper handling of checkpoint resumes and task resumption.

Also applies to: 921-931


Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

Share
Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    -- I pushed a fix in commit <commit_id>, please review it.
    -- Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    -- @coderabbitai generate unit testing code for this file.
    -- @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    -- @coderabbitai generate interesting stats about this repository and render them as a table.
    -- @coderabbitai read src/utils.ts and generate unit testing code.
    -- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    -- @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between a84e601 and 7256244.

Files selected for processing (5)
  • apps/coordinator/src/index.ts (1 hunks)
  • apps/webapp/app/hooks/useSearchParam.ts (1 hunks)
  • apps/webapp/app/v3/marqs/sharedQueueConsumer.server.ts (1 hunks)
  • apps/webapp/app/v3/services/resumeBatchRun.server.ts (1 hunks)
  • packages/core/src/v3/schemas/messages.ts (4 hunks)
Additional comments not posted (6)
apps/webapp/app/v3/services/resumeBatchRun.server.ts (1)

135-137: Improved logging for better traceability.

The addition of dependentTaskAttempt, checkpointEventId, and hasCheckpointEvent to the logging statement enhances the traceability of the service's operations. This extra context can be valuable for debugging and monitoring the behavior of the service during the resumption of dependent runs.

The boolean flag hasCheckpointEvent provides a clear indication of the checkpoint event's existence, making the logs more readable and informative.

packages/core/src/v3/schemas/messages.ts (3)

18-31: LGTM!

The ackCallbackResult discriminated union is a good choice for representing the callback result. It allows for type narrowing based on the success property and has a well-defined error object structure.


287-287: Good refactor!

Using the ackCallbackResult type for the callback field improves code reusability and maintainability. It ensures consistency in the callback structure across different message types.


520-530: Looks good!

The new RESUME_AFTER_DEPENDENCY_WITH_ACK message type enhances reliability by adding acknowledgment to the existing RESUME_AFTER_DEPENDENCY message. The message structure is consistent with other message types.

apps/coordinator/src/index.ts (1)

165-191: Approve the addition of the RESUME_AFTER_DEPENDENCY_WITH_ACK event handler.

The new event handler enhances the task coordination logic by providing a robust mechanism for acknowledging the resumption of tasks after dependencies are resolved. Key aspects include:

  • Retrieving the task socket using attemptFriendlyId to ensure the task can be resumed.
  • Comprehensive error handling and logging to provide visibility into any issues during the resumption process.
  • Invoking chaosMonkey.call() to introduce controlled disruption or testing, helping identify potential issues or edge cases.
  • Canceling any ongoing checkpoint associated with the task to ensure a clean resumption without interference from previous state.
  • Emitting a RESUME_AFTER_DEPENDENCY event through the task socket to signal the task to resume execution.
  • Returning a success response to indicate the successful initiation of the resumption process.

Overall, these changes enhance the reliability and observability of the task resumption process.

apps/webapp/app/v3/marqs/sharedQueueConsumer.server.ts (1)

728-768: Introduce robust task resumption mechanism with coordinator acknowledgment

The code changes introduce a reliable mechanism for resuming tasks after dependency resolution. Here's how it works:

  1. The resumeMessage object is constructed with relevant information such as runId, attemptId, attemptFriendlyId, completions, and executions.
  2. The RESUME_AFTER_DEPENDENCY_WITH_ACK event is broadcasted to the coordinator namespace using emitWithAck, passing the resumeMessage. This notifies the coordinators about the dependency resolution and waits for their acknowledgment.
  3. The responses from the coordinators are logged for monitoring and debugging purposes.
  4. If no responses are received within the 10-second timeout or any response indicates failure, an error is logged, and the message is nacked with a delay of 5 seconds using #nackAndDoMoreWork. This allows for retrying the operation after a short interval.

The error handling and retry mechanism ensure that the consumer can handle scenarios where the coordinators are unresponsive or encounter failures.

Consider adding more detailed logging statements to capture the specific details of the failures or timeouts for easier debugging. For example, you could log the error messages or status codes received from the coordinators.

apps/webapp/app/hooks/useSearchParam.ts Show resolved Hide resolved
apps/webapp/app/hooks/useSearchParam.ts Show resolved Hide resolved
packages/core/src/v3/schemas/messages.ts Show resolved Hide resolved
Copy link

pkg-pr-new bot commented Sep 17, 2024

pnpm add https://pkg.pr.new/triggerdotdev/trigger.dev/@trigger.dev/build@1313
pnpm add https://pkg.pr.new/triggerdotdev/trigger.dev@1313
pnpm add https://pkg.pr.new/triggerdotdev/trigger.dev/@trigger.dev/core@1313
pnpm add https://pkg.pr.new/triggerdotdev/trigger.dev/@trigger.dev/sdk@1313

commit: 9dadee2

@matt-aitken matt-aitken merged commit 56a5b58 into main Sep 18, 2024
7 checks passed
@matt-aitken matt-aitken deleted the frozen-runs-fix branch September 18, 2024 09:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants