Fix several restore and resume bugs #1418

nicktrn · 2024-10-18T15:30:00Z

Covers at least the following scenarios:

Fast, multiple, sequential (batch)triggerAndWait()
Specific cases of runs eternally stuck in the "frozen" (checkpointed) state
Parents of cancelled child runs being unable to resume
Better container cleanup for completed runs

Also adds detection of task exits due to SIGTERM and a helpful error message.

Summary by CodeRabbit

Release Notes

New Features
- Enhanced error handling in the RunError component with the addition of a Feedback option.
- New method to retrieve the latest checkpoint event based on restoration status.
- Improved logging and metadata capture during command execution.
- Introduced new error types and enhanced error reporting for task processing.
Bug Fixes
- Updated cleanup logic in the Checkpointer class for better robustness.
- Improved error handling and logging consistency across various services.
Chores
- Refactored logger usage in the ResumeAttemptService for better encapsulation.
- Adjusted transaction handling in the EnvironmentVariablesRepository for correct scoping.
- Enhanced error formatting with Prettier for consistency and readability.

changeset-bot · 2024-10-18T15:30:04Z

🦋 Changeset detected

Latest commit: cdbf5c6

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 9 packages

Name	Type
@trigger.dev/core	Patch
@trigger.dev/build	Patch
trigger.dev	Patch
@trigger.dev/sdk	Patch
@internal/redis-worker	Patch
@internal/zod-worker	Patch
@trigger.dev/database	Patch
@trigger.dev/otlp-importer	Patch
@internal/testcontainers	Patch

coderabbitai · 2024-10-18T15:30:08Z

Caution

Review failed

The pull request is closed.

Walkthrough

The pull request introduces several modifications across multiple files, primarily enhancing error handling, logging, and cleanup processes. Key changes include updates to the Checkpointer and Exec classes for improved logging and error management, adjustments in the RunError component for better error presentation, and modifications in various services to refine transaction handling and data retrieval. New methods and constants are added, while some existing method signatures are updated for improved usability and clarity.

Changes

File	Change Summary
apps/coordinator/src/checkpointer.ts	Updated `cleanup` function in `Checkpointer` class for robust abort controller removal; enhanced logging and error handling in `checkpointAndPush` method.
apps/coordinator/src/exec.ts	Enhanced `x` method in `Exec` class with new `globalOpts` object for metadata; updated output structure to include more detailed execution context.
apps/webapp/app/routes/resources.orgs.$organizationSlug.projects.v3.$projectParam.runs.$runParam.spans.$spanParam/route.tsx	Added imports and improved error handling in `RunError` component; checks `link.magic` to conditionally render `Feedback` component or fallback to `Callout`.
apps/webapp/app/v3/environmentVariables/environmentVariablesRepository.server.ts	Adjusted transaction handling in `create` method of `EnvironmentVariablesRepository` class for correct scoping; no functional changes.
apps/webapp/app/v3/handleSocketIo.server.ts	Modified `createCoordinatorNamespace` function to update payload retrieval in `CREATE_TASK_RUN_ATTEMPT` handler; improved error handling and logging.
apps/webapp/app/v3/marqs/sharedQueueConsumer.server.ts	Added `$replica` import and constants for task run statuses; significant changes in `doWorkInternal` method for handling message types; refactored `getExecutionPayloadFromAttempt` method to accept an object as an argument.
apps/webapp/app/v3/requeueTaskRun.server.ts	Updated `call` method in `RequeueTaskRunService` to enhance field selection for `taskRun` queries; added logic for handling completed task runs and emitting cancellation events.
apps/webapp/app/v3/services/restoreCheckpoint.server.ts	Added `getLastCheckpointEventIfUnrestored` method to `RestoreCheckpointService` class for querying checkpoint events; existing `call` method remains unchanged.
apps/webapp/app/v3/services/resumeAttempt.server.ts	Refactored logger usage in `ResumeAttemptService` class; updated `getExecutionPayloadFromAttempt` method to accept an object for better usability; removed `WaitReason` import.
packages/core/src/v3/errors.ts	Introduced new `ErrorLink` type and updated `EnhanceError` type; modified error handling logic to include new error type `TASK_PROCESS_SIGTERM`.
packages/core/src/v3/schemas/common.ts	Removed `TaskRunErrorCodes` constant and replaced it with a new declaration derived from the `TaskRunInternalError` schema; added new error codes to the `TaskRunInternalError` enum.

Possibly related PRs

Prevent abort signals from causing uncaught exceptions #1320: Enhances error handling within the Checkpointer class, related to the changes in the cleanup function.
Improve checkpoint reliability and cleanup of temp files #1367: Improves the cleanup process for checkpoint artifacts, aligning with the main PR's focus on cleanup logic in the Checkpointer class.
Prettier errors and related improvements #1387: Introduces improvements in error handling and logging, relating to the enhanced logging in the checkpointAndPush method of the main PR.
Fix updating many environment variables at once #1413: Restructures transaction logic for creating environment variables, relating to overall improvements in error handling and robustness in the Checkpointer class.

🐰 In the code, we hop and play,
With logs and errors, we clear the way.
Checkpoints restored, tasks requeued,
In our rabbit burrow, all's renewed!
So let’s celebrate with a joyful cheer,
For each little change brings us near! 🎉

pkg-pr-new · 2024-10-18T15:37:10Z

pnpm add https://pkg.pr.new/triggerdotdev/trigger.dev/@trigger.dev/build@1418

pnpm add https://pkg.pr.new/triggerdotdev/trigger.dev@1418

pnpm add https://pkg.pr.new/triggerdotdev/trigger.dev/@trigger.dev/core@1418

pnpm add https://pkg.pr.new/triggerdotdev/trigger.dev/@trigger.dev/sdk@1418

commit: 59d375b

coderabbitai

Actionable comments posted: 6

🧹 Outside diff range and nitpick comments (6)

apps/webapp/app/v3/services/restoreCheckpoint.server.ts (1)
116-134: Approve with suggestions for minor improvements

The new getLastCheckpointEventIfUnrestored method is a valuable addition that aligns with the PR objectives. It provides a way to retrieve the last checkpoint event, which can be useful for addressing restore and resume bugs. The implementation is concise and uses the Prisma ORM consistently with the rest of the class.

Here are a few suggestions to further improve the method:

Consider adding error handling for the database query. This will make the method more robust and easier to debug.

The method name might be slightly misleading as it doesn't explicitly check if the checkpoint is unrestored. Consider either renaming the method to something like getLastCheckpointEvent or adding a comment to explain its behavior.

Add a return type annotation to improve code clarity. For example:
async getLastCheckpointEventIfUnrestored(runId: string): Promise<CheckpointRestoreEvent | undefined> {
  // ... existing implementation ...
}
With these minor improvements, the method will be even more robust and maintainable.
apps/webapp/app/v3/services/resumeAttempt.server.ts (3)
22-22: Excellent logging improvements with a minor suggestion

The changes significantly enhance the logging mechanism:

Consistent use of the encapsulated _logger.

Creation of a child logger with rich context (attemptId, attemptFriendlyId, taskRun).

These improvements will greatly aid in debugging and tracing issues.

Consider moving the child logger creation to the beginning of the method, right after the attempt is found. This would ensure all logs within the method have the additional context:
if (!attempt) {
  this._logger.error("Could not find attempt", params);
  return;
}

this._logger = this._logger.child({
  attemptId: attempt.id,
  attemptFriendlyId: attempt.friendlyId,
  taskRun: attempt.taskRun,
});

// Rest of the method...
Also applies to: 81-81, 85-90

164-164: Great logging enhancements with a suggestion

The changes continue to improve the logging mechanism:

Consistent use of this._logger for error logging.

Creation of a child logger with additional context for each completed attempt.

These improvements will significantly aid in tracing and debugging issues related to individual completed attempts.

Consider renaming the logger variable to avoid shadowing the class property:
const completedAttemptLogger = this._logger.child({
  completedAttemptId: completedAttempt.id,
  completedAttemptFriendlyId: completedAttempt.friendlyId,
  completedRunId: completedAttempt.taskRunId,
});
This makes it clearer that we're creating a new logger instance for the completed attempt.

Also applies to: 187-187, 192-197

Line range hint 1-255: Suggestions for further improvements

The changes in this file significantly improve the logging and error handling. To further enhance the code quality and performance, consider the following suggestions:

Unit Tests: Given the complexity of the logic in this file, especially around resuming attempts and handling dependencies, it would be beneficial to add or expand unit tests. This will help ensure the correctness of the logic and make future refactoring easier.

Performance Optimization: The file contains several database operations, some of which are within loops. Consider reviewing these operations for potential performance optimizations. For example:

Can any of the database queries be combined or optimized?

Is there an opportunity to use batch operations for multiple updates?

Could any of the database operations benefit from caching frequently accessed data?

Error Handling: While the error logging has been improved, consider adding more specific error types or error codes. This could help in better categorizing and handling different types of errors that may occur during the resume process.

Would you like assistance in identifying specific areas for additional unit tests or performance optimizations?
apps/coordinator/src/exec.ts (1)
67-78: Improved metadata logging enhances debugging capabilities.

The addition of globalOpts, localOpts, and explicit output properties in the metadata object significantly improves the logging capabilities of the x method. This change provides more context for debugging and monitoring command executions.

Consider adding a timestamp field to the metadata object to further enhance debugging capabilities. This can be useful for tracking execution times and identifying potential performance issues.
metadata = {
  // ... existing fields ...
+ timestamp: new Date().toISOString(),
};
apps/webapp/app/v3/handleSocketIo.server.ts (1)
198-201: LGTM! Consider destructuring for consistency.

The changes to getExecutionPayloadFromAttempt look good. The new object parameter style improves readability and flexibility. This aligns well with the PR's objectives of addressing restore and resume bugs.

For consistency with the surrounding code style, consider using object destructuring:
const payload = await sharedQueueTasks.getExecutionPayloadFromAttempt({
  id: attempt.id,
  setToExecuting: true,
});
This minor change would align with modern JavaScript/TypeScript practices and maintain consistency with other parts of the codebase.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 235ab90 and 59d375b.

📒 Files selected for processing (11)

apps/coordinator/src/checkpointer.ts (1 hunks)
apps/coordinator/src/exec.ts (1 hunks)
apps/webapp/app/routes/resources.orgs.$organizationSlug.projects.v3.$projectParam.runs.$runParam.spans.$spanParam/route.tsx (3 hunks)
apps/webapp/app/v3/environmentVariables/environmentVariablesRepository.server.ts (1 hunks)
apps/webapp/app/v3/handleSocketIo.server.ts (1 hunks)
apps/webapp/app/v3/marqs/sharedQueueConsumer.server.ts (13 hunks)
apps/webapp/app/v3/requeueTaskRun.server.ts (2 hunks)
apps/webapp/app/v3/services/restoreCheckpoint.server.ts (1 hunks)
apps/webapp/app/v3/services/resumeAttempt.server.ts (6 hunks)
packages/core/src/v3/errors.ts (5 hunks)
packages/core/src/v3/schemas/common.ts (2 hunks)

✅ Files skipped from review due to trivial changes (1)

apps/webapp/app/v3/environmentVariables/environmentVariablesRepository.server.ts

🧰 Additional context used

🔇 Additional comments (21)

apps/webapp/app/v3/services/resumeAttempt.server.ts (3)

11-11: LGTM: Improved modularity and encapsulation

The changes here improve the code structure:

Importing sharedQueueTasks enhances modularity.

Introducing a private _logger property allows for better encapsulation of logging behavior.

These modifications align with good software engineering practices.

Also applies to: 17-18

92-92: LGTM: Consistent logger usage

These changes consistently apply the use of the encapsulated _logger throughout the method. This ensures that all log messages benefit from the additional context provided by the child logger, improving traceability and debugging capabilities.

Also applies to: 100-100, 117-117, 123-123, 137-137, 143-143

203-203: Good improvements with a request for clarification

The changes enhance the code in several ways:

Consistent use of the context-rich logger for error logging.

Updated method call to getExecutionPayloadFromAttempt with more explicit parameters.

Could you please clarify the implications of the skipStatusChecks: true parameter? While the comment suggests this is an optimization, it would be helpful to understand:

What status checks are being skipped?

Are there any potential risks associated with skipping these checks?

Is there a way to verify that skipping these checks is always safe in this context?

Consider adding a more detailed comment explaining the rationale behind this optimization.

Also applies to: 210-213, 216-216

packages/core/src/v3/schemas/common.ts (2)

92-92: Approved: New error codes enhance error handling capabilities

The addition of "TASK_PROCESS_SIGTERM" and "TASK_RUN_HEARTBEAT_TIMEOUT" to the TaskRunInternalError enum aligns well with the PR objectives. These new error codes will improve the system's ability to detect and handle specific error scenarios, such as task exits due to SIGTERM signals and potential timeout issues that could lead to "frozen" states.

Also applies to: 98-98

112-113: Approved: Improved error code management

The redefinition of TaskRunErrorCodes to derive its values directly from TaskRunInternalError.shape.code.enum is a smart refactoring. This change ensures that TaskRunErrorCodes always stays in sync with the error codes defined in TaskRunInternalError, reducing the risk of inconsistencies and simplifying future maintenance.
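The single-source-of-truth idea can be sketched without any schema library. This is a minimal illustration, not the code in `common.ts`: per the comment above, the PR derives the constant from the schema's `code` enum, whereas this sketch uses a plain `as const` array so it stays dependency-free. `INTERNAL_ERROR_CODES_SKETCH`, `toEnumObject`, and `TaskRunErrorCodesSketch` are hypothetical names, and only the two codes mentioned in this review are listed.

```typescript
// Hypothetical stand-in for the list of codes declared on the schema.
const INTERNAL_ERROR_CODES_SKETCH = [
  "TASK_PROCESS_SIGTERM",
  "TASK_RUN_HEARTBEAT_TIMEOUT",
] as const;

// Builds an object whose keys and values are both the literal codes,
// so adding a code to the array is the only change ever needed.
function toEnumObject<T extends string>(values: readonly T[]): { [K in T]: K } {
  const result: Record<string, string> = {};
  for (const value of values) {
    result[value] = value;
  }
  return result as unknown as { [K in T]: K };
}

// Derived constant: cannot drift from the array above.
const TaskRunErrorCodesSketch = toEnumObject(INTERNAL_ERROR_CODES_SKETCH);
```

With this shape, `TaskRunErrorCodesSketch.TASK_PROCESS_SIGTERM` is typed as the literal `"TASK_PROCESS_SIGTERM"`, and a typo in a code name fails at compile time.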

apps/webapp/app/v3/handleSocketIo.server.ts (1)

Line range hint 1-424: Changes align well with PR objectives

The modification to the CREATE_TASK_RUN_ATTEMPT handler in this file is a targeted change that aligns well with the PR's objective of addressing restore and resume bugs. The update to getExecutionPayloadFromAttempt appears to be part of a broader effort to improve the handling of task run attempts, which could contribute to resolving issues with runs becoming stuck or failing to resume properly.

The localized nature of the change minimizes the risk of unintended side effects while potentially improving the system's ability to manage task run attempts effectively. This update seems to be a positive step towards achieving the goals outlined in the PR summary.
apps/webapp/app/v3/requeueTaskRun.server.ts (2)
9-9: Import of 'socketIo' added correctly

The import statement for socketIo is properly added, enabling socket communication within this file.

107-107: Review the logic for 'delayInMs' computation

The expression for delayInMs:
delayInMs: taskRun.lockedToVersion?.supportsLazyAttempts ? 5_000 : undefined,
implies that if supportsLazyAttempts is true, delayInMs is 5000, otherwise it's undefined. Confirm that this aligns with the intended behavior, especially if supportsLazyAttempts can be undefined.

If the intention is to have no delay when supportsLazyAttempts is false or undefined, this logic is correct. Otherwise, you might consider setting a default value.
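To make the three cases concrete, here is a small sketch of how the expression evaluates. The `RunLike` type and the `computeDelayInMs` helper are hypothetical and only mirror the quoted expression; the real `taskRun` shape may differ.

```typescript
// Hypothetical, simplified shape of the run record used in the expression.
type RunLike = {
  lockedToVersion?: { supportsLazyAttempts?: boolean } | null;
};

// Mirrors the quoted expression: a delay only when the flag is strictly truthy.
function computeDelayInMs(run: RunLike): number | undefined {
  return run.lockedToVersion?.supportsLazyAttempts ? 5_000 : undefined;
}

const withLazyAttempts = computeDelayInMs({ lockedToVersion: { supportsLazyAttempts: true } });
const withoutLazyAttempts = computeDelayInMs({ lockedToVersion: { supportsLazyAttempts: false } });
const flagMissing = computeDelayInMs({ lockedToVersion: {} });
const versionMissing = computeDelayInMs({ lockedToVersion: null });
```

Only `withLazyAttempts` is `5000`. The other three are `undefined`, because optional chaining short-circuits to `undefined` and the ternary treats `false` and `undefined` alike.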
packages/core/src/v3/errors.ts (1)

389-396: Consistent use of optional chaining for error.message

In this block, you've correctly used optional chaining with error.message?.includes("SIGTERM"). This ensures that if error.message is undefined, it won't cause a runtime error.
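A minimal sketch of the pattern described here, not the actual code in `errors.ts`. The `InternalErrorLike` type, the `classifyExit` function, and the `SOME_OTHER_CODE` value are hypothetical; only the `TASK_PROCESS_SIGTERM` code and the `error.message?.includes("SIGTERM")` check come from this review.

```typescript
// Hypothetical error shape with an optional message.
type InternalErrorLike = { code: string; message?: string };

// Returns the SIGTERM-specific code when the message mentions the signal,
// otherwise leaves the original code untouched.
function classifyExit(error: InternalErrorLike): string {
  // Optional chaining: if message is undefined the condition is simply falsy.
  if (error.message?.includes("SIGTERM")) {
    return "TASK_PROCESS_SIGTERM";
  }
  return error.code;
}

const sigtermExit = classifyExit({ code: "SOME_OTHER_CODE", message: "Process exited with signal SIGTERM" });
const noMessage = classifyExit({ code: "SOME_OTHER_CODE" });
```

An error without a message keeps its original code instead of throwing a `TypeError`, which is the point of the optional chaining.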

apps/coordinator/src/checkpointer.ts (1)

439-442: Prevent unintended removal of controllers in cleanup

By checking that this.#abortControllers.get(runId) equals the current controller before deleting, you ensure that only the controller associated with the current operation is removed. This prevents accidentally deleting controllers that may still be in use by other processes.
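The identity check can be sketched as follows. This is an illustration under assumptions, not the `Checkpointer` implementation: `ControllerRegistry`, `start`, and `has` are hypothetical names, and the real class keeps its map in a private `#abortControllers` field.

```typescript
// Hypothetical registry that tracks one AbortController per run ID.
class ControllerRegistry {
  private controllers = new Map<string, AbortController>();

  // Registers a fresh controller for the run and returns a cleanup callback
  // that is bound to this specific controller instance.
  start(runId: string): { controller: AbortController; cleanup: () => void } {
    const controller = new AbortController();
    this.controllers.set(runId, controller);

    const cleanup = () => {
      // Only delete if the map still points at *our* controller. A newer
      // operation for the same run may have replaced it in the meantime.
      if (this.controllers.get(runId) === controller) {
        this.controllers.delete(runId);
      }
    };

    return { controller, cleanup };
  }

  has(runId: string): boolean {
    return this.controllers.has(runId);
  }
}

const registry = new ControllerRegistry();
const first = registry.start("run_1");
const second = registry.start("run_1"); // replaces the first controller
first.cleanup(); // stale cleanup: must not remove the second controller
const stillRegistered = registry.has("run_1");
second.cleanup();
const removedAfterOwnCleanup = !registry.has("run_1");
```

Without the equality check, the late `first.cleanup()` call would delete the entry that now belongs to the second operation, leaving it impossible to abort.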

apps/webapp/app/v3/marqs/sharedQueueConsumer.server.ts (11)

24-24: Usage of $replica for read operations

Introducing $replica for read operations in the #heartbeat method is appropriate to reduce the load on the primary database. Make sure that any potential for stale data does not negatively impact the application's behavior.

46-51: Imported final status constants and utility functions correctly

The constants FINAL_ATTEMPT_STATUSES, FINAL_RUN_STATUSES, and the functions isFinalAttemptStatus, isFinalRunStatus are correctly imported from ../taskStatus and are used appropriately in status checks throughout the code.

628-630: Proper status check using notIn with FINAL_RUN_STATUSES

The query correctly filters out runs with final statuses using status: { notIn: FINAL_RUN_STATUSES }, ensuring only active runs are processed.

644-650: Verify resuming runs not in 'EXECUTING' status

The code logs a warning when resumableRun.status is not 'EXECUTING' but proceeds to attempt a resume. Ensure that resuming runs in other statuses won't lead to unexpected behavior or conflicts.

Consider verifying if additional statuses should be handled differently before attempting to resume.

Line range hint 737-870: Enhanced error handling and retry logic during resumption

The updates improve error handling by logging detailed warnings and attempting to restore checkpoints when resumption fails. Ensure that the retry mechanism does not result in infinite loops in cases of persistent failures.

Consider implementing a retry limit to prevent potential infinite loops due to continuous failures.
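One way such a limit could look, as a rough sketch only. `retryWithLimit` and `RetryOutcome` are hypothetical, and the operation is synchronous here to keep the example short; the real resume path is asynchronous and would await each attempt.

```typescript
type RetryOutcome = { succeeded: boolean; attempts: number };

// Runs the operation until it reports success or the attempt budget is spent.
function retryWithLimit(operation: () => boolean, maxAttempts: number): RetryOutcome {
  let attempts = 0;
  while (attempts < maxAttempts) {
    attempts += 1;
    if (operation()) {
      return { succeeded: true, attempts };
    }
  }
  // Budget exhausted: the caller decides whether to fail the run or escalate.
  return { succeeded: false, attempts };
}

// A persistently failing operation stops after three attempts instead of looping forever.
const alwaysFails = retryWithLimit(() => false, 3);

// An operation that succeeds on its second call stops early.
let calls = 0;
const succeedsSecondTime = retryWithLimit(() => ++calls === 2, 3);
```

Returning the attempt count alongside the result gives the caller something concrete to log when the limit is hit.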

995-995: Correct usage of FINAL_ATTEMPT_STATUSES in status check

The condition status: { in: FINAL_ATTEMPT_STATUSES } accurately checks for attempts with final statuses, ensuring only completed attempts are processed further.

1041-1051: Updated method signature with named parameters

The getExecutionPayloadFromAttempt method now accepts an object with named parameters, including the new skipStatusChecks option. This enhances the method's flexibility and readability.

1084-1106: Conditional status checks based on skipStatusChecks

Introducing the skipStatusChecks flag allows conditional execution of status validations. The status checks are properly scoped within the if (!skipStatusChecks) block, and the switch cases are correctly structured.
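A simplified sketch of how a named-options argument with a `skipStatusChecks` flag might gate the status validation. Everything except the flag name is an assumption: `getExecutionPayloadSketch`, the in-memory `attemptStatuses` map, and the status strings are hypothetical stand-ins for the real method and database lookup.

```typescript
type PayloadOptions = { id: string; skipStatusChecks?: boolean };
type PayloadSketch = { id: string; status: string };

// Hypothetical in-memory stand-in for the attempt lookup.
const attemptStatuses = new Map<string, string>([
  ["attempt_1", "EXECUTING"],
  ["attempt_2", "CANCELED"],
]);

function getExecutionPayloadSketch({
  id,
  skipStatusChecks = false,
}: PayloadOptions): PayloadSketch | undefined {
  const status = attemptStatuses.get(id);
  if (status === undefined) {
    return undefined; // unknown attempt
  }

  if (!skipStatusChecks) {
    // Status validation only runs when the caller has not opted out.
    switch (status) {
      case "CANCELED":
      case "COMPLETED":
      case "FAILED":
        return undefined;
      default:
        break;
    }
  }

  return { id, status };
}

const rejected = getExecutionPayloadSketch({ id: "attempt_2" });
const allowed = getExecutionPayloadSketch({ id: "attempt_2", skipStatusChecks: true });
```

Because the flag defaults to `false`, existing callers that pass only `id` keep the stricter behavior.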

1257-1261: Adjusted method call to align with updated signature

The call to getExecutionPayloadFromAttempt now uses named parameters, aligning with the updated method signature. This change improves code clarity and reduces the risk of passing incorrect arguments.

1336-1342: Refactored heartbeat logic into a private method #heartbeat

Consolidating the heartbeat functionality into the #heartbeat private method reduces code duplication and simplifies maintenance. The methods taskHeartbeat and taskRunHeartbeat appropriately delegate to this new method.

1353-1411: Efficient heartbeat handling with read replica and run cancellation

The #heartbeat method efficiently uses $replica to reduce load on the primary database. It correctly handles final run statuses by emitting a REQUEST_RUN_CANCELLATION event, ensuring that leftover processes are terminated appropriately.
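The decision this comment describes can be sketched roughly as below. This is not the `#heartbeat` implementation: the status strings, `isFinalRunStatusSketch`, `decideHeartbeatAction`, and the `EXTEND_VISIBILITY` and `IGNORE` action names are hypothetical. Only the `REQUEST_RUN_CANCELLATION` event name comes from the review.

```typescript
// Hypothetical set of final statuses; the real list lives in the webapp's task status helpers.
const FINAL_RUN_STATUSES_SKETCH = ["COMPLETED", "CANCELED", "FAILED"] as const;
type FinalRunStatusSketch = (typeof FINAL_RUN_STATUSES_SKETCH)[number];

function isFinalRunStatusSketch(status: string): status is FinalRunStatusSketch {
  return (FINAL_RUN_STATUSES_SKETCH as readonly string[]).includes(status);
}

type HeartbeatAction =
  | { type: "IGNORE" }
  | { type: "EXTEND_VISIBILITY"; runId: string }
  | { type: "REQUEST_RUN_CANCELLATION"; runId: string };

// Decides what to do with a heartbeat given the run as read from storage.
function decideHeartbeatAction(run: { id: string; status: string } | undefined): HeartbeatAction {
  if (!run) {
    return { type: "IGNORE" }; // nothing known about this run
  }
  if (isFinalRunStatusSketch(run.status)) {
    // A finished run is still sending heartbeats: ask the leftover process to stop.
    return { type: "REQUEST_RUN_CANCELLATION", runId: run.id };
  }
  return { type: "EXTEND_VISIBILITY", runId: run.id };
}

const leftover = decideHeartbeatAction({ id: "run_1", status: "COMPLETED" });
const active = decideHeartbeatAction({ id: "run_2", status: "EXECUTING" });
```

Since the status is read from a replica in the PR, a real version would also need to tolerate slightly stale reads, for example by treating a cancellation request as idempotent.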

...s.orgs.$organizationSlug.projects.v3.$projectParam.runs.$runParam.spans.$spanParam/route.tsx

apps/webapp/app/v3/requeueTaskRun.server.ts

packages/core/src/v3/errors.ts

* try to correct resume messages with missing checkpoint
* prevent creating checkpoints for outdated task waits
* prevent creating checkpoints for outdated batch waits
* use heartbeats to check for and clean up any leftover containers
* lint
* improve exec logging
* improve resume attempt logs
* fix for resuming parents of canceled child runs
* separate SIGTERM from maybe OOM errors
* pretty errors can have magic dashboard links
* prevent uncancellable checkpoints
* simplify task run error code enum export
* grab the last, not the first child run
* Revert "prevent creating checkpoints for outdated batch waits" This reverts commit f2b5c2a.
* Revert "grab the last, not the first child run" This reverts commit 89ec5c8.
* Revert "prevent creating checkpoints for outdated task waits" This reverts commit 11066b4.
* more logs for resume message handling
* add magic error link comment
* add changeset

* refactor finalize run service
* refactor complete attempt service
* remove separate graceful exit handling
* refactor task status helpers
* clearly separate statuses in prisma schema
* all non-final statuses should be failable
* new import payload error code
* store default retry config if none set on task
* failed run service now respects retries
* fix merged task retry config indexing
* some errors should never be retried
* finalize run service takes care of acks now
* execution payload helper now with single object arg
* internal error code enum export
* unify failed and crashed run retries
* Prevent uncaught socket ack exceptions (#1415)
* catch all the remaining socket acks that could possibly throw
* wrap the remaining handlers in try catch
* New onboarding question (#1404)
* Updated “Twitter” to be “X (Twitter)”
* added Textarea to storybook
* Updated textarea styling to match input field
* WIP adding new text field to org creation page
* Added description to field
* Submit feedback to Plain when an org signs up
* Formatting improvement
* type improvement
* removed userId
* Moved submitting to Plain into its own file
* Change orgName with name
* use sendToPlain function for the help & feedback email form
* use name not orgName
* import cleanup
* Downgrading plan form uses sendToPlain
* Get the userId from requireUser only
* Added whitespace-pre-wrap to the message property on the run page
* use requireUserId
* Removed old Plain submit code
* Added a new Context page for the docs (#1416)
* Added a new context page with task context properties
* Removed code comments
* Added more crosslinks
* Fix updating many environment variables at once (#1413)
* Move code example to the side menu
* New docs example for creating a HN email summary
* doc: add instructions to create new reference project and run it locally (#1417)
* doc: add instructions to create new reference project and run it locally
* doc: Add instruction for running tunnel
* minor language improvement
* Fix several restore and resume bugs (#1418)
* try to correct resume messages with missing checkpoint
* prevent creating checkpoints for outdated task waits
* prevent creating checkpoints for outdated batch waits
* use heartbeats to check for and clean up any leftover containers
* lint
* improve exec logging
* improve resume attempt logs
* fix for resuming parents of canceled child runs
* separate SIGTERM from maybe OOM errors
* pretty errors can have magic dashboard links
* prevent uncancellable checkpoints
* simplify task run error code enum export
* grab the last, not the first child run
* Revert "prevent creating checkpoints for outdated batch waits" This reverts commit f2b5c2a.
* Revert "grab the last, not the first child run" This reverts commit 89ec5c8.
* Revert "prevent creating checkpoints for outdated task waits" This reverts commit 11066b4.
* more logs for resume message handling
* add magic error link comment
* add changeset
* chore: Update version for release (#1410) Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
* Release 3.0.13
* capture ffmpeg oom errors
* respect maxAttempts=1 when failing before first attempt creation
* request worker exit on fatal errors
* fix error code merge
* add new error code to should retry
* pretty segfault errors
* pretty internal errors for attempt spans
* decrease oom false positives
* fix timeline event color for failed runs
* auto-retry packet import and export
* add sdk version check and complete event while completing attempt
* all internal errors become crashes by default
* use pretty error helpers exclusively
* error to debug log
* zodfetch fixes
* rename import payload to task input error
* fix true non-zero exit error display
* fix retry config parsing
* correctly mark crashes as crashed
* add changeset
* remove non-zero exit comment
* pretend we don't support default default retry configs yet

---------

Co-authored-by: James Ritchie <[email protected]>
Co-authored-by: shubham yadav <[email protected]>
Co-authored-by: Tarun Pratap Singh <[email protected]>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

nicktrn added 19 commits October 17, 2024 09:31

try to correct resume messages with missing checkpoint

623f158

prevent creating checkpoints for outdated task waits

11066b4

prevent creating checkpoints for outdated batch waits

f2b5c2a

use heartbeats to check for and clean up any leftover containers

d756a16

Merge remote-tracking branch 'origin/main' into fix/invalid-resume-me…

5364558

…ssages

lint

df15d6a

improve exec logging

e003d25

improve resume attempt logs

9af6018

fix for resuming parents of canceled child runs

4c8618d

separate SIGTERM from maybe OOM errors

12ad920

Merge remote-tracking branch 'origin/main' into fix/resume-restore-bugs

13faa69

pretty errors can have magic dashboard links

a9928be

prevent uncancellable checkpoints

2d84b7c

simplify task run error code enum export

34d9759

grab the last, not the first child run

89ec5c8

Revert "prevent creating checkpoints for outdated batch waits"

5c262fd

This reverts commit f2b5c2a.

Revert "grab the last, not the first child run"

e6afbb4

This reverts commit 89ec5c8.

Revert "prevent creating checkpoints for outdated task waits"

40d80f9

This reverts commit 11066b4.

more logs for resume message handling

59d375b

coderabbitai bot reviewed Oct 18, 2024

nicktrn added 2 commits October 18, 2024 16:43

add magic error link comment

3604d83

add changeset

cdbf5c6

nicktrn merged commit 90593ad into main Oct 18, 2024
0 of 7 checks passed

nicktrn deleted the fix/resume-restore-bugs branch October 18, 2024 15:46

github-actions bot mentioned this pull request Oct 18, 2024

chore: Update version for release #1410

Merged

coderabbitai bot mentioned this pull request Oct 22, 2024

Automatically reattempt after internal errors #1424

Merged

7 tasks

This was referenced Oct 23, 2024

fix metadata system not updating because of dual package hazard #1428

Merged

fix: removing schedule instance no longer crashes #1430

Merged

This was referenced Oct 28, 2024

Fixes for internal error reattempts #1436

Merged

Pass init output to all error handlers #1441

Merged

Improve resume reliability of restored runs #1458

Merged

This was referenced Nov 19, 2024

Upgrade zod to 3.23.8 #1484

Merged

fix: efficient task trigger queue updates #1489

Closed

Fix: trigger queue updates #1491

Merged

This was referenced Nov 29, 2024

Use one-time use tokens when triggering or batch triggering from the frontend #1515

Merged

Feat: Add button to try and resume pending batches #1529

Merged

Fix cancelled runs breaking realtime subscriptions #1533

Merged

coderabbitai bot mentioned this pull request Dec 11, 2024

Changes frozen to waiting #1551

Merged

This was referenced Dec 19, 2024

Focus visible style refinements #1577

Merged

Add Support for Alternate Email Transports #1580

Merged
