Fix several restore and resume bugs #1418

Merged
merged 21 commits into from
Oct 18, 2024

Conversation

nicktrn
Collaborator

@nicktrn nicktrn commented Oct 18, 2024

Covers at least the following scenarios:

  • Fast, multiple, sequential (batch)triggerAndWait()
  • Specific cases of runs eternally stuck in the "frozen" (checkpointed) state
  • Parents of cancelled child runs being unable to resume
  • Better container cleanup for completed runs

Also adds detection of task exits due to SIGTERM and a helpful error message.

Summary by CodeRabbit

Release Notes

  • New Features

    • Enhanced error handling in the RunError component with the addition of a Feedback option.
    • New method to retrieve the latest checkpoint event based on restoration status.
    • Improved logging and metadata capture during command execution.
    • Introduced new error types and enhanced error reporting for task processing.
  • Bug Fixes

    • Updated cleanup logic in the Checkpointer class for better robustness.
    • Improved error handling and logging consistency across various services.
  • Chores

    • Refactored logger usage in the ResumeAttemptService for better encapsulation.
    • Adjusted transaction handling in the EnvironmentVariablesRepository for correct scoping.
    • Enhanced error formatting with Prettier for consistency and readability.


changeset-bot bot commented Oct 18, 2024

🦋 Changeset detected

Latest commit: cdbf5c6

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 9 packages
Name Type
@trigger.dev/core Patch
@trigger.dev/build Patch
trigger.dev Patch
@trigger.dev/sdk Patch
@internal/redis-worker Patch
@internal/zod-worker Patch
@trigger.dev/database Patch
@trigger.dev/otlp-importer Patch
@internal/testcontainers Patch


Contributor

coderabbitai bot commented Oct 18, 2024

Caution

Review failed

The pull request is closed.

Walkthrough

The pull request introduces several modifications across multiple files, primarily enhancing error handling, logging, and cleanup processes. Key changes include updates to the Checkpointer and Exec classes for improved logging and error management, adjustments in the RunError component for better error presentation, and modifications in various services to refine transaction handling and data retrieval. New methods and constants are added, while some existing method signatures are updated for improved usability and clarity.

Changes

File: Change Summary

  • apps/coordinator/src/checkpointer.ts: Updated cleanup function in Checkpointer class for robust abort controller removal; enhanced logging and error handling in checkpointAndPush method.
  • apps/coordinator/src/exec.ts: Enhanced x method in Exec class with new globalOpts object for metadata; updated output structure to include more detailed execution context.
  • apps/webapp/app/routes/resources.orgs.$organizationSlug.projects.v3.$projectParam.runs.$runParam.spans.$spanParam/route.tsx: Added imports and improved error handling in RunError component; checks link.magic to conditionally render Feedback component or fallback to Callout.
  • apps/webapp/app/v3/environmentVariables/environmentVariablesRepository.server.ts: Adjusted transaction handling in create method of EnvironmentVariablesRepository class for correct scoping; no functional changes.
  • apps/webapp/app/v3/handleSocketIo.server.ts: Modified createCoordinatorNamespace function to update payload retrieval in CREATE_TASK_RUN_ATTEMPT handler; improved error handling and logging.
  • apps/webapp/app/v3/marqs/sharedQueueConsumer.server.ts: Added $replica import and constants for task run statuses; significant changes in doWorkInternal method for handling message types; refactored getExecutionPayloadFromAttempt method to accept an object as an argument.
  • apps/webapp/app/v3/requeueTaskRun.server.ts: Updated call method in RequeueTaskRunService to enhance field selection for taskRun queries; added logic for handling completed task runs and emitting cancellation events.
  • apps/webapp/app/v3/services/restoreCheckpoint.server.ts: Added getLastCheckpointEventIfUnrestored method to RestoreCheckpointService class for querying checkpoint events; existing call method remains unchanged.
  • apps/webapp/app/v3/services/resumeAttempt.server.ts: Refactored logger usage in ResumeAttemptService class; updated getExecutionPayloadFromAttempt method to accept an object for better usability; removed WaitReason import.
  • packages/core/src/v3/errors.ts: Introduced new ErrorLink type and updated EnhanceError type; modified error handling logic to include new error type TASK_PROCESS_SIGTERM.
  • packages/core/src/v3/schemas/common.ts: Removed TaskRunErrorCodes constant and replaced it with a new declaration derived from the TaskRunInternalError schema; added new error codes to the TaskRunInternalError enum.


🐰 In the code, we hop and play,
With logs and errors, we clear the way.
Checkpoints restored, tasks requeued,
In our rabbit burrow, all's renewed!
So let’s celebrate with a joyful cheer,
For each little change brings us near! 🎉



pkg-pr-new bot commented Oct 18, 2024

pnpm add https://pkg.pr.new/triggerdotdev/trigger.dev/@trigger.dev/build@1418
pnpm add https://pkg.pr.new/triggerdotdev/trigger.dev@1418
pnpm add https://pkg.pr.new/triggerdotdev/trigger.dev/@trigger.dev/core@1418
pnpm add https://pkg.pr.new/triggerdotdev/trigger.dev/@trigger.dev/sdk@1418

commit: 59d375b

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 6

🧹 Outside diff range and nitpick comments (6)
apps/webapp/app/v3/services/restoreCheckpoint.server.ts (1)

116-134: Approve with suggestions for minor improvements

The new getLastCheckpointEventIfUnrestored method is a valuable addition that aligns with the PR objectives. It provides a way to retrieve the last checkpoint event, which can be useful for addressing restore and resume bugs. The implementation is concise and uses the Prisma ORM consistently with the rest of the class.

Here are a few suggestions to further improve the method:

  1. Consider adding error handling for the database query. This will make the method more robust and easier to debug.

  2. The method name might be slightly misleading as it doesn't explicitly check if the checkpoint is unrestored. Consider either renaming the method to something like getLastCheckpointEvent or adding a comment to explain its behavior.

  3. Add a return type annotation to improve code clarity. For example:

async getLastCheckpointEventIfUnrestored(runId: string): Promise<CheckpointRestoreEvent | undefined> {
  // ... existing implementation ...
}

With these minor improvements, the method will be even more robust and maintainable.

apps/webapp/app/v3/services/resumeAttempt.server.ts (3)

22-22: Excellent logging improvements with a minor suggestion

The changes significantly enhance the logging mechanism:

  1. Consistent use of the encapsulated _logger.
  2. Creation of a child logger with rich context (attemptId, attemptFriendlyId, taskRun).

These improvements will greatly aid in debugging and tracing issues.

Consider moving the child logger creation to the beginning of the method, right after the attempt is found. This would ensure all logs within the method have the additional context:

if (!attempt) {
  this._logger.error("Could not find attempt", params);
  return;
}

this._logger = this._logger.child({
  attemptId: attempt.id,
  attemptFriendlyId: attempt.friendlyId,
  taskRun: attempt.taskRun,
});

// Rest of the method...

Also applies to: 81-81, 85-90


164-164: Great logging enhancements with a suggestion

The changes continue to improve the logging mechanism:

  1. Consistent use of this._logger for error logging.
  2. Creation of a child logger with additional context for each completed attempt.

These improvements will significantly aid in tracing and debugging issues related to individual completed attempts.

Consider renaming the logger variable to avoid shadowing the class property:

const completedAttemptLogger = this._logger.child({
  completedAttemptId: completedAttempt.id,
  completedAttemptFriendlyId: completedAttempt.friendlyId,
  completedRunId: completedAttempt.taskRunId,
});

This makes it clearer that we're creating a new logger instance for the completed attempt.

Also applies to: 187-187, 192-197


Line range hint 1-255: Suggestions for further improvements

The changes in this file significantly improve the logging and error handling. To further enhance the code quality and performance, consider the following suggestions:

  1. Unit Tests: Given the complexity of the logic in this file, especially around resuming attempts and handling dependencies, it would be beneficial to add or expand unit tests. This will help ensure the correctness of the logic and make future refactoring easier.

  2. Performance Optimization: The file contains several database operations, some of which are within loops. Consider reviewing these operations for potential performance optimizations. For example:

    • Can any of the database queries be combined or optimized?
    • Is there an opportunity to use batch operations for multiple updates?
    • Could any of the database operations benefit from caching frequently accessed data?
  3. Error Handling: While the error logging has been improved, consider adding more specific error types or error codes. This could help in better categorizing and handling different types of errors that may occur during the resume process.

Would you like assistance in identifying specific areas for additional unit tests or performance optimizations?

apps/coordinator/src/exec.ts (1)

67-78: Improved metadata logging enhances debugging capabilities.

The addition of globalOpts, localOpts, and explicit output properties in the metadata object significantly improves the logging capabilities of the x method. This change provides more context for debugging and monitoring command executions.

Consider adding a timestamp field to the metadata object to further enhance debugging capabilities. This can be useful for tracking execution times and identifying potential performance issues.

metadata = {
  // ... existing fields ...
+ timestamp: new Date().toISOString(),
};
apps/webapp/app/v3/handleSocketIo.server.ts (1)

198-201: LGTM! Consider destructuring for consistency.

The changes to getExecutionPayloadFromAttempt look good. The new object parameter style improves readability and flexibility. This aligns well with the PR's objectives of addressing restore and resume bugs.

For consistency with the surrounding code style, consider using object destructuring:

const payload = await sharedQueueTasks.getExecutionPayloadFromAttempt({
  id: attempt.id,
  setToExecuting: true,
});

This minor change would align with modern JavaScript/TypeScript practices and maintain consistency with other parts of the codebase.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 235ab90 and 59d375b.

📒 Files selected for processing (11)
  • apps/coordinator/src/checkpointer.ts (1 hunks)
  • apps/coordinator/src/exec.ts (1 hunks)
  • apps/webapp/app/routes/resources.orgs.$organizationSlug.projects.v3.$projectParam.runs.$runParam.spans.$spanParam/route.tsx (3 hunks)
  • apps/webapp/app/v3/environmentVariables/environmentVariablesRepository.server.ts (1 hunks)
  • apps/webapp/app/v3/handleSocketIo.server.ts (1 hunks)
  • apps/webapp/app/v3/marqs/sharedQueueConsumer.server.ts (13 hunks)
  • apps/webapp/app/v3/requeueTaskRun.server.ts (2 hunks)
  • apps/webapp/app/v3/services/restoreCheckpoint.server.ts (1 hunks)
  • apps/webapp/app/v3/services/resumeAttempt.server.ts (6 hunks)
  • packages/core/src/v3/errors.ts (5 hunks)
  • packages/core/src/v3/schemas/common.ts (2 hunks)
✅ Files skipped from review due to trivial changes (1)
  • apps/webapp/app/v3/environmentVariables/environmentVariablesRepository.server.ts
🧰 Additional context used
🔇 Additional comments (21)
apps/webapp/app/v3/services/resumeAttempt.server.ts (3)

11-11: LGTM: Improved modularity and encapsulation

The changes here improve the code structure:

  1. Importing sharedQueueTasks enhances modularity.
  2. Introducing a private _logger property allows for better encapsulation of logging behavior.

These modifications align with good software engineering practices.

Also applies to: 17-18


92-92: LGTM: Consistent logger usage

These changes consistently apply the use of the encapsulated _logger throughout the method. This ensures that all log messages benefit from the additional context provided by the child logger, improving traceability and debugging capabilities.

Also applies to: 100-100, 117-117, 123-123, 137-137, 143-143


203-203: Good improvements with a request for clarification

The changes enhance the code in several ways:

  1. Consistent use of the context-rich logger for error logging.
  2. Updated method call to getExecutionPayloadFromAttempt with more explicit parameters.

Could you please clarify the implications of the skipStatusChecks: true parameter? While the comment suggests this is an optimization, it would be helpful to understand:

  1. What status checks are being skipped?
  2. Are there any potential risks associated with skipping these checks?
  3. Is there a way to verify that skipping these checks is always safe in this context?

Consider adding a more detailed comment explaining the rationale behind this optimization.

Also applies to: 210-213, 216-216

packages/core/src/v3/schemas/common.ts (2)

92-92: Approved: New error codes enhance error handling capabilities

The addition of "TASK_PROCESS_SIGTERM" and "TASK_RUN_HEARTBEAT_TIMEOUT" to the TaskRunInternalError enum aligns well with the PR objectives. These new error codes will improve the system's ability to detect and handle specific error scenarios, such as task exits due to SIGTERM signals and potential timeout issues that could lead to "frozen" states.

Also applies to: 98-98


112-113: Approved: Improved error code management

The redefinition of TaskRunErrorCodes to derive its values directly from TaskRunInternalError.shape.code.enum is a smart refactoring. This change ensures that TaskRunErrorCodes always stays in sync with the error codes defined in TaskRunInternalError, reducing the risk of inconsistencies and simplifying future maintenance.
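
The idea of deriving the code lookup from a single list can be sketched without any schema library. This is only an illustration under assumptions: `INTERNAL_ERROR_CODES`, `toEnumObject`, and `ErrorCodes` are hypothetical names, the list is a two-item subset, and the actual change reads the values from the TaskRunInternalError schema instead.

```typescript
// Single source of truth for the codes (an illustrative subset only).
const INTERNAL_ERROR_CODES = [
  "TASK_PROCESS_SIGTERM",
  "TASK_RUN_HEARTBEAT_TIMEOUT",
] as const;

// Union type of every code in the list above.
type InternalErrorCode = (typeof INTERNAL_ERROR_CODES)[number];

// Hypothetical helper: builds a { CODE: "CODE" } lookup from the list,
// so the lookup cannot drift from the list it was derived from.
function toEnumObject<T extends string>(values: readonly T[]): { [K in T]: K } {
  const result: Record<string, string> = {};
  for (const value of values) {
    result[value] = value;
  }
  return result as unknown as { [K in T]: K };
}

const ErrorCodes = toEnumObject(INTERNAL_ERROR_CODES);

// The derived object is usable wherever a code is expected.
const sigtermCode: InternalErrorCode = ErrorCodes.TASK_PROCESS_SIGTERM;
```

Adding a code to the list makes it available on the lookup object with no second edit, which is the maintenance benefit described above.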

apps/webapp/app/v3/handleSocketIo.server.ts (1)

Line range hint 1-424: Changes align well with PR objectives

The modification to the CREATE_TASK_RUN_ATTEMPT handler in this file is a targeted change that aligns well with the PR's objective of addressing restore and resume bugs. The update to getExecutionPayloadFromAttempt appears to be part of a broader effort to improve the handling of task run attempts, which could contribute to resolving issues with runs becoming stuck or failing to resume properly.

The localized nature of the change minimizes the risk of unintended side effects while potentially improving the system's ability to manage task run attempts effectively. This update seems to be a positive step towards achieving the goals outlined in the PR summary.

apps/webapp/app/v3/requeueTaskRun.server.ts (2)

9-9: Import of 'socketIo' added correctly

The import statement for socketIo is properly added, enabling socket communication within this file.


107-107: Review the logic for 'delayInMs' computation

The expression for delayInMs:

delayInMs: taskRun.lockedToVersion?.supportsLazyAttempts ? 5_000 : undefined,

implies that if supportsLazyAttempts is true, delayInMs is 5000, otherwise it's undefined. Confirm that this aligns with the intended behavior, especially if supportsLazyAttempts can be undefined.

If the intention is to have no delay when supportsLazyAttempts is false or undefined, this logic is correct. Otherwise, you might consider setting a default value.
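
To make the undefined case concrete, here is a small sketch of the same expression wrapped in a function. The `WorkerVersion` interface and `requeueDelayMs` name are hypothetical and exist only for this illustration.

```typescript
// Hypothetical shape; only the field used by the expression is modelled.
interface WorkerVersion {
  supportsLazyAttempts?: boolean;
}

// Mirrors the reviewed expression: a delay is set only when the flag is truthy.
function requeueDelayMs(lockedToVersion?: WorkerVersion | null): number | undefined {
  return lockedToVersion?.supportsLazyAttempts ? 5_000 : undefined;
}

// A missing version, a missing flag, and an explicit false all yield no delay.
const delays = [
  requeueDelayMs(undefined),
  requeueDelayMs(null),
  requeueDelayMs({}),
  requeueDelayMs({ supportsLazyAttempts: false }),
  requeueDelayMs({ supportsLazyAttempts: true }),
];
```

Only the last call produces a delay, so a missing or undefined flag behaves the same as false.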

packages/core/src/v3/errors.ts (1)

389-396: Consistent use of optional chaining for error.message

In this block, you've correctly used optional chaining with error.message?.includes("SIGTERM"). This ensures that if error.message is undefined, it won't cause a runtime error.
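
A minimal sketch of that guard, assuming an error object whose message may be absent. The `ProcessError` interface and `mentionsSigterm` function are hypothetical names, not the code in errors.ts.

```typescript
// Hypothetical error shape; `message` is optional on purpose.
interface ProcessError {
  code: string;
  message?: string;
}

// True only when a message exists and mentions SIGTERM.
// Optional chaining yields undefined for a missing message,
// and `?? false` turns that into a plain boolean.
function mentionsSigterm(error: ProcessError): boolean {
  return error.message?.includes("SIGTERM") ?? false;
}
```

Without the `?.`, calling `includes` on an undefined message would throw a TypeError at runtime.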

apps/coordinator/src/checkpointer.ts (1)

439-442: Prevent unintended removal of controllers in cleanup

By checking that this.#abortControllers.get(runId) equals the current controller before deleting, you ensure that only the controller associated with the current operation is removed. This prevents accidentally deleting controllers that may still be in use by other processes.
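
The identity check can be sketched as follows, assuming a map from run ID to `AbortController`. The `ControllerRegistry` class and its method names are hypothetical; the real Checkpointer keeps this state in a private `#abortControllers` field.

```typescript
// Hypothetical registry illustrating "delete only if it is still mine".
class ControllerRegistry {
  private readonly controllers = new Map<string, AbortController>();

  // Register a new controller for a run, replacing any earlier one.
  start(runId: string): AbortController {
    const controller = new AbortController();
    this.controllers.set(runId, controller);
    return controller;
  }

  // Remove the entry only if it still points at the caller's controller.
  // Returns true when an entry was actually removed.
  cleanup(runId: string, controller: AbortController): boolean {
    if (this.controllers.get(runId) !== controller) {
      return false; // a newer operation owns this slot; leave it alone
    }
    return this.controllers.delete(runId);
  }

  has(runId: string): boolean {
    return this.controllers.has(runId);
  }
}
```

If a second operation for the same run starts before the first one cleans up, the first cleanup becomes a no-op and the second controller stays registered and cancellable.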

apps/webapp/app/v3/marqs/sharedQueueConsumer.server.ts (11)

24-24: Usage of $replica for read operations

Introducing $replica for read operations in the #heartbeat method is appropriate to reduce the load on the primary database. Make sure that any potential for stale data does not negatively impact the application's behavior.


46-51: Imported final status constants and utility functions correctly

The constants FINAL_ATTEMPT_STATUSES, FINAL_RUN_STATUSES, and the functions isFinalAttemptStatus, isFinalRunStatus are correctly imported from ../taskStatus and are used appropriately in status checks throughout the code.


628-630: Proper status check using notIn with FINAL_RUN_STATUSES

The query correctly filters out runs with final statuses using status: { notIn: FINAL_RUN_STATUSES }, ensuring only active runs are processed.


644-650: Verify resuming runs not in 'EXECUTING' status

The code logs a warning when resumableRun.status is not 'EXECUTING' but proceeds to attempt a resume. Ensure that resuming runs in other statuses won't lead to unexpected behavior or conflicts.

Consider verifying if additional statuses should be handled differently before attempting to resume.


Line range hint 737-870: Enhanced error handling and retry logic during resumption

The updates improve error handling by logging detailed warnings and attempting to restore checkpoints when resumption fails. Ensure that the retry mechanism does not result in infinite loops in cases of persistent failures.

Consider implementing a retry limit to prevent potential infinite loops due to continuous failures.
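
One way such a limit could look is sketched below. It is synchronous for brevity and `retryWithLimit` is a hypothetical helper, not part of the reviewed code, where the resume and restore steps are asynchronous.

```typescript
// Runs `operation` until it succeeds or `maxAttempts` is reached.
// The last error is rethrown once the limit is hit, so a persistent
// failure surfaces instead of looping forever.
function retryWithLimit<T>(operation: (attempt: number) => T, maxAttempts: number): T {
  if (maxAttempts < 1) {
    throw new RangeError("maxAttempts must be at least 1");
  }
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return operation(attempt);
    } catch (error) {
      lastError = error; // remember the failure and try again
    }
  }
  throw lastError;
}
```

A caller would pick the limit based on how long a run may reasonably stay unresumable before it should be failed outright.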


995-995: Correct usage of FINAL_ATTEMPT_STATUSES in status check

The condition status: { in: FINAL_ATTEMPT_STATUSES } accurately checks for attempts with final statuses, ensuring only completed attempts are processed further.


1041-1051: Updated method signature with named parameters

The getExecutionPayloadFromAttempt method now accepts an object with named parameters, including the new skipStatusChecks option. This enhances the method's flexibility and readability.


1084-1106: Conditional status checks based on skipStatusChecks

Introducing the skipStatusChecks flag allows conditional execution of status validations. The status checks are properly scoped within the if (!skipStatusChecks) block, and the switch cases are correctly structured.
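
The shape of that change can be sketched as below. Everything here is an assumption made for illustration: the status names, the in-memory `Map`, and the `getPayloadSketch` function are hypothetical and much simpler than the real getExecutionPayloadFromAttempt.

```typescript
// Hypothetical status names, chosen only for illustration.
type AttemptStatus = "PENDING" | "EXECUTING" | "PAUSED" | "COMPLETED" | "FAILED";

interface Attempt {
  id: string;
  status: AttemptStatus;
}

// Named parameters: callers state their intent at the call site.
interface PayloadOptions {
  id: string;
  setToExecuting?: boolean;
  skipStatusChecks?: boolean;
}

interface ExecutionPayload {
  attemptId: string;
  status: AttemptStatus;
}

function getPayloadSketch(
  attempts: Map<string, Attempt>,
  { id, setToExecuting = false, skipStatusChecks = false }: PayloadOptions
): ExecutionPayload | undefined {
  const attempt = attempts.get(id);
  if (!attempt) {
    return undefined;
  }

  if (!skipStatusChecks) {
    switch (attempt.status) {
      case "COMPLETED":
      case "FAILED":
        // Finished attempts yield no payload unless the caller opts out of checks.
        return undefined;
      default:
        break;
    }
  }

  if (setToExecuting) {
    attempt.status = "EXECUTING";
  }
  return { attemptId: attempt.id, status: attempt.status };
}
```

With positional booleans a call like `fn(id, true, false)` is easy to get wrong; the object form makes `skipStatusChecks: true` visible wherever it is used.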


1257-1261: Adjusted method call to align with updated signature

The call to getExecutionPayloadFromAttempt now uses named parameters, aligning with the updated method signature. This change improves code clarity and reduces the risk of passing incorrect arguments.


1336-1342: Refactored heartbeat logic into a private method #heartbeat

Consolidating the heartbeat functionality into the #heartbeat private method reduces code duplication and simplifies maintenance. The methods taskHeartbeat and taskRunHeartbeat appropriately delegate to this new method.


1353-1411: Efficient heartbeat handling with read replica and run cancellation

The #heartbeat method efficiently uses $replica to reduce load on the primary database. It correctly handles final run statuses by emitting a REQUEST_RUN_CANCELLATION event, ensuring that leftover processes are terminated appropriately.
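
The decision made on each heartbeat can be sketched as a pure function. The status list is an illustrative subset and `decideHeartbeatAction` and `EXTEND_HEARTBEAT` are hypothetical names; only the REQUEST_RUN_CANCELLATION event name comes from the review above.

```typescript
// Illustrative subset of final statuses; the real list lives in ../taskStatus.
const FINAL_RUN_STATUSES = ["COMPLETED", "FAILED", "CANCELED"] as const;
type FinalRunStatus = (typeof FINAL_RUN_STATUSES)[number];
type RunStatus = FinalRunStatus | "PENDING" | "EXECUTING";

// Type guard: narrows a status to the final subset.
function isFinalRunStatus(status: RunStatus): status is FinalRunStatus {
  return FINAL_RUN_STATUSES.some((finalStatus) => finalStatus === status);
}

type HeartbeatAction =
  | { type: "EXTEND_HEARTBEAT"; runId: string }
  | { type: "REQUEST_RUN_CANCELLATION"; runId: string };

// A heartbeat from a run that is already final means a process was left
// behind, so the sketch asks for cancellation instead of extending it.
function decideHeartbeatAction(run: { id: string; status: RunStatus }): HeartbeatAction {
  if (isFinalRunStatus(run.status)) {
    return { type: "REQUEST_RUN_CANCELLATION", runId: run.id };
  }
  return { type: "EXTEND_HEARTBEAT", runId: run.id };
}
```

Keeping the decision separate from the database read and the socket emit makes it easy to test without a replica or a live connection.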

@nicktrn nicktrn merged commit 90593ad into main Oct 18, 2024
0 of 7 checks passed
@nicktrn nicktrn deleted the fix/resume-restore-bugs branch October 18, 2024 15:46
nicktrn added a commit that referenced this pull request Oct 22, 2024
* try to correct resume messages with missing checkpoint

* prevent creating checkpoints for outdated task waits

* prevent creating checkpoints for outdated batch waits

* use heartbeats to check for and clean up any leftover containers

* lint

* improve exec logging

* improve resume attempt logs

* fix for resuming parents of canceled child runs

* separate SIGTERM from maybe OOM errors

* pretty errors can have magic dashboard links

* prevent uncancellable checkpoints

* simplify task run error code enum export

* grab the last, not the first child run

* Revert "prevent creating checkpoints for outdated batch waits"

This reverts commit f2b5c2a.

* Revert "grab the last, not the first child run"

This reverts commit 89ec5c8.

* Revert "prevent creating checkpoints for outdated task waits"

This reverts commit 11066b4.

* more logs for resume message handling

* add magic error link comment

* add changeset
nicktrn added a commit that referenced this pull request Oct 24, 2024
* refactor finalize run service

* refactor complete attempt service

* remove separate graceful exit handling

* refactor task status helpers

* clearly separate statuses in prisma schema

* all non-final statuses should be failable

* new import payload error code

* store default retry config if none set on task

* failed run service now respects retries

* fix merged task retry config indexing

* some errors should never be retried

* finalize run service takes care of acks now

* execution payload helper now with single object arg

* internal error code enum export

* unify failed and crashed run retries

* Prevent uncaught socket ack exceptions (#1415)

* catch all the remaining socket acks that could possibly throw

* wrap the remaining handlers in try catch

* New onboarding question (#1404)

* Updated “Twitter” to be “X (Twitter)”

* added Textarea to storybook

* Updated textarea styling to match input field

* WIP adding new text field to org creation page

* Added description to field

* Submit feedback to Plain when an org signs up

* Formatting improvement

* type improvement

* removed userId

* Moved submitting to Plain into its own file

* Change orgName with name

* use sendToPlain function for the help & feedback email form

* use name not orgName

* import cleanup

* Downgrading plan form uses sendToPlain

* Get the userId from requireUser only

* Added whitespace-pre-wrap to the message property on the run page

* use requireUserId

* Removed old Plain submit code

* Added a new Context page for the docs (#1416)

* Added a new context page with task context properties

* Removed code comments

* Added more crosslinks

* Fix updating many environment variables at once (#1413)

* Move code example to the side menu

* New docs example for creating a HN email summary

* doc: add instructions to create new reference project and run it locally (#1417)

* doc: add instructions to create new reference project and run it locally

* doc: Add instruction for running tunnel

* minor language improvement

* Fix several restore and resume bugs (#1418)

* try to correct resume messages with missing checkpoint

* prevent creating checkpoints for outdated task waits

* prevent creating checkpoints for outdated batch waits

* use heartbeats to check for and clean up any leftover containers

* lint

* improve exec logging

* improve resume attempt logs

* fix for resuming parents of canceled child runs

* separate SIGTERM from maybe OOM errors

* pretty errors can have magic dashboard links

* prevent uncancellable checkpoints

* simplify task run error code enum export

* grab the last, not the first child run

* Revert "prevent creating checkpoints for outdated batch waits"

This reverts commit f2b5c2a.

* Revert "grab the last, not the first child run"

This reverts commit 89ec5c8.

* Revert "prevent creating checkpoints for outdated task waits"

This reverts commit 11066b4.

* more logs for resume message handling

* add magic error link comment

* add changeset

* chore: Update version for release (#1410)

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

* Release 3.0.13

* capture ffmpeg oom errors

* respect maxAttempts=1 when failing before first attempt creation

* request worker exit on fatal errors

* fix error code merge

* add new error code to should retry

* pretty segfault errors

* pretty internal errors for attempt spans

* decrease oom false positives

* fix timeline event color for failed runs

* auto-retry packet import and export

* add sdk version check and complete event while completing attempt

* all internal errors become crashes by default

* use pretty error helpers exclusively

* error to debug log

* zodfetch fixes

* rename import payload to task input error

* fix true non-zero exit error display

* fix retry config parsing

* correctly mark crashes as crashed

* add changeset

* remove non-zero exit comment

* pretend we don't support default default retry configs yet

---------

Co-authored-by: James Ritchie <[email protected]>
Co-authored-by: shubham yadav <[email protected]>
Co-authored-by: Tarun Pratap Singh <[email protected]>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>