-
-
Notifications
You must be signed in to change notification settings - Fork 623
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Automatically reattempt after internal errors #1424
Conversation
* catch all the remaining socket acks that could possibly throw * wrap the remaining handlers in try catch
* Updated “Twitter” to be “X (Twitter)” * added Textarea to storybook * Updated textarea styling to match input field * WIP adding new text field to org creation page * Added description to field * Submit feedback to Plain when an org signs up * Formatting improvement * type improvement * removed userId * Moved submitting to Plain into its own file * Change orgName with name * use sendToPlain function for the help & feedback email form * use name not orgName * import cleanup * Downgrading plan form uses sendToPlain * Get the userId from requireUser only * Added whitespace-pre-wrap to the message property on the run page * use requireUserId * Removed old Plain submit code
* Added a new context page with task context properties * Removed code comments * Added more crosslinks
…lly (#1417) * doc: add instructions to create new reference project and run it locally * doc: Add instruction for running tunnel * minor language improvement
* try to correct resume messages with missing checkpoint * prevent creating checkpoints for outdated task waits * prevent creating checkpoints for outdated batch waits * use heartbeats to check for and clean up any leftover containers * lint * improve exec logging * improve resume attempt logs * fix for resuming parents of canceled child runs * separate SIGTERM from maybe OOM errors * pretty errors can have magic dashboard links * prevent uncancellable checkpoints * simplify task run error code enum export * grab the last, not the first child run * Revert "prevent creating checkpoints for outdated batch waits" This reverts commit f2b5c2a. * Revert "grab the last, not the first child run" This reverts commit 89ec5c8. * Revert "prevent creating checkpoints for outdated task waits" This reverts commit 11066b4. * more logs for resume message handling * add magic error link comment * add changeset
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
🦋 Changeset detectedLatest commit: d7b246e The changes in this PR will be included in the next version bump. This PR includes changesets to release 10 packages
Not sure what this means? Click here to learn what changesets are. Click here if you're a maintainer who wants to add another changeset to this PR |
WalkthroughThe pull request introduces several enhancements across multiple components related to task run management, focusing on improving error handling and run status evaluation. Key changes include the addition of new status checks and error logging mechanisms in the Changes
Suggested reviewers
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 13
🧹 Outside diff range and nitpick comments (14)
apps/webapp/app/routes/api.v1.runs.$runParam.attempts.ts (2)
32-35
: LGTM! Consider destructuring for consistency.The refactoring to use an object parameter for the
CreateTaskRunAttemptService.call()
method improves code structure and aligns with the service updates. This change enhances readability and maintainability.For consistency with the destructuring used in the function parameters, consider applying this minor change:
- const { execution } = await service.call({ - runId: runParam, - authenticatedEnv: authenticationResult.environment, - }); + const { execution } = await service.call({ + runId: runParam, + authenticatedEnv: authenticationResult.environment, + });This small adjustment maintains a consistent style throughout the function.
Line range hint
1-53
: Overall, the changes improve code structure without altering core functionality.The refactoring of the
CreateTaskRunAttemptService.call()
method parameter aligns well with the broader changes in error handling and status evaluation mentioned in the PR objectives. The existing error handling remains robust, ensuring that the function continues to handle invalid API keys and malformed parameters appropriately.As the system evolves, consider implementing a more granular error handling mechanism that can distinguish between different types of internal errors (e.g., OOM errors, segfaults) as mentioned in the PR objectives. This could involve passing more detailed error information from the service layer to the API response, allowing clients to handle specific error cases more effectively.
packages/cli-v3/src/entryPoints/dev-index-worker.ts (1)
103-115
: LGTM: Default retry configuration logic addedThe new logic for applying default retry settings to tasks is well-implemented and aligns with the PR objectives. It correctly handles cases where tasks may not have retry settings defined.
A minor optimization suggestion:
Consider using the nullish coalescing operator (
??
) to simplify the logic:tasks = tasks.map((task) => ({ ...task, retry: task.retry ?? config.retries?.default, }));This approach would eliminate the need for the nested if statement while achieving the same result.
apps/webapp/app/v3/services/createTaskRunAttempt.server.ts (2)
15-25
: Improved method signature with object parameter.The change to use an object parameter enhances flexibility and readability. The addition of the
startAtZero
parameter provides more control over the initial attempt number, which aligns well with the PR objectives for improving error handling and retry mechanisms.Consider adding JSDoc comments to describe the purpose and behavior of the new
startAtZero
parameter for better documentation.
111-115
: Enhanced logic for determiningnextAttemptNumber
.The updated logic for calculating
nextAttemptNumber
now considers thestartAtZero
parameter, providing more flexibility in managing task run attempts. This change aligns well with the PR objectives for improving error handling and retry mechanisms.Consider using a more explicit conditional expression for better readability:
const nextAttemptNumber = taskRun.attempts[0] ? taskRun.attempts[0].number + 1 : (startAtZero ? 0 : 1);This minor change could make the logic more immediately clear to other developers.
apps/webapp/app/presenters/v3/SpanPresenter.server.ts (1)
297-297
: LGTM: Addition ofisError
property enhances error detection.The new
isError
property, utilizingisFailedRunStatus(run.status)
, provides valuable information about the run's error state. This addition aligns well with the PR's goal of improving error detection and handling.For improved clarity and maintainability, consider extracting this logic into a separate method within the class. For example:
private isErrorStatus(status: string): boolean { return isFailedRunStatus(status); } // Then in the return statement: isError: this.isErrorStatus(run.status),This approach would make it easier to modify or extend the error detection logic in the future if needed.
apps/webapp/app/v3/handleSocketIo.server.ts (1)
196-200
: LGTM! Consider adding a comment for clarity.The update to use an object parameter in the
CreateTaskRunAttemptService
call is a good improvement. It aligns with the PR objectives and enhances code flexibility.Consider adding a brief comment explaining why
setToExecuting
is set tofalse
here, as it might not be immediately obvious to future readers. For example:const { attempt } = await service.call({ runId: message.runId, authenticatedEnv: environment, setToExecuting: false, // Execution state will be set later when retrieving the payload });apps/webapp/app/routes/resources.orgs.$organizationSlug.projects.v3.$projectParam.runs.$runParam.spans.$spanParam/route.tsx (2)
860-860
: Improved error state handling in RunTimelineEventThe change enhances the
RunTimelineEvent
component by dynamically setting the state based on whether the run has encountered an error. This improvement provides more accurate visual feedback to users about the run's status.Consider extracting the state logic into a separate function for better readability and maintainability. For example:
const getEventState = (isError: boolean) => isError ? "error" : "complete"; <RunTimelineEvent title="Finished" subtitle={<DateTimeAccurate date={run.updatedAt} />} state={getEventState(run.isError)} />
Line range hint
848-862
: Enhance type safety in RunTimeline componentWhile the changes look good, consider improving type safety in the
RunTimeline
component to reduce potential runtime errors.Consider using TypeScript's non-null assertion operator or optional chaining to handle potential undefined values. For example:
{run.isFinished ? ( <> <RunTimelineLine title={formatDuration(run.startedAt!, run.updatedAt)} state={"complete"} /> <RunTimelineEvent title="Finished" subtitle={<DateTimeAccurate date={run.updatedAt} />} state={run.isError ? "error" : "complete"} /> </> ) : ( <RunTimelineLine title={ <span className="flex items-center gap-1"> <Spinner className="size-4" /> <span> <LiveTimer startTime={run.startedAt!} /> </span> </span> } state={"inprogress"} /> )}This change assumes that
run.startedAt
is always defined whenrun.isFinished
is true. If this assumption is not always true, consider adding additional checks or using optional chaining (run.startedAt?.
).packages/cli-v3/src/entryPoints/deploy-index-worker.ts (1)
110-110
: Re-evaluate the necessity ofsatisfies TaskManifest
assertionsThe
satisfies TaskManifest
assertion is used multiple times when returning modified task objects. If the objects already conform toTaskManifest
, these assertions might be redundant.Consider removing the assertions if they are unnecessary:
- } satisfies TaskManifest; + };Also applies to: 126-126, 140-140
packages/core/src/v3/workers/taskExecutor.ts (2)
Line range hint
351-476
: Avoid variable shadowing of 'delay' within nested scopes.In the block starting at line 456, the variable
delay
is redefined inside a nested scope. This can lead to confusion and potential errors. Consider renaming the innerdelay
variable to improve code clarity.Apply this diff to rename the variable:
if (handleErrorResult.retry && typeof handleErrorResult.retry === "object") { - const delay = calculateNextRetryDelay(handleErrorResult.retry, execution.attempt.number); + const nextDelay = calculateNextRetryDelay(handleErrorResult.retry, execution.attempt.number); return typeof nextDelay === "undefined" ? { status: "noop", error: handleErrorResult.error } : { status: "retry", - retry: { timestamp: Date.now() + delay, delay }, + retry: { timestamp: Date.now() + nextDelay, delay: nextDelay }, error: handleErrorResult.error, }; }
Line range hint
351-476
: Refactor nested ternary operators for better readability.The use of nested ternary operators can reduce code readability and maintainability. Refactoring them into
if-else
statements will enhance clarity.For example, refactor the code starting at line 423:
if (!handleErrorResult) { - return typeof delay === "undefined" - ? { status: "noop" } - : { status: "retry", retry: { timestamp: Date.now() + delay, delay } }; + if (typeof delay === "undefined") { + return { status: "noop" }; + } else { + return { status: "retry", retry: { timestamp: Date.now() + delay, delay } }; + } }Similarly, refactor other nested ternary operations within the
#handleError
method.apps/webapp/app/v3/services/completeAttempt.server.ts (1)
293-294
: Address the TODO comment regarding default crash behaviorThere's a TODO comment indicating uncertainty about defaulting internal errors to "CRASHED" status. Resolving this will clarify the intended error-handling strategy.
internal-packages/database/prisma/schema.prisma (1)
1966-1970
: Align comment formatting for consistency.In the
TaskRunAttemptStatus
enum, consider matching the comment style used in theTaskRunStatus
enum by adding surrounding slashes and blank lines. This ensures consistency and enhances readability.Apply this diff to align the comment formatting:
+ /// /// NON-FINAL + /// PENDING EXECUTING PAUSED + /// /// FINAL + /// FAILED CANCELED COMPLETED
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
⛔ Files ignored due to path filters (1)
pnpm-lock.yaml
is excluded by!**/pnpm-lock.yaml
📒 Files selected for processing (20)
- apps/webapp/app/components/runs/v3/RunInspector.tsx (3 hunks)
- apps/webapp/app/presenters/v3/SpanPresenter.server.ts (2 hunks)
- apps/webapp/app/routes/api.v1.runs.$runParam.attempts.ts (1 hunks)
- apps/webapp/app/routes/resources.orgs.$organizationSlug.projects.v3.$projectParam.runs.$runParam.spans.$spanParam/route.tsx (1 hunks)
- apps/webapp/app/v3/failedTaskRun.server.ts (3 hunks)
- apps/webapp/app/v3/handleSocketIo.server.ts (1 hunks)
- apps/webapp/app/v3/marqs/sharedQueueConsumer.server.ts (1 hunks)
- apps/webapp/app/v3/services/completeAttempt.server.ts (10 hunks)
- apps/webapp/app/v3/services/crashTaskRun.server.ts (4 hunks)
- apps/webapp/app/v3/services/createTaskRunAttempt.server.ts (2 hunks)
- apps/webapp/app/v3/services/finalizeTaskRun.server.ts (3 hunks)
- apps/webapp/app/v3/taskStatus.ts (2 hunks)
- apps/webapp/package.json (2 hunks)
- internal-packages/database/prisma/schema.prisma (3 hunks)
- packages/cli-v3/src/entryPoints/deploy-index-worker.ts (4 hunks)
- packages/cli-v3/src/entryPoints/dev-index-worker.ts (2 hunks)
- packages/core/src/v3/errors.ts (6 hunks)
- packages/core/src/v3/schemas/common.ts (2 hunks)
- packages/core/src/v3/utils/ioSerialization.ts (3 hunks)
- packages/core/src/v3/workers/taskExecutor.ts (1 hunks)
🧰 Additional context used
🪛 Biome
packages/core/src/v3/errors.ts
[error] 155-189: This case is falling through to the next case.
Add a
break
orreturn
statement to the end of this case to prevent fallthrough.(lint/suspicious/noFallthroughSwitchClause)
🔇 Additional comments (52)
packages/cli-v3/src/entryPoints/dev-index-worker.ts (3)
6-6
: LGTM: TaskManifest import addedThe addition of the
TaskManifest
type import is appropriate and aligns with the changes made later in the file for improved type checking of task configurations.
124-124
: LGTM: Type assertion added to maxDuration logicThe addition of
satisfies TaskManifest
to the maxDuration logic enhances type safety and consistency. This change aligns well with the similar modification in the retry configuration logic.
Line range hint
1-170
: Summary: Improvements in task configuration and type safetyThe changes in this file successfully implement the following improvements:
- Addition of default retry configuration for tasks
- Enhanced type checking using the
TaskManifest
type- Consistent use of
satisfies
keyword for type assertionsThese modifications align well with the PR objectives of improving error handling and enhancing the robustness of the task management system. The code is well-structured and maintains good practices for type safety and configuration management.
packages/core/src/v3/schemas/common.ts (1)
92-92
: LGTM: New error codes enhance error handling capabilities.The addition of "TASK_PROCESS_SIGSEGV" and "IMPORT_PAYLOAD_ERROR" to the
TaskRunInternalError
schema aligns well with the PR objectives. These new error codes will improve the system's ability to detect and handle specific error scenarios, particularly segmentation faults and issues with importing large payload packets.Also applies to: 106-106
apps/webapp/app/v3/services/createTaskRunAttempt.server.ts (1)
Line range hint
1-280
: Overall assessment: Changes align well with PR objectives.The modifications to the
CreateTaskRunAttemptService
class, particularly the updatedcall
method, effectively enhance the flexibility and control over task run attempts. These changes are consistent with the PR objectives of improving error handling and retry mechanisms.Key improvements:
- The new object parameter structure in the
call
method enhances code readability and flexibility.- The addition of the
startAtZero
parameter allows for more nuanced control over initial attempt numbering.- The updated logic for
nextAttemptNumber
calculation accommodates various retry scenarios.These enhancements contribute to a more robust and configurable system for managing task run attempts, which should lead to improved error handling and system reliability.
apps/webapp/package.json (1)
167-167
: Approve the addition of semver and its type definitions.The addition of
semver
and@types/semver
is a good practice, especially in a TypeScript project. This suggests an intention to use version comparison functionality, which could be beneficial for error handling or compatibility checks.Could you please clarify the intended use of the
semver
library in the context of this PR's objectives? This will help ensure it aligns with the error handling and retry mechanism improvements mentioned in the PR description.To verify the usage of
semver
in the codebase, you can run the following script:This will help us understand how
semver
is being utilized across the project.Also applies to: 217-217
✅ Verification successful
Verified the usage of
semver
aligns with the PR objectives.
- Uses of
semver
found in:
apps/webapp/app/v3/failedTaskRun.server.ts
packages/cli-v3/src/commands/update.ts
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Search for semver usage in the codebase # Look for import statements or require calls for semver echo "Searching for semver imports or requires:" rg -i '(import.*from.*semver|require.*semver)' --type ts --type js # Look for direct usage of semver functions echo "\nSearching for semver function usage:" rg -i 'semver\.(valid|satisfies|gt|lt|eq)' --type ts --type jsLength of output: 788
apps/webapp/app/presenters/v3/SpanPresenter.server.ts (2)
10-10
: LGTM: Import addition aligns with PR objectives.The addition of
isFailedRunStatus
import from thetaskStatus
module is consistent with the PR's goal of improving error detection and handling. This function will likely be used to determine if a run status indicates a failure, contributing to the unification of logic for handling failed and crashed runs.
Line range hint
1-362
: Summary: Enhancements to SpanPresenter improve error detection capabilities.The changes made to
SpanPresenter.server.ts
effectively contribute to the PR's objective of improving error handling and detection. By introducing theisFailedRunStatus
function and theisError
property, the class now provides more detailed and accessible information about run error states. These modifications align well with the overall goal of unifying logic for handling failed and crashed runs across the system.These enhancements will likely improve the system's ability to detect and respond to errors, potentially leading to more robust error handling and better user feedback. As part of the larger set of changes in this PR, these modifications to
SpanPresenter
play a crucial role in achieving the stated objectives.apps/webapp/app/routes/resources.orgs.$organizationSlug.projects.v3.$projectParam.runs.$runParam.spans.$spanParam/route.tsx (2)
860-860
: Alignment with PR objectives: Enhanced error handling and visual clarityThis change directly contributes to the PR's objectives by improving the visual representation of errors in the run timeline. It enhances the user experience by providing more accurate and clear feedback about the run's status, particularly when errors occur.
Line range hint
1-862
: Summary: Improved error handling in RunTimelineEventThe changes in this file successfully enhance the error handling and visual representation of run statuses in the timeline. The modification aligns well with the PR objectives of improving error handling and visual clarity. While the implementation is sound, consider the suggested minor optimizations for improved code maintainability and type safety.
apps/webapp/app/v3/taskStatus.ts (2)
Line range hint
3-116
: Overall implementation is well-structured and clearThe categorization of task run and attempt statuses using constants and corresponding type definitions enhances code readability and maintainability. The utility functions provided for status checks are consistent and align with the new status groupings.
56-56
:⚠️ Potential issueTypo in type definition of
FATAL_RUN_STATUSES
The type
FATAL_RUN_STATUSES
is incorrectly defined usingFAILED_RUN_STATUSES
instead ofFATAL_RUN_STATUSES
. This could lead to type inconsistencies and unexpected behavior when using this type elsewhere in the codebase.Apply this diff to fix the typo:
-export type FATAL_RUN_STATUSES = (typeof FAILED_RUN_STATUSES)[number]; +export type FATAL_RUN_STATUSES = (typeof FATAL_RUN_STATUSES)[number];Likely invalid or redundant comment.
packages/cli-v3/src/entryPoints/deploy-index-worker.ts (1)
6-6
: ImportingTaskManifest
type ensures type safetyThe addition of
type TaskManifest
to the imports enhances type checking for task-related operations.apps/webapp/app/v3/services/crashTaskRun.server.ts (11)
9-9
: ImportingFailedTaskRunRetryHelper
is NecessaryThe addition of
FailedTaskRunRetryHelper
import is appropriate for implementing the retry mechanism for failed task runs.
39-40
: Enhanced Error Logging for Missing Task RunsGood job adding error logging when the task run is not found. This will aid in debugging issues related to missing task runs.
45-48
: Improved Logging for Non-Crashable Task RunsAdding detailed error logs when a task run is not in a crashable state enhances traceability and debugging capability.
52-52
: Consistent Debug LoggingLogging the commencement of attempt completion helps in tracing the execution flow and is a good practice.
54-67
: Proper Implementation of Retry MechanismThe integration of
FailedTaskRunRetryHelper
to handle retries is well-structured. The error object within thecompletion
parameter is correctly populated with relevant details.
69-74
: Correct Handling of Retry ResultsAppropriately checking if the task run was retried and returning early ensures that the logic flows correctly after a retry.
76-78
: Conditional Return Based onoverrideCompletion
The check for
opts.overrideCompletion
before proceeding further maintains the integrity of the override process.
80-80
: Logging Override of Task Run CompletionLogging when overriding the task run completion provides transparency and aids in debugging.
123-125
: Debug Logging for Crashing In-Progress EventsIncluding the IDs of in-progress events being crashed enhances visibility and facilitates easier tracking of crashed events.
172-177
: Setting Tracing Attributes for Better ObservabilityAnnotating the span with
taskRunId
andattemptId
improves observability in distributed tracing systems.
179-193
: Ensure Proper Sanitization of Error LogsWhen updating
taskRunAttempt
, the error object includesstackTrace: error.logs
. To prevent potential leakage of sensitive information, please ensure thaterror.logs
is appropriately sanitized before being assigned tostackTrace
.If
sanitizeError()
adequately sanitizeserror.logs
, no further action is needed. Otherwise, consider sanitizingerror.logs
prior to passing it tosanitizeError()
.apps/webapp/app/v3/services/finalizeTaskRun.server.ts (4)
6-11
: Imports updated to handle fatal run statusesThe addition of
isFatalRunStatus
andFINAL_ATTEMPT_STATUSES
imports allows the service to properly detect and handle fatal run statuses.
16-16
: ImportingsocketIo
for run cancellation signalingThe import of
socketIo
is necessary for emitting run cancellation requests when a fatal status is detected.
99-133
: Handling fatal run statuses and requesting worker exitThe added code block correctly handles scenarios where the run status is fatal. It logs the error, retrieves the extended run information, and, if appropriate, emits a
REQUEST_RUN_CANCELLATION
event viasocketIo
to signal a worker exit.
156-159
: Verify thatattemptStatus
orerror
is always provided tofinalizeAttempt
The early return ensures that
finalizeAttempt
does not proceed without anattemptStatus
orerror
. Please verify that all calls tofinalizeAttempt
provide at least one of these parameters to prevent unintended skips in attempt finalization.✅ Verification successful
Verified: All calls to
finalizeAttempt
provide eitherattemptStatus
orerror
as required.🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify that all calls to `finalizeAttempt` provide either `attemptStatus` or `error`. # Search for all calls to `finalizeAttempt` and check for provided parameters rg 'finalizeAttempt\(\{[^}]*\}\)' -t ts # Expected: Each call should include `attemptStatus` or `error` in the arguments.Length of output: 160
packages/core/src/v3/utils/ioSerialization.ts (2)
7-8
: Imports are appropriate and correctly includedThe new imports for
zodfetch
andz
are necessary for the updated functionality usingzodfetch
for fetching and Zod for validation.
119-126
: Verify that the retry options align with system requirementsEnsure that the configured retry options in
ioRetryOptions
are appropriate for the expected network conditions and service-level agreements (SLAs). Specifically, confirm that the minimum and maximum timeouts, maximum attempts, backoff factor, and randomization settings are suitable for handling transient network failures without causing excessive delay or load on the system.packages/core/src/v3/workers/taskExecutor.ts (1)
96-114
: Proper error handling for payload import failures.The added try-catch block around
conditionallyImportPacket
correctly captures exceptions and returns a structured error response, enhancing the robustness of the payload import process.apps/webapp/app/v3/services/completeAttempt.server.ts (15)
11-11
: ImportingshouldRetryError
for enhanced error handlingThe addition of
shouldRetryError
improves error evaluation for retry logic.
25-25
: Importing status utilities fromtaskStatus
Including
FAILED_RUN_STATUSES
,isFinalAttemptStatus
, andisFinalRunStatus
enhances the status handling mechanisms.
28-28
: ImportingFailedTaskRunRetryHelper
for retry managementThe introduction of
FailedTaskRunRetryHelper
assists in managing retries for failed task runs effectively.
44-53
: Adding optional parametersisSystemFailure
andisCrash
Incorporating
isSystemFailure
andisCrash
as optional parameters allows for more granular handling of failure scenarios.
119-120
: Passing failure indicators to#completeAttemptFailed
Ensuring
isSystemFailure
andisCrash
are forwarded maintains consistency in failure handling across methods.
181-191
: Updating#completeAttemptFailed
to accept new parametersModifying
#completeAttemptFailed
to includeisSystemFailure
andisCrash
aligns it with the updated error-handling logic.
227-236
: Enhancing retry determination logicUtilizing
FailedTaskRunRetryHelper.getExecutionRetry
as a fallback ensures retries are appropriately managed even whencompletion.retry
is undefined.
238-251
: Implementing conditional retry logicThe conditions for retrying failed attempts are well-defined, preventing unnecessary retries and adhering to maximum attempt limitations.
253-270
: Completing task run events on non-retry failuresProperly finalizing task run events and logging exceptions ensures accurate tracking of failures.
281-298
: Setting task run status based on failure typeThe logic for determining the
status
variable allows for precise categorization of failure types, improving error reporting.
311-358
: Handling in-progress events based on final statusThe switch-case structure effectively manages in-progress events according to whether the status is "CRASHED" or "SYSTEM_FAILURE," ensuring proper cleanup.
364-372
: Updating#enqueueReattempt
parametersAdding
executionRetry
to the method enhances its ability to schedule retries accurately.
424-501
: Creating the#retryAttempt
method for retry logicEncapsulating retry functionality within
#retryAttempt
improves code organization and readability.
503-557
: Refining#retryAttemptWithCheckpoint
Adjustments to accommodate new retry mechanisms while maintaining checkpoint functionality are well-implemented.
592-596
: SelectingsdkVersion
in thefindAttempt
functionIncluding
sdkVersion
prepares for future enhancements that may require SDK version information.apps/webapp/app/components/runs/v3/RunInspector.tsx (3)
43-43
: Importing run status helper functionsGreat addition of
isFailedRunStatus
andisFinalRunStatus
imports to enhance run status evaluation.
482-482
: DefiningisError
improves readabilityIntroducing the
isError
constant simplifies checking whether the run status indicates a failure, enhancing code clarity.
539-539
: Conditional rendering based on error stateUsing
isError
to set thestate
prop ensures that the timeline accurately reflects error statuses by displaying the correct visual state.apps/webapp/app/v3/marqs/sharedQueueConsumer.server.ts (1)
512-515
:⚠️ Potential issuePossible incorrect parameter name in
CreateTaskRunAttemptService.call()
It appears that the
CreateTaskRunAttemptService.call()
method has been updated to accept a parameterstartAtZero
. In this code segment,setToExecuting: false
is passed instead. Please verify ifsetToExecuting
should be replaced withstartAtZero
to match the updated method signature.Apply this diff to update the parameter name:
- await service.call({ - runId: lockedTaskRun.id, - setToExecuting: false, - }); + await service.call({ + runId: lockedTaskRun.id, + startAtZero: false, + });Run the following script to find all usages of
CreateTaskRunAttemptService.call()
and check for consistent parameter names:internal-packages/database/prisma/schema.prisma (2)
1804-1805
: Good use of comments to categorize enum statuses.The addition of
NON-FINAL STATUSES
comments enhances code readability by clearly grouping the non-terminal task statuses.
1828-1831
: Clear separation of final statuses aids understanding.Including
FINAL STATUSES
comments provides a distinct separation between non-final and final statuses, improving maintainability and clarity.
trigger.dev
@trigger.dev/build
@trigger.dev/core
@trigger.dev/sdk
@trigger.dev/react-hooks
commit: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
🧹 Outside diff range and nitpick comments (6)
.changeset/eight-turtles-itch.md (1)
1-9
: Enhance changeset descriptions with more details about impact and benefits.The version bumps are appropriate for these types of changes. However, the change descriptions could be more detailed to better communicate the impact and benefits to users.
Consider expanding the descriptions like this:
- Include retries.default in task retry config when indexing - New helpers for internal error retry mechanics - Detection for segfaults and ffmpeg OOM errors - Retries for packet import and export + Include retries.default in task retry config when indexing, providing more flexible retry strategies + Add new helpers for internal error retry mechanics, improving system reliability + Implement detection for segfaults and ffmpeg OOM errors, preventing system crashes + Add automatic retries for packet import/export operations, enhancing data transfer reliabilityapps/webapp/app/v3/services/crashTaskRun.server.ts (1)
124-125
: Consider using structured logging for event IDs.While the logging is informative, consider structuring the event IDs for better log parsing.
- logger.debug("[CrashTaskRunService] Crashing in-progress events", { - inProgressEvents: inProgressEvents.map((event) => event.id), + logger.debug("[CrashTaskRunService] Crashing in-progress events", { + eventCount: inProgressEvents.length, + eventIds: inProgressEvents.map((event) => event.id), + runId });packages/core/src/v3/apiClient/core.ts (2)
Line range hint
232-251
: Review empty catch block for ValidationError.The empty catch block for
ValidationError
silently ignores validation issues, which could make debugging harder and mask real problems. This doesn't align with the PR's goal of improving error handling.Consider one of these approaches:
if (error instanceof ApiError) { throw error; } if (error instanceof ValidationError) { + // Option 1: Log and rethrow + console.error('Validation error:', error); + throw error; + + // Option 2: Convert to ApiError + throw new ApiSchemaValidationError({ + status: 400, + cause: error, + message: error.message, + headers: {}, + }); }
Line range hint
232-277
: Consider enhancing error detection for specific failure cases.While the changes improve JSON parsing safety, there might be opportunities to further align with the PR objectives by adding specific error detection for ffmpeg-related issues (OOM errors and segfaults) mentioned in the PR description.
Consider:
- Adding specific error detection in the response parsing:
function isFFmpegError(error: unknown): boolean { return error instanceof Error && ( error.message.includes("out of memory") || error.message.includes("segmentation fault") ); }
- Updating the retry logic to handle these cases:
if (error instanceof ApiError) { throw error; } + if (isFFmpegError(error)) { + // Consider different retry strategy for FFmpeg errors + const retryDelay = calculateFFmpegRetryDelay(attempt); + if (retryDelay) { + await waitForRetry(url, attempt + 1, retryDelay, options, requestInit); + return await _doZodFetchWithRetries(schema, url, requestInit, options, attempt + 1); + } + }packages/core/src/v3/workers/taskExecutor.ts (2)
96-114
: LGTM! Consider enhancing error message readability.The error handling implementation for payload import is robust and well-structured. It properly captures and formats errors while maintaining observability through span recording.
Consider using template literals for better readability of the error message:
- inputError instanceof Error - ? `${inputError.name}: ${inputError.message}` - : typeof inputError === "string" - ? inputError - : undefined, + inputError instanceof Error + ? `${inputError.name}: ${inputError.message}` + : typeof inputError === "string" + ? `Unknown error: ${inputError}` + : "Unknown error occurred",
Line range hint
154-170
: Maintain consistent error message formatting across error handlers.The error handling for output stringification is well-implemented, but the error message formatting differs from the payload import error handler. Consider maintaining consistency for better error tracking and debugging.
Apply this change to match the payload import error message format:
- outputError instanceof Error - ? outputError.message - : typeof outputError === "string" - ? outputError - : undefined, + outputError instanceof Error + ? `${outputError.name}: ${outputError.message}` + : typeof outputError === "string" + ? outputError + : undefined,
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (7)
- .changeset/eight-turtles-itch.md (1 hunks)
- apps/webapp/app/v3/failedTaskRun.server.ts (3 hunks)
- apps/webapp/app/v3/services/crashTaskRun.server.ts (4 hunks)
- packages/core/src/v3/apiClient/core.ts (2 hunks)
- packages/core/src/v3/errors.ts (6 hunks)
- packages/core/src/v3/schemas/common.ts (1 hunks)
- packages/core/src/v3/workers/taskExecutor.ts (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- packages/core/src/v3/schemas/common.ts
🧰 Additional context used
📓 Learnings (1)
apps/webapp/app/v3/failedTaskRun.server.ts (1)
Learnt from: nicktrn PR: triggerdotdev/trigger.dev#1424 File: apps/webapp/app/v3/failedTaskRun.server.ts:136-139 Timestamp: 2024-10-22T10:34:42.602Z Learning: In the `apps/webapp` project, it's acceptable to use private class methods with `#` syntax in TypeScript code.
🪛 Biome
packages/core/src/v3/errors.ts
[error] 155-189: This case is falling through to the next case.
Add a
break
orreturn
statement to the end of this case to prevent fallthrough.(lint/suspicious/noFallthroughSwitchClause)
🔇 Additional comments (10)
apps/webapp/app/v3/services/crashTaskRun.server.ts (3)
39-39
: LGTM: Improved error logging with better traceability.The addition of the "[CrashTaskRunService]" prefix and enhanced error details in log messages will make debugging easier.
Also applies to: 45-48
77-82
: LGTM: Clean completion override logic.The completion override logic is well-structured with appropriate guards and logging.
173-196
:⚠️ Potential issueImplement container cleanup after fatal errors.
According to the PR objectives, container cleanup should be performed after fatal errors. Consider adding container cleanup logic in the attempt failure handling.
Let's verify if container cleanup is implemented elsewhere:
If container cleanup is not implemented elsewhere, consider adding it:
async #failAttempt( attempt: TaskRunAttempt, run: TaskRun, failedAt: Date, environment: AuthenticatedEnvironment, error: { reason: string; logs?: string; code?: TaskRunInternalError["code"]; } ) { return await this.traceWithEnv( "[CrashTaskRunService] failAttempt()", environment, async (span) => { span.setAttribute("taskRunId", run.id); span.setAttribute("attemptId", attempt.id); + // Cleanup container resources if available + if (attempt.backgroundWorker?.containerId) { + try { + await this.cleanupContainer(attempt.backgroundWorker.containerId); + } catch (error) { + logger.error("[CrashTaskRunService] Failed to cleanup container", { + containerId: attempt.backgroundWorker.containerId, + error, + }); + } + } await this._prisma.taskRunAttempt.update({ where: { id: attempt.id, }, data: { status: "FAILED", completedAt: failedAt, error: sanitizeError({ type: "INTERNAL_ERROR", code: error.code ?? "TASK_RUN_CRASHED", message: error.reason, stackTrace: error.logs, }), }, }); } ); }apps/webapp/app/v3/failedTaskRun.server.ts (5)
1-30
: LGTM! Well-structured imports and type definitions.The imports are properly organized, and the type definitions are clear and well-documented. The Prisma include configuration for attempts is properly structured with clear ordering.
Line range hint
32-74
: LGTM! Robust error handling with proper logging.The service effectively handles failed task runs with comprehensive error checking and proper logging. The delegation of retry logic to
FailedTaskRunRetryHelper
promotes good separation of concerns.
81-134
: LGTM! Well-structured retry handling with proper error management.The
call
method effectively orchestrates the retry process with appropriate error handling and logging.
190-280
:⚠️ Potential issuePotential issue with delay validation.
The condition
if (!delay)
on line 259 could incorrectly handle a delay of 0 (immediate retry). Consider using explicit null check.Apply this diff:
- if (!delay) { + if (delay === undefined) {Likely invalid or redundant comment.
136-188
: Consider adding error recovery mechanism for attempt creation failures.While the error handling is present, the method silently returns undefined on attempt creation failure. Consider implementing a fallback mechanism or retrying the attempt creation.
packages/core/src/v3/errors.ts (2)
449-465
: Previous regex refactor suggestion still applies
182-184
: Address TODO comment regarding OOM error handlingThe TODO comment raises an important concern about retrying tasks that exit with non-zero codes, as these could be OOM errors. Retrying OOM errors without addressing the underlying memory issues could lead to resource exhaustion.
Let's analyze the error patterns to help make an informed decision:
Would you like me to help implement memory usage detection logic to better distinguish between OOM and other non-zero exit scenarios?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (3)
apps/webapp/app/v3/failedTaskRun.server.ts (1)
Line range hint
38-53
: Consider adding type narrowing for the taskRun lookup.While the error handling is good, we could improve type safety by adding a type guard after the findFirst operation.
const taskRun = await this._prisma.taskRun.findFirst({ where: { friendlyId: isFriendlyId ? anyRunId : undefined, id: !isFriendlyId ? anyRunId : undefined, }, + // Add select to specify exactly what we need + select: { + id: true, + status: true, + }, });packages/core/src/v3/errors.ts (1)
Line range hint
413-446
: Consider standardizing error message structure.For consistency with other error messages, consider moving the conditional phrase to the end:
- "Your task crashed with a segmentation fault (SIGSEGV). Most likely there's a bug in a package or binary you're using. If this keeps happening and you're unsure why, please get in touch.", + "Your task crashed with a segmentation fault (SIGSEGV). Most likely there's a bug in a package or binary you're using. Please get in touch if this keeps happening and you're unsure why.",internal-packages/database/prisma/schema.prisma (1)
1967-1971
: Consider enhancing enum documentation for consistency.The status categorization is good, but consider adding descriptive comments for each status similar to TaskRunStatus enum for better consistency and clarity.
Add descriptive comments for each status:
/// NON-FINAL +/// Task is waiting to be executed PENDING +/// Task is currently being executed EXECUTING +/// Task has been paused and can be resumed PAUSED /// FINAL +/// Task has failed to complete due to an error FAILED +/// Task has been canceled by the user CANCELED +/// Task has completed successfully COMPLETED
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (3)
- apps/webapp/app/v3/failedTaskRun.server.ts (3 hunks)
- internal-packages/database/prisma/schema.prisma (3 hunks)
- packages/core/src/v3/errors.ts (6 hunks)
🧰 Additional context used
📓 Learnings (1)
apps/webapp/app/v3/failedTaskRun.server.ts (1)
Learnt from: nicktrn PR: triggerdotdev/trigger.dev#1424 File: apps/webapp/app/v3/failedTaskRun.server.ts:136-139 Timestamp: 2024-10-22T10:34:42.602Z Learning: In the `apps/webapp` project, it's acceptable to use private class methods with `#` syntax in TypeScript code.
🪛 Biome
packages/core/src/v3/errors.ts
[error] 155-188: This case is falling through to the next case.
Add a
break
orreturn
statement to the end of this case to prevent fallthrough.(lint/suspicious/noFallthroughSwitchClause)
🔇 Additional comments (9)
apps/webapp/app/v3/failedTaskRun.server.ts (5)
1-30
: LGTM! Well-structured imports and type definitions.The imports are properly scoped, and the type definitions are clear and well-organized. The
includeAttempts
constant provides a clean structure for retrieving attempts with proper ordering.
76-79
: Referring to previous review comment about explicit interfaces.
81-134
: LGTM! Well-structured retry handling implementation.The class provides a clean implementation of retry logic with proper error handling and logging.
257-266
:⚠️ Potential issueImprove delay validation to handle zero delays correctly.
The current condition
if (!delay)
will treat a delay of 0 (immediate retry) as if there were no more retries.- if (!delay) { + if (delay === undefined || delay === null) {Likely invalid or redundant comment.
283-284
:⚠️ Potential issueUpdate the TODO comment with the correct version.
The TODO comment indicates that the version needs to be updated. This is important as it affects the retry configuration behavior.
packages/core/src/v3/errors.ts (3)
153-202
: LGTM! Well-structured error retry logic.The implementation provides a clear and exhaustive categorization of errors into retryable and non-retryable cases. The use of
assertExhaustive
ensures type safety by catching any missing error codes at compile time.🧰 Tools
🪛 Biome
[error] 155-188: This case is falling through to the next case.
Add a
break
orreturn
statement to the end of this case to prevent fallthrough.(lint/suspicious/noFallthroughSwitchClause)
448-464
: Previous regex suggestion is still applicable.
Line range hint
471-595
: Previous suggestion to extract common signal handling logic is still applicable.internal-packages/database/prisma/schema.prisma (1)
Line range hint
1805-1832
: Well-structured enum with clear status categorization!The separation of statuses into "NON-FINAL" and "FINAL" categories with clear documentation improves code maintainability. The new statuses (RETRYING_AFTER_FAILURE, CRASHED, etc.) effectively support the enhanced error handling and retry mechanisms.
Other changes:
Things to test:
Summary by CodeRabbit
Summary by CodeRabbit
New Features
Bug Fixes
Documentation
Style
Tests