Fix several restore and resume bugs #1418
🦋 Changeset detected
Latest commit: cdbf5c6
The changes in this PR will be included in the next version bump. This PR includes changesets to release 9 packages.
Caution: Review failed. The pull request is closed.
Walkthrough
The pull request introduces several modifications across multiple files, primarily enhancing error handling, logging, and cleanup processes. Key changes include updates to the
Actionable comments posted: 6
🧹 Outside diff range and nitpick comments (6)
apps/webapp/app/v3/services/restoreCheckpoint.server.ts (1)
116-134: Approve with suggestions for minor improvements
The new `getLastCheckpointEventIfUnrestored` method is a valuable addition that aligns with the PR objectives. It provides a way to retrieve the last checkpoint event, which can be useful for addressing restore and resume bugs. The implementation is concise and uses the Prisma ORM consistently with the rest of the class.
Here are a few suggestions to further improve the method:
- Consider adding error handling for the database query. This will make the method more robust and easier to debug.
- The method name might be slightly misleading as it doesn't explicitly check if the checkpoint is unrestored. Consider either renaming the method to something like `getLastCheckpointEvent` or adding a comment to explain its behavior.
- Add a return type annotation to improve code clarity. For example:
    async getLastCheckpointEventIfUnrestored(runId: string): Promise<CheckpointRestoreEvent | undefined> {
      // ... existing implementation ...
    }
With these minor improvements, the method will be even more robust and maintainable.
apps/webapp/app/v3/services/resumeAttempt.server.ts (3)
22-22: Excellent logging improvements with a minor suggestion
The changes significantly enhance the logging mechanism:
- Consistent use of the encapsulated `_logger`.
- Creation of a child logger with rich context (attemptId, attemptFriendlyId, taskRun).
These improvements will greatly aid in debugging and tracing issues.
Consider moving the child logger creation to the beginning of the method, right after the attempt is found. This would ensure all logs within the method have the additional context:
    if (!attempt) {
      this._logger.error("Could not find attempt", params);
      return;
    }
    this._logger = this._logger.child({
      attemptId: attempt.id,
      attemptFriendlyId: attempt.friendlyId,
      taskRun: attempt.taskRun,
    });
    // Rest of the method...
Also applies to: 81-81, 85-90
164-164: Great logging enhancements with a suggestion
The changes continue to improve the logging mechanism:
- Consistent use of `this._logger` for error logging.
- Creation of a child logger with additional context for each completed attempt.
These improvements will significantly aid in tracing and debugging issues related to individual completed attempts.
Consider renaming the `logger` variable to avoid shadowing the class property:
    const completedAttemptLogger = this._logger.child({
      completedAttemptId: completedAttempt.id,
      completedAttemptFriendlyId: completedAttempt.friendlyId,
      completedRunId: completedAttempt.taskRunId,
    });
This makes it clearer that we're creating a new logger instance for the completed attempt.
Also applies to: 187-187, 192-197
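Both logging suggestions above depend on `child()` returning a new logger that carries extra context and leaves the parent untouched. The following is a minimal sketch of that behavior, assuming a hypothetical `ContextLogger` class (not the project's actual logger), a made-up attempt ID, and made-up log messages:

```typescript
// Hypothetical stand-in for the project's logger; only the child() idea matters here.
type LogContext = Record<string, unknown>;

class ContextLogger {
  constructor(
    private readonly context: LogContext = {},
    private readonly sink: (line: string) => void = console.log
  ) {}

  // Returns a new logger whose context is the parent's context merged with `extra`.
  child(extra: LogContext): ContextLogger {
    return new ContextLogger({ ...this.context, ...extra }, this.sink);
  }

  error(message: string, fields: LogContext = {}): void {
    this.sink(JSON.stringify({ level: "error", message, ...this.context, ...fields }));
  }
}

// Collect output in memory so the effect of child() is easy to inspect.
const lines: string[] = [];
const baseLogger = new ContextLogger({ service: "ResumeAttemptService" }, (line) => {
  lines.push(line);
});

// A distinct variable name avoids shadowing the class-level logger.
const completedAttemptLogger = baseLogger.child({
  completedAttemptId: "attempt_123", // made-up ID for illustration
});

completedAttemptLogger.error("Completed attempt has no output"); // made-up message
baseLogger.error("Could not find attempt");
```

In this sketch the first line carries both `service` and `completedAttemptId`, while the second carries only `service`, because the parent logger is never mutated.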
Line range hint 1-255: Suggestions for further improvements
The changes in this file significantly improve the logging and error handling. To further enhance the code quality and performance, consider the following suggestions:
- Unit Tests: Given the complexity of the logic in this file, especially around resuming attempts and handling dependencies, it would be beneficial to add or expand unit tests. This will help ensure the correctness of the logic and make future refactoring easier.
- Performance Optimization: The file contains several database operations, some of which are within loops. Consider reviewing these operations for potential performance optimizations. For example:
  - Can any of the database queries be combined or optimized?
  - Is there an opportunity to use batch operations for multiple updates?
  - Could any of the database operations benefit from caching frequently accessed data?
- Error Handling: While the error logging has been improved, consider adding more specific error types or error codes. This could help in better categorizing and handling different types of errors that may occur during the resume process.
Would you like assistance in identifying specific areas for additional unit tests or performance optimizations?
apps/coordinator/src/exec.ts (1)
67-78: Improved metadata logging enhances debugging capabilities.
The addition of `globalOpts`, `localOpts`, and explicit output properties in the metadata object significantly improves the logging capabilities of the `x` method. This change provides more context for debugging and monitoring command executions.
Consider adding a `timestamp` field to the metadata object to further enhance debugging capabilities. This can be useful for tracking execution times and identifying potential performance issues.
      metadata = {
        // ... existing fields ...
    +   timestamp: new Date().toISOString(),
      };
apps/webapp/app/v3/handleSocketIo.server.ts (1)
198-201: LGTM! Consider destructuring for consistency.
The changes to `getExecutionPayloadFromAttempt` look good. The new object parameter style improves readability and flexibility. This aligns well with the PR's objectives of addressing restore and resume bugs.
For consistency with the surrounding code style, consider using object destructuring:
    const payload = await sharedQueueTasks.getExecutionPayloadFromAttempt({
      id: attempt.id,
      setToExecuting: true,
    });
This minor change would align with modern JavaScript/TypeScript practices and maintain consistency with other parts of the codebase.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (11)
- apps/coordinator/src/checkpointer.ts (1 hunks)
- apps/coordinator/src/exec.ts (1 hunks)
- apps/webapp/app/routes/resources.orgs.$organizationSlug.projects.v3.$projectParam.runs.$runParam.spans.$spanParam/route.tsx (3 hunks)
- apps/webapp/app/v3/environmentVariables/environmentVariablesRepository.server.ts (1 hunks)
- apps/webapp/app/v3/handleSocketIo.server.ts (1 hunks)
- apps/webapp/app/v3/marqs/sharedQueueConsumer.server.ts (13 hunks)
- apps/webapp/app/v3/requeueTaskRun.server.ts (2 hunks)
- apps/webapp/app/v3/services/restoreCheckpoint.server.ts (1 hunks)
- apps/webapp/app/v3/services/resumeAttempt.server.ts (6 hunks)
- packages/core/src/v3/errors.ts (5 hunks)
- packages/core/src/v3/schemas/common.ts (2 hunks)
✅ Files skipped from review due to trivial changes (1)
- apps/webapp/app/v3/environmentVariables/environmentVariablesRepository.server.ts
🧰 Additional context used
🔇 Additional comments (21)
apps/webapp/app/v3/services/resumeAttempt.server.ts (3)
11-11: LGTM: Improved modularity and encapsulation
The changes here improve the code structure:
- Importing `sharedQueueTasks` enhances modularity.
- Introducing a private `_logger` property allows for better encapsulation of logging behavior.
These modifications align with good software engineering practices.
Also applies to: 17-18
92-92: LGTM: Consistent logger usage
These changes consistently apply the use of the encapsulated `_logger` throughout the method. This ensures that all log messages benefit from the additional context provided by the child logger, improving traceability and debugging capabilities.
Also applies to: 100-100, 117-117, 123-123, 137-137, 143-143
203-203: Good improvements with a request for clarification
The changes enhance the code in several ways:
- Consistent use of the context-rich logger for error logging.
- Updated method call to `getExecutionPayloadFromAttempt` with more explicit parameters.
Could you please clarify the implications of the `skipStatusChecks: true` parameter? While the comment suggests this is an optimization, it would be helpful to understand:
- What status checks are being skipped?
- Are there any potential risks associated with skipping these checks?
- Is there a way to verify that skipping these checks is always safe in this context?
Consider adding a more detailed comment explaining the rationale behind this optimization.
Also applies to: 210-213, 216-216
packages/core/src/v3/schemas/common.ts (2)
92-92: Approved: New error codes enhance error handling capabilities
The addition of `"TASK_PROCESS_SIGTERM"` and `"TASK_RUN_HEARTBEAT_TIMEOUT"` to the `TaskRunInternalError` enum aligns well with the PR objectives. These new error codes will improve the system's ability to detect and handle specific error scenarios, such as task exits due to SIGTERM signals and potential timeout issues that could lead to "frozen" states.
Also applies to: 98-98
112-113: Approved: Improved error code management
The redefinition of `TaskRunErrorCodes` to derive its values directly from `TaskRunInternalError.shape.code.enum` is a smart refactoring. This change ensures that `TaskRunErrorCodes` always stays in sync with the error codes defined in `TaskRunInternalError`, reducing the risk of inconsistencies and simplifying future maintenance.
apps/webapp/app/v3/handleSocketIo.server.ts (1)
Line range hint 1-424: Changes align well with PR objectives
The modification to the `CREATE_TASK_RUN_ATTEMPT` handler in this file is a targeted change that aligns well with the PR's objective of addressing restore and resume bugs. The update to `getExecutionPayloadFromAttempt` appears to be part of a broader effort to improve the handling of task run attempts, which could contribute to resolving issues with runs becoming stuck or failing to resume properly.
The localized nature of the change minimizes the risk of unintended side effects while potentially improving the system's ability to manage task run attempts effectively. This update seems to be a positive step towards achieving the goals outlined in the PR summary.
apps/webapp/app/v3/requeueTaskRun.server.ts (2)
9-9: Import of 'socketIo' added correctly
The import statement for `socketIo` is properly added, enabling socket communication within this file.
107-107: Review the logic for 'delayInMs' computation
The expression for `delayInMs`:
    delayInMs: taskRun.lockedToVersion?.supportsLazyAttempts ? 5_000 : undefined,
implies that if `supportsLazyAttempts` is `true`, `delayInMs` is `5000`, otherwise it's `undefined`. Confirm that this aligns with the intended behavior, especially if `supportsLazyAttempts` can be `undefined`.
If the intention is to have no delay when `supportsLazyAttempts` is `false` or `undefined`, this logic is correct. Otherwise, you might consider setting a default value.
packages/core/src/v3/errors.ts (1)
389-396: Consistent use of optional chaining for `error.message`
In this block, you've correctly used optional chaining with `error.message?.includes("SIGTERM")`. This ensures that if `error.message` is `undefined`, it won't cause a runtime error.
apps/coordinator/src/checkpointer.ts (1)
439-442: Prevent unintended removal of controllers in `cleanup`
By checking that `this.#abortControllers.get(runId)` equals the current `controller` before deleting, you ensure that only the controller associated with the current operation is removed. This prevents accidentally deleting controllers that may still be in use by other processes.
apps/webapp/app/v3/marqs/sharedQueueConsumer.server.ts (11)
24-24: Usage of `$replica` for read operations
Introducing `$replica` for read operations in the `#heartbeat` method is appropriate to reduce the load on the primary database. Make sure that any potential for stale data does not negatively impact the application's behavior.
46-51: Imported final status constants and utility functions correctly
The constants `FINAL_ATTEMPT_STATUSES`, `FINAL_RUN_STATUSES`, and the functions `isFinalAttemptStatus`, `isFinalRunStatus` are correctly imported from `../taskStatus` and are used appropriately in status checks throughout the code.
628-630: Proper status check using `notIn` with `FINAL_RUN_STATUSES`
The query correctly filters out runs with final statuses using `status: { notIn: FINAL_RUN_STATUSES }`, ensuring only active runs are processed.
644-650: Verify resuming runs not in 'EXECUTING' status
The code logs a warning when `resumableRun.status` is not `'EXECUTING'` but proceeds to attempt a resume. Ensure that resuming runs in other statuses won't lead to unexpected behavior or conflicts.
Consider verifying if additional statuses should be handled differently before attempting to resume.
Line range hint 737-870: Enhanced error handling and retry logic during resumption
The updates improve error handling by logging detailed warnings and attempting to restore checkpoints when resumption fails. Ensure that the retry mechanism does not result in infinite loops in cases of persistent failures.
Consider implementing a retry limit to prevent potential infinite loops due to continuous failures.
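One way to read the retry-limit suggestion is as a bounded loop that gives up after a fixed number of attempts. The sketch below assumes a hypothetical `retryWithLimit` helper that does not exist in the PR, and it is synchronous for brevity, whereas a real resume path would most likely be async:

```typescript
// Hypothetical helper: retries `operation` until it succeeds or the attempt budget is spent.
// Synchronous for brevity; the actual resume code is not reproduced here.
function retryWithLimit<T>(operation: (attempt: number) => T, maxAttempts: number): T {
  if (maxAttempts < 1) {
    throw new Error("maxAttempts must be at least 1");
  }

  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return operation(attempt);
    } catch (error) {
      // Remember the failure and fall through to the next attempt.
      lastError = error;
    }
  }

  // The budget is exhausted, so surface the last failure instead of looping forever.
  throw new Error(`Gave up after ${maxAttempts} attempts: ${String(lastError)}`);
}

// Usage: a simulated resume that fails twice before succeeding.
let resumeCalls = 0;
const resumeResult = retryWithLimit((attempt) => {
  resumeCalls++;
  if (attempt < 3) {
    throw new Error("resume failed"); // simulated transient failure
  }
  return "resumed";
}, 5);
```

A persistent failure ends with a thrown error after exactly `maxAttempts` calls, which is the property the review comment asks for.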
995-995: Correct usage of `FINAL_ATTEMPT_STATUSES` in status check
The condition `status: { in: FINAL_ATTEMPT_STATUSES }` accurately checks for attempts with final statuses, ensuring only completed attempts are processed further.
1041-1051: Updated method signature with named parameters
The `getExecutionPayloadFromAttempt` method now accepts an object with named parameters, including the new `skipStatusChecks` option. This enhances the method's flexibility and readability.
1084-1106: Conditional status checks based on `skipStatusChecks`
Introducing the `skipStatusChecks` flag allows conditional execution of status validations. The status checks are properly scoped within the `if (!skipStatusChecks)` block, and the switch cases are correctly structured.
1257-1261: Adjusted method call to align with updated signature
The call to `getExecutionPayloadFromAttempt` now uses named parameters, aligning with the updated method signature. This change improves code clarity and reduces the risk of passing incorrect arguments.
1336-1342: Refactored heartbeat logic into a private method `#heartbeat`
Consolidating the heartbeat functionality into the `#heartbeat` private method reduces code duplication and simplifies maintenance. The methods `taskHeartbeat` and `taskRunHeartbeat` appropriately delegate to this new method.
1353-1411: Efficient heartbeat handling with read replica and run cancellation
The `#heartbeat` method efficiently uses `$replica` to reduce load on the primary database. It correctly handles final run statuses by emitting a `REQUEST_RUN_CANCELLATION` event, ensuring that leftover processes are terminated appropriately.
* try to correct resume messages with missing checkpoint
* prevent creating checkpoints for outdated task waits
* prevent creating checkpoints for outdated batch waits
* use heartbeats to check for and clean up any leftover containers
* lint
* improve exec logging
* improve resume attempt logs
* fix for resuming parents of canceled child runs
* separate SIGTERM from maybe OOM errors
* pretty errors can have magic dashboard links
* prevent uncancellable checkpoints
* simplify task run error code enum export
* grab the last, not the first child run
* Revert "prevent creating checkpoints for outdated batch waits" This reverts commit f2b5c2a.
* Revert "grab the last, not the first child run" This reverts commit 89ec5c8.
* Revert "prevent creating checkpoints for outdated task waits" This reverts commit 11066b4.
* more logs for resume message handling
* add magic error link comment
* add changeset
Covers at least the following scenarios:
(batch)triggerAndWait()
Also adds detection of task exits due to SIGTERM and a helpful error message.
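As a rough illustration of the SIGTERM detection mentioned above, the sketch below checks an error message with optional chaining and maps a match to the `TASK_PROCESS_SIGTERM` code named in the review. The `ProcessErrorLike` shape, the `classifyExit` function, and the `UNCLASSIFIED` fallback are hypothetical and are not taken from `packages/core/src/v3/errors.ts`:

```typescript
// Hypothetical error shape; the point is only that `message` may be missing.
interface ProcessErrorLike {
  code?: string;
  message?: string;
}

function exitedDueToSigterm(error: ProcessErrorLike): boolean {
  // Optional chaining yields undefined when `message` is absent, so this never throws.
  return error.message?.includes("SIGTERM") ?? false;
}

// "UNCLASSIFIED" is a placeholder for whatever the real code falls back to.
function classifyExit(error: ProcessErrorLike): "TASK_PROCESS_SIGTERM" | "UNCLASSIFIED" {
  return exitedDueToSigterm(error) ? "TASK_PROCESS_SIGTERM" : "UNCLASSIFIED";
}

const sigtermExit = classifyExit({ message: "Process exited with signal SIGTERM" });
const unknownExit = classifyExit({});
```

Keeping the check separate from the classification makes it straightforward to distinguish SIGTERM exits from other causes, such as suspected OOM kills, without changing the detection logic.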
Summary by CodeRabbit
Release Notes
New Features
- `RunError` component with the addition of a `Feedback` option.
Bug Fixes
- `Checkpointer` class for better robustness.
Chores
- `ResumeAttemptService` for better encapsulation.
- `EnvironmentVariablesRepository` for correct scoping.