Structured events in place of comment strings #3771
LGTM, although I have added a couple of minor comments about impl/structure of the event struct.
Co-authored-by: Ross Jones <[email protected]>
This PR implements the structure proposed in [Improve Error Reporting](https://www.notion.so/expanso/Improve-Error-Reporting-c19f5516822b47de980d76ff43ff4bbe) as a first step towards providing richer progress reporting during job execution.

The "tl;dr;" is that we will move to using an event stream for reporting progress on jobs. The event stream will help users understand the progress of their job and give them extra context about any failures that occur. This will allow us to show a richer view in the UI, e.g. the user will be able to see "downloading Docker image" instead of just "job running".

To achieve this vision, we need to build this infrastructure for generating events, recording them in the job history, and displaying them (done), replace the orchestrator/compute callbacks mechanism (later PR), and then give lower level components the ability to push events (later PR).

This PR also includes some facility for structured error reporting. This allows low-level components to throw structured errors that provide a richer event than the ones generated automatically. This is used in e.g. the ErrNotEnoughNodes case and docker ImageUnavailable case so far.

This gives us the ability to output hints as part of our messages back to the user:

[Image: carbon]

The output of `describe` now shows a split history between the overall job and its executions:

```
% ./bin/darwin/arm64/bacalhau job describe j-66081fef-8dd2-48de-9997-bbe23a62f0be
ID            = j-66081fef-8dd2-48de-9997-bbe23a62f0be
Name          = Docker Job
Namespace     = default
Type          = batch
State         = Completed
Count         = 1
Created Time  = 2024-04-10 06:56:26
Modified Time = 2024-04-10 06:56:29
Version       = 0

Summary
Completed = 1

Job History
 TIME       REV.  STATE      TOPIC       EVENT          DETAILS
 0s         1     Pending    Submission  Job submitted
 2.618376s  2     Running
 2.840423s  3     Completed

Executions
 ID          NODE ID     STATE      DESIRED  REV.  CREATED  MODIFIED  COMMENT
 e-5886e01f  n-ffc3e455  Completed  Stopped  6     8s ago   5s ago    Accepted job

Execution e-5886e01f History
 TIME       REV.  STATE              TOPIC            EVENT         DETAILS
 0s         1     New
 8.165ms    2     AskForBid
 2.569966s  3     AskForBidAccepted  Requesting node  Accepted job  FailsExecution: false
                                                                    IsError: false
                                                                    Retryable: false
 2.590902s  4     AskForBidAccepted
 2.613668s  5     BidAccepted
 2.803923s  6     Completed

Standard Output
15
```

Resolves bacalhau-project/expanso-planning#693.
Resolves bacalhau-project/expanso-planning#694.

### TODO in this PR

- [x] Add more documentation
- [x] Sort execution histories by time DESC so that most relevant execution is first
- [x] Do some more examples of using structured errors from compute node components

---------

Co-authored-by: Ross Jones <[email protected]>
Co-authored-by: Walid Baruni <[email protected]>