Use async call for disk.dump #2571
Conversation
Until now, we only had a single Salt async job per step (one to prepare, one to finalize), so we knew which job was running simply by knowing which step we were in. This is going to change: we want to call `disk.dump` asynchronously, which means we will have two async jobs for the prepare step. In order to tell which is which, we now carry the job name alongside its ID and bundle the two into a JobHandle. This commit only introduces the JobHandle and uses it instead of the JobID; making `disk.dump` async will be done in another commit. Refs: #2493 Signed-off-by: Sylvain Laperche <[email protected]>
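For illustration, here is a minimal sketch of what a handle bundling a job name and a Salt job ID (JID) could look like; the field names, separator, and error message are assumptions, not the actual MetalK8s code.

```go
package salt

import (
	"fmt"
	"strings"
)

// JobHandle identifies an async Salt job by a logical name plus the Salt
// job ID (JID), so callers can tell which of several in-flight jobs a
// given ID belongs to.
type JobHandle struct {
	Name string // e.g. "PrepareVolume" or "disk.dump" (illustrative)
	ID   string // JID returned by the async Salt API call
}

// String serializes the handle into a single string, suitable for storage
// in the Volume CR's status.
func (h JobHandle) String() string {
	return fmt.Sprintf("%s/%s", h.Name, h.ID)
}

// JobFromString parses a handle previously produced by String.
func JobFromString(s string) (JobHandle, error) {
	parts := strings.SplitN(s, "/", 2)
	if len(parts) != 2 {
		return JobHandle{}, fmt.Errorf("invalid job handle: %q", s)
	}
	return JobHandle{Name: parts[0], ID: parts[1]}, nil
}
```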
This makes it possible to act on the result of the job. It is unused for now, but we will exploit it when `disk.dump` is made async. Refs: #2493 Signed-off-by: Sylvain Laperche <[email protected]>
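Carrying the name means that, once a job completes, the operator can branch on it to decide how to process the result. A hypothetical continuation of the sketch above (the helper names are assumptions):

```go
// Hypothetical helpers; in the operator these would update the Volume CR.
func finalizePreparation(result map[string]interface{}) error { return nil }
func storeVolumeSize(result map[string]interface{}) error     { return nil }

// handleJobResult dispatches on the job name carried by the handle.
func handleJobResult(h JobHandle, result map[string]interface{}) error {
	switch h.Name {
	case "PrepareVolume":
		return finalizePreparation(result)
	case "disk.dump":
		return storeVolumeSize(result)
	default:
		return fmt.Errorf("unexpected job name: %q", h.Name)
	}
}
```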
The Salt client code already handles the retcode; the caller is only interested in the returned payload. Refs: #2493 Signed-off-by: Sylvain Laperche <[email protected]>
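A rough sketch of that contract, continuing the sketch above with an illustrative return type (the real client's types differ): the client inspects the retcode itself and hands only the payload back.

```go
// saltReturn models the relevant parts of a Salt response; the field
// names are illustrative.
type saltReturn struct {
	Retcode int
	Payload map[string]interface{}
}

// extractPayload checks the retcode on behalf of callers, so they only
// ever see the payload (or an error).
func extractPayload(ret saltReturn) (map[string]interface{}, error) {
	if ret.Retcode != 0 {
		return nil, fmt.Errorf("Salt command failed with retcode %d", ret.Retcode)
	}
	return ret.Payload, nil
}
```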
The sync call to `disk.dump` was flaky for two reasons:
- we had a 1s timeout but no logic to retry on a timeout error, so volume creation could fail on a "slow" platform;
- the Salt API doesn't handle errors triggered by concurrent execution of Salt states well when the call is synchronous (an error 500 with no details is returned), so we couldn't retry.
For these reasons, we now use an async call to retrieve the volume size. Closes: #2493 Signed-off-by: Sylvain Laperche <[email protected]>
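To make the async flow concrete, here is a sketch of submitting `disk.dump` asynchronously and polling for its result. The `SaltClient` interface and its method signatures are assumptions for illustration, not the operator's actual client API.

```go
package sketch

import (
	"context"
	"time"
)

// SaltClient is a hypothetical client; the point is the submit-then-poll shape.
type SaltClient interface {
	// SubmitJob starts an async Salt job and returns its JID immediately,
	// so a concurrent-state failure cannot break the HTTP call itself.
	SubmitJob(ctx context.Context, fun, node string) (jid string, err error)
	// PollJob reports whether the job has finished and, if so, its result.
	PollJob(ctx context.Context, jid string) (result map[string]interface{}, done bool, err error)
}

// getVolumeSize retrieves the disk.dump output without a hard timeout:
// submit once, then poll until the job completes or the context is cancelled.
func getVolumeSize(ctx context.Context, c SaltClient, node string) (map[string]interface{}, error) {
	jid, err := c.SubmitJob(ctx, "disk.dump", node)
	if err != nil {
		return nil, err
	}
	for {
		result, done, err := c.PollJob(ctx, jid)
		if err != nil {
			return nil, err
		}
		if done {
			return result, nil
		}
		select {
		case <-ctx.Done():
			return nil, ctx.Err()
		case <-time.After(time.Second): // poll again instead of timing out
		}
	}
}
```

In the operator itself, the polling is more likely driven by reconcile requeues than by a blocking loop, but the shape is the same.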
Hello slaperche-scality, my role is to assist you with the merge of this pull request. Status report is not available.
Branches have diverged: this pull request's source branch has diverged from its destination branch. To avoid any integration risks, please re-synchronize them.
Note: if you choose to rebase, you may have to ask me to rebuild the integration branches.
Integration data created: I have created the integration data for the additional destination branches.
The following branches will NOT be impacted:
Waiting for approval: the following approvals are needed before I can proceed with the merge:
Peer approvals must include at least 1 approval from the following list:
LGTM but would like more review.
I'm a bit curious: do we still need a special `JOB_DONE_MARKER` JID value? Wouldn't adding an extra field about the job status to the `JobHandle` struct make things simpler?
Also, do we have "migration" tests? For instance, what happens if the Operator is updated and picks up an old Volume with an active Job in the old format? Does your `salt.JobFromString` method handle old-style JIDs?
As we need to serialize it as a string in the end (to store it in the CR's status), we wouldn't gain much. I think the current solution is OK, albeit hackish, since I'm piggybacking extra information into the string field. But if we really had more needs, I would start over and use a structured field.
Currently no. If we really want to support it, it may be doable, but it will make the code a bit more complex (because you need to find out, somehow, whether the current job uses the old format).
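For readers following the thread, here is a minimal illustration of the sentinel approach under discussion, building on the `JobHandle` sketch above; the marker value and method name are hypothetical.

```go
// jobDoneMarker is a hypothetical reserved value stored in the JID slot
// of the serialized handle, meaning "job finished, result consumed".
const jobDoneMarker = "done"

// IsDone reports whether the handle refers to a completed job.
func (h JobHandle) IsDone() bool {
	return h.ID == jobDoneMarker
}
```

The trade-off discussed above: this keeps the CR status as a single string field, at the cost of overloading the JID with a special value.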
OK, thanks for the explanation. Indeed, it's easier to keep the current approach for now. In any case, the approach for polling Salt jobs may change in the future depending on how we make it evolve.
And for migration, I guess it's OK as long as we mention it in the release notes.
/approve
In the queue: the changeset has received all authorizations and has been added to the merge queue. The changeset will be merged in:
The following branches will NOT be impacted:
There is no action required on your side. You will be notified here once the changeset has been merged. IMPORTANT: please do not attempt to modify this pull request.
If you need this pull request to be removed from the queue, please contact a repository administrator. The following options are set: approve
I have successfully merged the changeset of this pull request.
The following branches have NOT changed:
Please check the status of the associated issue: None. Goodbye slaperche-scality.
Component: operator
Context:
Sometimes the call to `disk.dump` fails because another state is running concurrently, and this permanently puts the volume in a failed state.
Summary:
Given the lack of details in the Salt API response (the server returns an error 500 with a generic message; there is not much you can infer from that…), the simplest solution is to make the call to `disk.dump` an async one (that way, we leverage the existing handling of concurrent states). As a side bonus, we also no longer need a timeout (which wasn't handled properly in our case…).
The biggest change is the enrichment of the job ID (which becomes a Job Handle).
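As a sketch of how the operator's reconcile loop might drive such an async job (all type, field, and helper names here are illustrative assumptions, not the actual operator code): the handle is stored in the Volume status, and each reconcile polls the job's JID and requeues until the job reports completion.

```go
package sketch

import (
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// Illustrative stand-ins for the operator's real types.
type saltPoller interface {
	PollJob(jid string) (result map[string]interface{}, done bool, err error)
}

type VolumeReconciler struct {
	salt saltPoller
}

// recordVolumeSize is a hypothetical helper that would persist the size
// extracted from the disk.dump output into the Volume CR.
func (r *VolumeReconciler) recordVolumeSize(result map[string]interface{}) {}

// pollDiskDump sketches one reconcile step: instead of blocking on a sync
// call, check the stored job's JID and requeue until the job is done.
func (r *VolumeReconciler) pollDiskDump(jid string) (ctrl.Result, error) {
	result, done, err := r.salt.PollJob(jid)
	if err != nil {
		return ctrl.Result{}, err
	}
	if !done {
		// Job still running: ask controller-runtime to call us again soon.
		return ctrl.Result{RequeueAfter: 5 * time.Second}, nil
	}
	r.recordVolumeSize(result)
	return ctrl.Result{}, nil
}
```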
Acceptance criteria:
- No regression (existing tests still pass)
- Cannot reproduce the described issue
Closes: #2493