Use async call for disk.dump #2571

Merged: 4 commits merged into development/2.5 on May 27, 2020

Conversation

slaperche-scality
Contributor

Component:

operator

Context:

Sometimes the call to disk.dump fails because another state is running concurrently, and this permanently puts the volume in a failed state.

Summary:

Given the lack of detail in the Salt API response (the server returns an error 500 with a generic message, so there is not much to infer from it…), the simplest solution is to make the call to disk.dump asynchronous: that way we leverage the existing handling of concurrent states.
As a side bonus, we also no longer need a timeout (which wasn't handled properly in our case…).
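
For illustration, here is a minimal sketch (not the operator's actual client code) of what submitting `disk.dump` through Salt API's local_async client could look like in Go; the URL, token handling and function name are placeholders. The key point is that the call returns a job ID immediately instead of blocking on the execution.

```go
package salt

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// submitDiskDump posts a `disk.dump` call with the `local_async` client:
// Salt API answers right away with a JID instead of waiting for the result.
func submitDiskDump(saltAPIURL, token, minion, device string) (string, error) {
	payload, err := json.Marshal(map[string]interface{}{
		"client": "local_async",
		"tgt":    minion,
		"fun":    "disk.dump",
		"arg":    []string{device},
	})
	if err != nil {
		return "", err
	}

	req, err := http.NewRequest("POST", saltAPIURL+"/", bytes.NewReader(payload))
	if err != nil {
		return "", err
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("Accept", "application/json")
	req.Header.Set("X-Auth-Token", token)

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	var result struct {
		Return []struct {
			Jid string `json:"jid"`
		} `json:"return"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
		return "", err
	}
	if len(result.Return) == 0 || result.Return[0].Jid == "" {
		return "", fmt.Errorf("no job ID in Salt API response")
	}
	return result.Return[0].Jid, nil
}
```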

The biggest change is the enrichment of the job ID (which becomes a Job Handle).

Acceptance criteria:

  • No regression (existing tests still pass)
  • Cannot reproduce the described issue


Closes: #2493

Until now, we only had a single Salt async job per step (one to prepare,
one to finalize), thus we knew which job was running by knowing the step
we were in.

This is going to change: since we want to call `disk.dump` asynchronously,
we will have two async jobs for the prepare step.
In order to know which is which, we now carry the job name alongside its
ID and bundle this up into a JobHandle.

This commit only introduces the JobHandle and uses it instead of the
JobID.
Making `disk.dump` async will be done in another commit.

Refs: #2493
Signed-off-by: Sylvain Laperche <[email protected]>
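
An illustrative sketch of the idea described in this commit, assuming hypothetical field names and a "/" separator (the operator's actual JobHandle may differ): carry the job name next to the Salt JID and round-trip the pair through a single string, since it is stored in the Volume's status.

```go
package salt

import (
	"fmt"
	"strings"
)

// JobHandle pairs a Salt job ID with the name of the job it belongs to,
// so that two async jobs in the same step can be told apart.
type JobHandle struct {
	Name string // e.g. "prepare" or "disk.dump"
	ID   string // Salt JID, empty if the job has not been submitted yet
}

// String encodes the handle for storage in the CR status.
func (j JobHandle) String() string {
	return fmt.Sprintf("%s/%s", j.Name, j.ID)
}

// JobFromString parses a stored handle back. An old-style value holding
// a bare JID has no job name attached and is rejected.
func JobFromString(s string) (JobHandle, error) {
	parts := strings.SplitN(s, "/", 2)
	if len(parts) != 2 {
		return JobHandle{}, fmt.Errorf("invalid job handle %q", s)
	}
	return JobHandle{Name: parts[0], ID: parts[1]}, nil
}
```
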
This allows acting on the result of the job.
It is unused for now, but we will exploit it when `disk.dump` is made async.

Refs: #2493
Signed-off-by: Sylvain Laperche <[email protected]>
The Salt client code already handles the retcode; the caller is only
interested in the returned payload.

Refs: #2493
Signed-off-by: Sylvain Laperche <[email protected]>
The sync call to `disk.dump` was flaky for two reasons:
- we had a 1s timeout, but no logic to retry on timeout errors => volume
  creation could fail on "slow" platforms
- Salt API doesn't handle errors triggered by concurrent execution of
  Salt states well when you make a sync call (an error 500 with no
  details is returned), thus we couldn't retry.

For those reasons, we now use an async call to retrieve the volume size.

Closes: #2493
Signed-off-by: Sylvain Laperche <[email protected]>
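
For completeness, a rough sketch of the polling side under the same assumptions (placeholder names, the standard rest_cherrypy job-lookup endpoint, and getsize64 as the size key in the disk.dump payload): later reconciliations look the job up by JID and only consume the payload once the minion has reported back, requeueing otherwise instead of failing the Volume.

```go
package salt

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// lookupJob reports whether the async job identified by jid has finished
// on the given minion, and returns its payload if so.
func lookupJob(saltAPIURL, token, jid, minion string) (bool, map[string]interface{}, error) {
	req, err := http.NewRequest("GET", saltAPIURL+"/jobs/"+jid, nil)
	if err != nil {
		return false, nil, err
	}
	req.Header.Set("Accept", "application/json")
	req.Header.Set("X-Auth-Token", token)

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return false, nil, err
	}
	defer resp.Body.Close()

	var body struct {
		Return []map[string]map[string]interface{} `json:"return"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
		return false, nil, err
	}
	if len(body.Return) == 0 {
		return false, nil, nil
	}
	payload, ok := body.Return[0][minion]
	if !ok {
		// The minion has not reported back yet: requeue and poll again
		// later instead of putting the Volume in a failed state.
		return false, nil, nil
	}
	return true, payload, nil
}

// sizeFromDump extracts the device size from a disk.dump payload,
// assuming the size is exposed under the "getsize64" key.
func sizeFromDump(payload map[string]interface{}) (int64, error) {
	raw, ok := payload["getsize64"]
	if !ok {
		return 0, fmt.Errorf("no getsize64 in disk.dump result")
	}
	size, ok := raw.(float64) // JSON numbers decode to float64
	if !ok {
		return 0, fmt.Errorf("unexpected type for getsize64: %T", raw)
	}
	return int64(size), nil
}
```
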
@slaperche-scality requested a review from a team as a code owner on May 25, 2020 13:51
@bert-e
Contributor

bert-e commented May 25, 2020

Hello slaperche-scality,

My role is to assist you with the merge of this
pull request. Please type @bert-e help to get information
on this process, or consult the user documentation.

Status report is not available.

@bert-e
Contributor

bert-e commented May 25, 2020

Branches have diverged

This pull request's source branch bugfix/2493-fix-call-to-disk-dump has diverged from
development/2.6 by more than 50 commits.

To avoid any integration risks, please re-synchronize them using one of the
following solutions:

  • Merge origin/development/2.6 into bugfix/2493-fix-call-to-disk-dump
  • Rebase bugfix/2493-fix-call-to-disk-dump onto origin/development/2.6

Note: If you choose to rebase, you may have to ask me to rebuild
integration branches using the reset command.

@slaperche-scality changed the base branch from development/2.6 to development/2.5 on May 25, 2020 14:14
@bert-e
Contributor

bert-e commented May 25, 2020

Integration data created

I have created the integration data for the additional destination branches.

The following branches will NOT be impacted:

  • development/1.0
  • development/1.1
  • development/1.2
  • development/1.3
  • development/2.0
  • development/2.1
  • development/2.2
  • development/2.3
  • development/2.4

You can set option create_pull_requests if you need me to create
integration pull requests in addition to integration branches, with:

@bert-e create_pull_requests

@bert-e
Contributor

bert-e commented May 25, 2020

Waiting for approval

The following approvals are needed before I can proceed with the merge:

  • the author

  • one peer

Peer approvals must include at least 1 approval from the following list:

Contributor

@NicolasT left a comment

LGTM but would like more review.

Contributor

@gdemonet left a comment

I'm a bit curious: do we still need a special JOB_DONE_MARKER JID value? Wouldn't adding an extra field about the job status to the JobHandle struct make things simpler?

Also, do we have "migration" tests? Like what happens if the Operator is updated and picks up an old Volume with an active Job in the old format? Is your salt.JobFromString method handling old-style JIDs?

@slaperche-scality
Contributor Author

I'm a bit curious: do we still need a special JOB_DONE_MARKER JID value? Wouldn't adding an extra field about the job status to the JobHandle struct make things simpler?

As we need to serialize it as a string in the end (to store it in the CR's status), we wouldn't gain much.
On one hand, it may make the code a bit simpler since you can write `if job.IsDone` (not sure it's a real benefit, since we still need to test the ID for the empty/non-empty case anyway).
On the other hand, we would need an extra field in the string form when we serialize/parse it back and forth (e.g. name/ID/isDone/result).

I think the current solution is OK, albeit hackish since I'm piggybacking stuff into the string field.

But if we really have more needs, I would start over and have a structured Job field in the CR's status (but I think this should be done in a new 2.X release since it's changing the CRD, not in a patch release).

Also, do we have "migration" tests? Like what happens if the Operator is updated and picks up an old Volume with an active Job in the old format? Is your salt.JobFromString method handling old-style JIDs?

Currently, no.
The current behavior is: the Volume will go into a Failed state, since we can't parse back the JID, and will need to be recreated.
I think it should be unlikely enough to be acceptable.

If we really want to support it, it may be doable but would make the code a bit more complex (because you need to figure out, somehow, whether the current job is Prepare or Unprepare).
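
A minimal sketch of the "piggybacking" described above (the marker value and helper names are placeholders, not the operator's actual encoding): the single ID string doubles as a small state machine, which keeps the value stored in the CR status a plain string.

```go
package salt

// jobDoneMarker is a stand-in for the real JOB_DONE_MARKER value.
const jobDoneMarker = "done"

// JobHandle is repeated here only to keep the sketch self-contained.
type JobHandle struct {
	Name string
	ID   string
}

// NotStarted: the job has not been submitted yet.
func (j JobHandle) NotStarted() bool { return j.ID == "" }

// InFlight: a real Salt JID is stored, so the job should be polled.
func (j JobHandle) InFlight() bool { return j.ID != "" && j.ID != jobDoneMarker }

// IsDone: the marker replaced the JID once the job's result was consumed.
func (j JobHandle) IsDone() bool { return j.ID == jobDoneMarker }
```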

Contributor

@gdemonet left a comment

OK, thanks for the explanation. Indeed, it's easier to keep the current approach for now. In any case, the approach for "polling Salt jobs" may change in the future depending on how we make it evolve.
As for migration, I guess it's OK as long as we mention it in the release notes.

@slaperche-scality
Contributor Author

/approve

@bert-e
Contributor

bert-e commented May 27, 2020

In the queue

The changeset has received all authorizations and has been added to the
relevant queue(s). The queue(s) will be merged in the target development
branch(es) as soon as builds have passed.

The changeset will be merged in:

  • ✔️ development/2.5

  • ✔️ development/2.6

The following branches will NOT be impacted:

  • development/1.0
  • development/1.1
  • development/1.2
  • development/1.3
  • development/2.0
  • development/2.1
  • development/2.2
  • development/2.3
  • development/2.4

There is no action required on your side. You will be notified here once
the changeset has been merged. In the unlikely event that the changeset
fails permanently on the queue, a member of the admin team will
contact you to help resolve the matter.

IMPORTANT

Please do not attempt to modify this pull request.

  • Any commit you add on the source branch will trigger a new cycle after the
    current queue is merged.
  • Any commit you add on one of the integration branches will be lost.

If you need this pull request to be removed from the queue, please contact a
member of the admin team now.

The following options are set: approve

@bert-e
Contributor

bert-e commented May 27, 2020

I have successfully merged the changeset of this pull request
into the targeted development branches:

  • ✔️ development/2.5

  • ✔️ development/2.6

The following branches have NOT changed:

  • development/1.0
  • development/1.1
  • development/1.2
  • development/1.3
  • development/2.0
  • development/2.1
  • development/2.2
  • development/2.3
  • development/2.4

Please check the status of the associated issue (none).

Goodbye slaperche-scality.

@bert-e merged commit 4dd09ec into development/2.5 on May 27, 2020
@bert-e deleted the bugfix/2493-fix-call-to-disk-dump branch on May 27, 2020 13:12
Development

Successfully merging this pull request may close these issues.

Handle error on disk.dump Salt call
4 participants