Use async call for disk.dump #2571

Merged: 4 commits merged into development/2.5 on May 27, 2020

Conversation

slaperche-scality
Contributor

Component:

operator

Context:

Sometimes the call to disk.dump fails because another state is running concurrently, and this permanently puts the volume in a failed state.

Summary:

Given the lack of detail in the Salt API response (the server returns an error 500 with a generic message, so there is not much to infer from it…), the simplest solution is to make the call to disk.dump asynchronous: that way we leverage the existing handling of concurrent states.
As a side bonus, we also no longer need a timeout (which wasn't handled properly in our case…).
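
For illustration, here is a minimal sketch (not the operator's actual client code) of what submitting `disk.dump` through Salt API's local_async client could look like in Go; the URL, token handling and function name are placeholders. The key point is that the call returns a job ID immediately instead of blocking on the execution.

```go
package salt

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// submitDiskDump posts a `disk.dump` call with the `local_async` client:
// Salt API answers right away with a JID instead of waiting for the result.
func submitDiskDump(saltAPIURL, token, minion, device string) (string, error) {
	payload, err := json.Marshal(map[string]interface{}{
		"client": "local_async",
		"tgt":    minion,
		"fun":    "disk.dump",
		"arg":    []string{device},
	})
	if err != nil {
		return "", err
	}

	req, err := http.NewRequest("POST", saltAPIURL+"/", bytes.NewReader(payload))
	if err != nil {
		return "", err
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("Accept", "application/json")
	req.Header.Set("X-Auth-Token", token)

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	var result struct {
		Return []struct {
			Jid string `json:"jid"`
		} `json:"return"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
		return "", err
	}
	if len(result.Return) == 0 || result.Return[0].Jid == "" {
		return "", fmt.Errorf("no job ID in Salt API response")
	}
	return result.Return[0].Jid, nil
}
```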

The biggest change is the enrichment of the job ID (which becomes a Job Handle).

Acceptance criteria:

  • No regression (existing tests still pass)
  • Cannot reproduce the described issue


Closes: #2493

Until now, we only had a single Salt async job per step (one to prepare,
one to finalize), thus we knew which job was running by knowing the step
we were in.

This is going to change: since we want to call `disk.dump` asynchronously,
we will have two async jobs for the prepare step.
In order to know which is which, we now carry the job name alongside its
ID and bundle this up into a JobHandle.

This commit only introduces the JobHandle and uses it instead of the
JobID.
Making `disk.dump` async will be done in another commit.

Refs: #2493
Signed-off-by: Sylvain Laperche <[email protected]>
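
An illustrative sketch of the idea described in this commit, assuming hypothetical field names and a "/" separator (the operator's actual JobHandle may differ): carry the job name next to the Salt JID and round-trip the pair through a single string, since it is stored in the Volume's status.

```go
package salt

import (
	"fmt"
	"strings"
)

// JobHandle pairs a Salt job ID with the name of the job it belongs to,
// so that two async jobs in the same step can be told apart.
type JobHandle struct {
	Name string // e.g. "prepare" or "disk.dump"
	ID   string // Salt JID, empty if the job has not been submitted yet
}

// String encodes the handle for storage in the CR status.
func (j JobHandle) String() string {
	return fmt.Sprintf("%s/%s", j.Name, j.ID)
}

// JobFromString parses a stored handle back. An old-style value holding
// a bare JID has no job name attached and is rejected.
func JobFromString(s string) (JobHandle, error) {
	parts := strings.SplitN(s, "/", 2)
	if len(parts) != 2 {
		return JobHandle{}, fmt.Errorf("invalid job handle %q", s)
	}
	return JobHandle{Name: parts[0], ID: parts[1]}, nil
}
```
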
This allows acting on the result of the job.
It is unused for now, but we will exploit it when `disk.dump` is made async.

Refs: #2493
Signed-off-by: Sylvain Laperche <[email protected]>
The Salt client code already handles the retcode; the caller is only
interested in the returned payload.

Refs: #2493
Signed-off-by: Sylvain Laperche <[email protected]>
The sync call to `disk.dump` was flaky for two reasons:
- we had a 1s timeout, but no logic to retry on timeout errors => volume
  creation could fail on "slow" platforms
- Salt API doesn't handle errors triggered by concurrent execution of
  Salt states well when you make a sync call (an error 500 with no
  details is returned), thus we couldn't retry.

For those reasons, we now use an async call to retrieve the volume size.

Closes: #2493
Signed-off-by: Sylvain Laperche <[email protected]>
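
For completeness, a rough sketch of the polling side under the same assumptions (placeholder names, the standard rest_cherrypy job-lookup endpoint, and getsize64 as the size key in the disk.dump payload): later reconciliations look the job up by JID and only consume the payload once the minion has reported back, requeueing otherwise instead of failing the Volume.

```go
package salt

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// lookupJob reports whether the async job identified by jid has finished
// on the given minion, and returns its payload if so.
func lookupJob(saltAPIURL, token, jid, minion string) (bool, map[string]interface{}, error) {
	req, err := http.NewRequest("GET", saltAPIURL+"/jobs/"+jid, nil)
	if err != nil {
		return false, nil, err
	}
	req.Header.Set("Accept", "application/json")
	req.Header.Set("X-Auth-Token", token)

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return false, nil, err
	}
	defer resp.Body.Close()

	var body struct {
		Return []map[string]map[string]interface{} `json:"return"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
		return false, nil, err
	}
	if len(body.Return) == 0 {
		return false, nil, nil
	}
	payload, ok := body.Return[0][minion]
	if !ok {
		// The minion has not reported back yet: requeue and poll again
		// later instead of putting the Volume in a failed state.
		return false, nil, nil
	}
	return true, payload, nil
}

// sizeFromDump extracts the device size from a disk.dump payload,
// assuming the size is exposed under the "getsize64" key.
func sizeFromDump(payload map[string]interface{}) (int64, error) {
	raw, ok := payload["getsize64"]
	if !ok {
		return 0, fmt.Errorf("no getsize64 in disk.dump result")
	}
	size, ok := raw.(float64) // JSON numbers decode to float64
	if !ok {
		return 0, fmt.Errorf("unexpected type for getsize64: %T", raw)
	}
	return int64(size), nil
}
```
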
@slaperche-scality requested a review from a team as a code owner on May 25, 2020 13:51
@bert-e
Contributor

bert-e commented May 25, 2020

Hello slaperche-scality,

My role is to assist you with the merge of this
pull request. Please type @bert-e help to get information
on this process, or consult the user documentation.

Status report is not available.

@bert-e
Contributor

bert-e commented May 25, 2020

Branches have diverged

This pull request's source branch bugfix/2493-fix-call-to-disk-dump has diverged from
development/2.6 by more than 50 commits.

To avoid any integration risks, please re-synchronize them using one of the
following solutions:

  • Merge origin/development/2.6 into bugfix/2493-fix-call-to-disk-dump
  • Rebase bugfix/2493-fix-call-to-disk-dump onto origin/development/2.6

Note: If you choose to rebase, you may have to ask me to rebuild
integration branches using the reset command.

@slaperche-scality changed the base branch from development/2.6 to development/2.5 on May 25, 2020 14:14
@bert-e
Contributor

bert-e commented May 25, 2020

Integration data created

I have created the integration data for the additional destination branches.

The following branches will NOT be impacted:

  • development/1.0
  • development/1.1
  • development/1.2
  • development/1.3
  • development/2.0
  • development/2.1
  • development/2.2
  • development/2.3
  • development/2.4

You can set option create_pull_requests if you need me to create
integration pull requests in addition to integration branches, with:

@bert-e create_pull_requests

@bert-e
Contributor

bert-e commented May 25, 2020

Waiting for approval

The following approvals are needed before I can proceed with the merge:

  • the author

  • one peer

Peer approvals must include at least 1 approval from the following list:

Contributor

@NicolasT left a comment

LGTM but would like more review.

Contributor

@gdemonet left a comment

I'm a bit curious: do we still need a special JOB_DONE_MARKER JID value? Wouldn't adding an extra field about the job status to the JobHandle struct make things simpler?

Also, do we have "migration" tests? Like what happens if the Operator is updated and picks up an old Volume with an active Job in the old format? Is your salt.JobFromString method handling old-style JIDs?

@slaperche-scality
Contributor Author

I'm a bit curious: do we still need a special JOB_DONE_MARKER JID value? Wouldn't adding an extra field about the job status to the JobHandle struct make things simpler?

As we need to serialize it as a string in the end (to store it in the CR's status), we wouldn't gain much.
On one hand, it may make the code a bit simpler since you can write `if job.IsDone` (not sure it's a real benefit, since we still need to test the ID for the empty/non-empty case anyway).
On the other hand, we would need an extra field in the string form when we serialize/parse it back and forth (e.g. name/ID/isDone/result).

I think the current solution is OK, albeit hackish since I'm piggybacking stuff into the string field.

But if we really have more needs, I would start over and have a structured Job field in the CR's status (but I think this should be done in a new 2.X release since it's changing the CRD, not in a patch release).

Also, do we have "migration" tests? Like what happens if the Operator is updated and picks up an old Volume with an active Job in the old format? Is your salt.JobFromString method handling old-style JIDs?

Currently, no.
The current behavior is: the Volume will go into a Failed state, since we can't parse back the JID, and will need to be recreated.
I think it should be unlikely enough to be acceptable.

If we really want to support it, it may be doable but would make the code a bit more complex (because you need to figure out, somehow, whether the current job is Prepare or Unprepare).
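
A minimal sketch of the "piggybacking" described above (the marker value and helper names are placeholders, not the operator's actual encoding): the single ID string doubles as a small state machine, which keeps the value stored in the CR status a plain string.

```go
package salt

// jobDoneMarker is a stand-in for the real JOB_DONE_MARKER value.
const jobDoneMarker = "done"

// JobHandle is repeated here only to keep the sketch self-contained.
type JobHandle struct {
	Name string
	ID   string
}

// NotStarted: the job has not been submitted yet.
func (j JobHandle) NotStarted() bool { return j.ID == "" }

// InFlight: a real Salt JID is stored, so the job should be polled.
func (j JobHandle) InFlight() bool { return j.ID != "" && j.ID != jobDoneMarker }

// IsDone: the marker replaced the JID once the job's result was consumed.
func (j JobHandle) IsDone() bool { return j.ID == jobDoneMarker }
```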

Contributor

@gdemonet left a comment

OK, thanks for the explanation. Indeed, it's easier to keep the current approach for now. In any case, the approach for "polling Salt jobs" may change in the future depending on how we make it evolve.
As for migration, I guess it's OK as long as we mention it in the release notes.

@slaperche-scality
Contributor Author

/approve

@bert-e
Contributor

bert-e commented May 27, 2020

In the queue

The changeset has received all authorizations and has been added to the
relevant queue(s). The queue(s) will be merged in the target development
branch(es) as soon as builds have passed.

The changeset will be merged in:

  • ✔️ development/2.5

  • ✔️ development/2.6

The following branches will NOT be impacted:

  • development/1.0
  • development/1.1
  • development/1.2
  • development/1.3
  • development/2.0
  • development/2.1
  • development/2.2
  • development/2.3
  • development/2.4

There is no action required on your side. You will be notified here once
the changeset has been merged. In the unlikely event that the changeset
fails permanently on the queue, a member of the admin team will
contact you to help resolve the matter.

IMPORTANT

Please do not attempt to modify this pull request.

  • Any commit you add on the source branch will trigger a new cycle after the
    current queue is merged.
  • Any commit you add on one of the integration branches will be lost.

If you need this pull request to be removed from the queue, please contact a
member of the admin team now.

The following options are set: approve

@bert-e
Contributor

bert-e commented May 27, 2020

I have successfully merged the changeset of this pull request
into the targeted development branches:

  • ✔️ development/2.5

  • ✔️ development/2.6

The following branches have NOT changed:

  • development/1.0
  • development/1.1
  • development/1.2
  • development/1.3
  • development/2.0
  • development/2.1
  • development/2.2
  • development/2.3
  • development/2.4

Please check the status of the associated issue (none).

Goodbye slaperche-scality.

@bert-e merged commit 4dd09ec into development/2.5 on May 27, 2020
@bert-e deleted the bugfix/2493-fix-call-to-disk-dump branch on May 27, 2020 13:12
Development

Successfully merging this pull request may close these issues.

Handle error on disk.dump Salt call
4 participants