Clarify the state of an instance after a failed update or delete #570

fmui · 2018-08-07T09:19:00Z

See issue #566.

Update

… or deletion operation

cfdreddbot · 2018-08-07T09:19:02Z

Hey fmui!

Thanks for submitting this pull request! I'm here to inform the recipients of the pull request that you and the commit authors have already signed the CLA.

duglin · 2018-08-07T13:57:20Z

spec.md

@@ -782,6 +782,7 @@ For success responses, the following fields are defined:
 | --- | --- | --- |
 | state* | string | Valid values are `in progress`, `succeeded`, and `failed`. While `"state": "in progress"`, the Platform SHOULD continue polling. A response with `"state": "succeeded"` or `"state": "failed"` MUST cause the Platform to cease polling. |
 | description | string | A user-facing message that can be used to tell the user details about the status of the operation. If present, MUST be a non-empty string. |
+| instance_corrupt | boolean | For failed update and deprovisioning operations, this field indicates whether the instance is still usable or not. If the value is `true`, the Service Instance MUST be considered corrupt and the Platform SHOULD NOT allow the creation of new bindings. If the value is `false`, the Service Instance is in an unmodified and usable state. The default is true. |


Last sentence: When not specified, the default value is true.

If we head down this path, it seems like we'd need to support returning this value in the sync case too, right?

While I would like to have the spec be precise about the state of the resource upon failure, for backwards compatibility can we? I'm wondering if the last sentence should be:
If not specified, the state of the resource is unspecified by this specification.

The sync case is covered by HTTP status codes. 4xx means the instance is still usable, 5xx means the instance is corrupt.

You are right with the last sentence. But from a practical point of view, it makes no difference. The platform must treat the instance as corrupt.

ah I missed the correlation to the 400/500 response codes. Makes sense. thanks

duglin · 2018-08-17T16:25:28Z

LGTM

n3wscott · 2018-08-21T15:24:52Z

spec.md

@@ -782,6 +782,7 @@ For success responses, the following fields are defined:
 | --- | --- | --- |
 | state* | string | Valid values are `in progress`, `succeeded`, and `failed`. While `"state": "in progress"`, the Platform SHOULD continue polling. A response with `"state": "succeeded"` or `"state": "failed"` MUST cause the Platform to cease polling. |
 | description | string | A user-facing message that can be used to tell the user details about the status of the operation. If present, MUST be a non-empty string. |
+| instance_corrupt | boolean | For failed update and deprovisioning operations, this field indicates whether the instance is still usable or not. If the value is `true`, the Service Instance MUST be considered corrupt and the Platform SHOULD NOT allow the creation of new bindings. If the value is `false`, the Service Instance is in an unmodified and usable state. If not specified, the state of the resource is unspecified by this specification. |


why not add corrupt to the state value list?

or a new instance_state = string field?

What other states came to mind @n3wscott ?

I don't think we can add a new value to "state" because that would break existing implementations that are only expecting values from the list provided - unless we say that there are only 2 impls (CF and svccat). TBH I'm not sure we can really make that claim given I know of at least one other implementation of a platform that's not CF or SVCCAT.

@mattmcneeney rolled_back See #508

pmorie

Left a couple specific questions but I need to think more about this. I don't want to be nit-picky about names, but I'm a little concerned about people confusing this with some kind of health check. I wonder if it's more clear to call this unusable_after_update.

I think we should also talk about the 'delete' part of this. Do we need that? Do people expect their service instances to be usable after deletion today? In Kubernetes, once the system accepts the user's deletion request, there's no going back to an 'undeleted' state.

pmorie · 2018-08-21T15:48:46Z

spec.md

@@ -782,6 +782,7 @@ For success responses, the following fields are defined:
 | --- | --- | --- |
 | state* | string | Valid values are `in progress`, `succeeded`, and `failed`. While `"state": "in progress"`, the Platform SHOULD continue polling. A response with `"state": "succeeded"` or `"state": "failed"` MUST cause the Platform to cease polling. |
 | description | string | A user-facing message that can be used to tell the user details about the status of the operation. If present, MUST be a non-empty string. |
+| instance_corrupt | boolean | For failed update and deprovisioning operations, this field indicates whether the instance is still usable or not. If the value is `true`, the Service Instance MUST be considered corrupt and the Platform SHOULD NOT allow the creation of new bindings. If the value is `false`, the Service Instance is in an unmodified and usable state. If not specified, the state of the resource is unspecified by this specification. |


Are there invalid combinations of instance_corrupt and state? If so we should be explicit about them. Is it valid to see state: succeeded and instance_corrupt: true?

Can I update a corrupt instance?

The For failed update and deprovisioning operations... text implies it only applies when state="failed" but I agree we should be more explicit about it.

pmorie · 2018-08-21T15:50:34Z

spec.md

+Service Instance MUST be in an unmodified and usable state.
+
+Responses with any other status code MUST be interpreted as a failure.
+The Service Instance MUST be considered corrupt and the Platform SHOULD NOT


I think we should have a section on corrupt instances that explains:

what the meaning of 'corrupt' is

when you should set instance_corrupt as a broker author

whether corrupt instances can be updated

This "Responses with any other status code..." is kind of worrisome. If it said "Any other 5xx error..." then I agree. But what about some 3xx status code? If people agree I can open a new PR to address this since this isn't really part of this PR

@fmui did you want to address these comments?

actually, does "any other" mean non-4xx or does it mean "not mentioned above in the table or previous paragraph" ? I think we need to clarify this.

fmui · 2018-08-22T08:01:51Z

@pmorie Updates and deletes are not very different in this regard. In both cases, the operations fails if a pre-condition check fails. The broker may not have touched the instance at this point. It is still running as before.

Think of a document management repository that contains documents that must not be deleted for legal reasons. The deletion of the repository must fail. If the instance becomes unusable after such a failure, there is no handle for the end-user to get access to the repository anymore. It would require some admin intervention to either fix this or provide an alternative way to access the repository. That doesn't scale.

A WebHook can minimize the chance of such a situation, but it cannot prevent it. We have to be able to handle that for update and delete operations.

duglin · 2018-09-11T12:49:39Z

CI failure looks real

duglin · 2018-09-17T21:54:30Z

spec.md

+again, is undefined.
+Deprovisioning a corrupt instance SHOULD still be possible. A Platform MUST
+remember the Service Instance until it is successfully deprovisioned or it has
+been cleaned up as an orphan.


It really feels like we need some text here about the Platform performing orphan mitigation. What if we combine #598 into this PR so we get the text just right.

duglin · 2018-09-17T22:04:55Z

spec.md

@@ -782,6 +783,7 @@ For success responses, the following fields are defined:
 | --- | --- | --- |
 | state* | string | Valid values are `in progress`, `succeeded`, and `failed`. While `"state": "in progress"`, the Platform SHOULD continue polling. A response with `"state": "succeeded"` or `"state": "failed"` MUST cause the Platform to cease polling. |
 | description | string | A user-facing message that can be used to tell the user details about the status of the operation. If present, MUST be a non-empty string. |
+| instance_corrupt | boolean | For failed update and deprovisioning operations, this field indicates whether the instance is still usable or not. If the value is `true`, the Service Instance MUST be considered corrupt and the Platform SHOULD NOT allow the creation of new bindings. If the value is `false`, the Service Instance is in an unmodified and usable state. If not specified, the state of the resource is unspecified by this specification. |


I know this might sound kind of odd, but I thought I'd toss it out there... this "instance_corrupt" field is sort of the equivalent of returning a 5xx status code in the synchronous case. However, its not as useful because in the state=failed+instance_corrupt=false case we don't know whether its the equivalent of a 400 or 402. All we have is the description text, which is nice for humans but computers can't do a lot with it if its trying to make some decision based on 400 vs 402. The same applies to 5xx error, we don't know which one it is - if that's meaningful. Would it make more sense to make our sync and async flows more symmetrical and have a "status_code" field instead that's set to the HTTP status code that we would have returned if the flow was synchronous? Then we don't need to have special spec language to deal with two possible ways of getting back the status code from the op. The Platform side of it would be easier too because then there's just one type of field to check regardless of sync vs async.

(I assume you mean 422, not 402.)
If we go that route, we should also include the error field from the “Service Broker Errors” section.

The "instance_corrupt" for failed update concerns me a lot. The expectations of the failed update is that the state of resource will be rolled back to the previous successful state.

mattmcneeney · 2018-10-02T16:12:42Z

@fmui I think we've dropped the ball on this one a little. Anything we still need to do?

duglin · 2018-10-02T19:39:18Z

I think @fmui is reworking the PR to return a "statusCode" type of property that matches what the HTTP status code would be if it were synchronous. But I could be wrong.

fmui · 2018-10-05T15:30:11Z

I have updated the PR as discussed in one of our calls.

duglin · 2018-10-16T16:05:24Z

@fmui I'm sorry for missing this one for today's call - I'll add it to next week's.
All, please review when you get a chance.

mattmcneeney · 2018-10-23T16:54:21Z

@fmui I'm interested to hear what common status codes you'd expect to see returned (or if there even are common ones).

Also, I'm confused by the sentence matches what the HTTP status code would be if it were synchronous. We don't really mean that, right? This isn't to do with sync vs async; it's to do with allowing the broker to specify what error code it would return if it wasn't conforming to the /v2/service_instances/:guid/last_operation spec where it has to return a 200 OK along with a state field.

duglin · 2018-10-30T10:06:32Z

@fmui ^^^

fmui · 2018-11-06T08:39:00Z

@mattmcneeney The idea is to keep the logic on the broker and the platform the same for synchronous and asynchronous calls in case of an error. For synchronous provision calls, the spec defines the HTTP status codes 400, 409, and 422 (not applicable here). Additionally, we have to consider 5xx status codes as well. If an asynchronous provisioning fails, the status_code field should provide the appropriate status code from the list above. For example, an invalid parameter combination should result in the status code 400.

duglin · 2018-11-06T10:56:03Z

rebase needed

mattmcneeney · 2018-11-13T16:41:54Z

A number of changes have now been made to this PR, and it has now been rebased, so please can we get another round of reviews from @openservicebrokerapi/osbapi-pmc Thank you!

Samze · 2018-11-13T17:03:37Z

spec.md

+Service Instance MUST be in an unmodified and usable state.
+
+Responses with any other status code MUST be interpreted as a failure.
+The Service Instance MUST be considered corrupt and the Platform SHOULD NOT


I worry that if a Service Broker has an intermittent failure, and returns 500s for a period of time, then instances that receive update calls during this time will be marked as corrupt, leaving them unusable.

I would prefer the broker be explicit about when an instance is corrupt rather than inferring it. It seems kinda risky.

I completely agree with @Samze. I'm also starting to question how prevalent this problem is in the wild. Do service brokers ever leave instances as corrupt today? Do they allow this to happen or have they built preventative measures?

Doesn't this call into question the entire orphan mitigation strategy?

@Samze I agree with you, but that would be an incompatible change. The PR just clarifies the current spec, it doesn't change it. The current spec says, a status code 500 must be interpreted as a failure. Following the spec, orphan mitigation must kick in. Depending on the platform, there is shorter or longer period of time between the failure and orphan mitigation. Within this timeframe, the platform must already today consider the service instance as unusable. Otherwise, why should it trigger the orphan mitigation process later? So, the PR just spells out that platform should treat this instance as broken.

@mattmcneeney There are brokers that update in multiple steps. If one step fails, they can’t move forwards or backwards and then they wait for orphan mitigation. Sure, they could immediately clean up, but from a platform perspective this is not any different.

mattmcneeney · 2018-11-14T13:33:55Z

spec.md

 | description | string | A user-facing message that can be used to tell the user details about the status of the operation. |
+| status_code | number | The HTTP status code that would have been returned if the operation would have been executed synchronously. If the state is `failed` this field SHOULD be present and the value MUST be an integer in the range of 400 to 599. This field MUST NOT be present for any other state. |


This is the line that I don't like: The HTTP status code that would have been returned if the operation would have been executed synchronously.. That suggests that only the HTTP codes defined for a synchronous response are permitted.

What about An HTTP code that describes why the operation failed.?

Or, how about we explicitly call out the status codes that can be returned? This would help platforms build a better UX around this:
200 OK / 400 Bad Request / 409 Conflict seem appropriate from the Provision section.

That suggests that only the HTTP codes defined for a synchronous response are permitted.

The sync case's status codes can be anything, not just what's listed in the table - I don't think we need to list all possible values since we don't in the sync case. As for your proposed alternative text in the first comment, I'm ok with it but that's because I don't see the difference :-)

Thinking about this more... I do think it's important that the spec correlate the sync and async flows - meaning the status_code isn't just some random HTTP code - if the operation were invoked twice (once sync and once async) then typically people would see the same status_code in both cases, just in different locations/messages. W/o that I fear some people will not realize what we're trying to do and come up with different codes between the two flows.

mattmcneeney · 2018-11-14T14:12:25Z

spec.md

 | description | string | A user-facing message that can be used to tell the user details about the status of the operation. |
+| status_code | number | The HTTP status code that would have been returned if the operation would have been executed synchronously. If the state is `failed` this field SHOULD be present and the value MUST be an integer in the range of 400 to 599. This field MUST NOT be present for any other state. |
+| error | string | An error code as described in the [Service Broker Errors](#service-broker-errors) section. If present, MUST be a non-empty string. If the state is `failed` and there is an error code this field SHOULD be present. This field MUST NOT be present for any other state. |


Which error code are we actually expecting to get back here?
AsyncRequired = not applicable
ConcurrencyError = should be returned to the original request with a 422 as outlined in the Blocking Operations section (surely the broker can decide at that time whether or not the instance is already being operated on?)
RequiresApp = not applicable

You have listed the codes the spec defines and none of them might be applicable here. But a broker can make up its own error code if the spec doesn’t define one.
But the main reason for this field is for parity with an error messages of a synchronous call.

mattmcneeney · 2018-11-14T14:14:25Z

Sorry it took me so long to review this again @fmui. I've just added some comments, interested to get your feedback!

nilebox · 2018-11-30T20:36:24Z

@fmui another thing missing in the last operation response is whether the failure is retryable or not, see kubernetes-retired/service-catalog#2505

Service Catalog currently indefinitely retries after getting 200 response with state: failed response. It also performs orphan mitigation every time.

duglin · 2019-01-09T14:51:52Z

@fmui what's the status of this? I've been assuming we're waiting for an update, is that true?

fmui · 2019-01-11T12:47:43Z

@nilebox Interesting. I think create and delete operations should always be repeatable, but maybe not immediately because some human intervention may be required. Updates may not always be repeatable. The failed update might have damaged the instance beyond repair and the instance can only be deleted or has already been deleted by the broker.
So, do we need another flag retryable with the values no, immediately, and later?

nilebox · 2019-01-12T03:32:54Z

@fmui to me both "immediately" and "later" are just "retryable". If the update cannot be retried immediately, the broker will just keep returning state: failed along with retryable: true during this period.

Otherwise it's not clear what does "later" mean, and then we would have to return something like "retryAfter" with a timestamp, which the broker may not actually know as it cannot predict the future :)

fmui · 2019-01-15T12:06:58Z

@nilebox let me try different terms. Do we need a flag retryable with the values yes, no, and wait?
yes means retrying is possible. no means there is no point in retrying it. wait means don’t retry until this operation has been triggered again manually (or something similar).
My default example for the latter is a content repository with a legal hold. An unprovision request must fail because of the legal hold. Whether the repository should be eventually deleted or not is a business decision, which may take days. The is no point in retrying before the decision is made and it could also be decided not to delete the repository at all.
It is possible to merge no and wait into one value, but then we have to very clear in the spec that retrying this operation is not possible now, but retriggering the operation later (manually) may be successful.

nilebox · 2019-01-15T12:16:58Z

@fmui thanks for clarification, it makes more sense now. I agree that having distinction for an operation that can be manually retried may be useful, but it would probably make sense to move it to a separate flag, i.e. manually retryable operation would have a boolean flag retryable: true and some additional field retryCondition with values "auto" / "manual", or similar (i.e. some additional attribute that the marketplace/platform may choose to recognize or ignore).

duglin · 2019-01-15T12:52:58Z

Service Catalog currently indefinitely retries after getting 200 response with state: failed response. It also performs orphan mitigation every time.

FYI - We have customers that are complaining about this behavior. They have cases where they feel like their broker is being flooded with provision() requests even though the broker continually says "no" (state:failed). I believe that CF will stop on the first broker "no" - but perhaps @mattmcneeney can confirm. Either way, I think this PR will really help because with the status code the Platfrom (svc-cat) should now be able to determine whether to retry or not. Meaning, perhaps something like, retry on 5xx but not on 4xx.

nilebox · 2019-01-15T20:59:31Z

FYI - We have customers that are complaining about this behavior. They have cases where they feel like their broker is being flooded with provision() requests even though the broker continually says "no" (state:failed).

FYI, I have changed this behaviour recently, see kubernetes-retired/service-catalog#2520

But now we have an opposite problem: we treat all failed async operations as non-retryable, so having additional flag retryable: true would help there.

mattmcneeney · 2019-01-17T10:39:42Z

Hey @nilebox

I'm curious if you've had any feedback from service providers that they would like Platforms to retry operations after the last operation has returned "state": "failed".

@duglin mentioned that they have had feedback that this could happen if a broker was being flooded with requests, and I'm wondering if there are any other use cases we know of?

Thanks!

nilebox · 2019-01-17T11:24:03Z

I'm curious if you've had any feedback from service providers that they would like Platforms to retry operations after the last operation has returned "state": "failed".

@mattmcneeney yes, we do have teams in Atlassian who have implemented OSB API and have cases where retry after "state": "failed" would succeed. For example, if a cloud service ran out of capacity, the provisioning may fail, but after it scaled up or some resources freed up, it will succeed upon retry. Currently, users will need to manually force the retry via modifying some of the instance parameters.

Another example is self-healing in Kubernetes. If you ask to create a Pod with a Docker image that wasn't pushed by mistake, Kubernetes will keep retrying to pull the Docker image rather than giving up immediately. So to fix the mistake user just needs to push a Docker image, rather than updating Pod parameters (which are actually correct).

fmui · 2019-02-18T13:24:36Z

Replaced by PR #637.

fmui and others added 2 commits August 6, 2018 09:24

Merge pull request #1 from openservicebrokerapi/master

c6d21bf

Update

Added clarification of the state of an instance after a failed update…

0ce0e67

… or deletion operation

fmui mentioned this pull request Aug 7, 2018

Clarify the state of an instance after a failed update or delete #566

Open

duglin reviewed Aug 7, 2018

View reviewed changes

Clarified default of the instance_corrupt field.

0c1308a

edbartsch mentioned this pull request Aug 20, 2018

Proposal: Validating WebHooks for Create/Update/Delete of Instances & Bindings #572

Open

n3wscott reviewed Aug 21, 2018

View reviewed changes

pmorie reviewed Aug 21, 2018

View reviewed changes

Added section about corrupt instances

a8bc00f

duglin reviewed Sep 17, 2018

View reviewed changes

Fixed CI failure

550c6e5

Replaced instance_corrupt with status_code

549b855

Merge branch 'master' into instance-state

546cd23

Samze reviewed Nov 13, 2018

View reviewed changes

mattmcneeney reviewed Nov 14, 2018

View reviewed changes

duglin mentioned this pull request Nov 30, 2018

LastOperationRequest 200 response incorrectly triggers orphan mitigation kubernetes-retired/service-catalog#2505

Closed

duglin mentioned this pull request Dec 4, 2018

Orphan mitigation could be initiated by end users #598

Merged

fmui mentioned this pull request Feb 18, 2019

Allow Service Brokers to indicate the state of a Service Instance after a failed update or deprovisioning #637

Open

fmui closed this Feb 18, 2019

		\| description \| string \| A user-facing message that can be used to tell the user details about the status of the operation. \|
		\| status_code \| number \| The HTTP status code that would have been returned if the operation would have been executed synchronously. If the state is `failed` this field SHOULD be present and the value MUST be an integer in the range of 400 to 599. This field MUST NOT be present for any other state. \|

Clarify the state of an instance after a failed update or delete #570

Clarify the state of an instance after a failed update or delete #570

Conversation

fmui commented Aug 7, 2018

cfdreddbot commented Aug 7, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

duglin commented Aug 17, 2018 • edited by avade Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

pmorie left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

fmui commented Aug 22, 2018

duglin commented Sep 11, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mattmcneeney commented Oct 2, 2018

duglin commented Oct 2, 2018

fmui commented Oct 5, 2018

duglin commented Oct 16, 2018

mattmcneeney commented Oct 23, 2018

duglin commented Oct 30, 2018

fmui commented Nov 6, 2018 • edited Loading

duglin commented Nov 6, 2018

mattmcneeney commented Nov 13, 2018

Samze Nov 13, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mattmcneeney commented Nov 14, 2018

nilebox commented Nov 30, 2018

duglin commented Jan 9, 2019

fmui commented Jan 11, 2019

nilebox commented Jan 12, 2019

fmui commented Jan 15, 2019

nilebox commented Jan 15, 2019

duglin commented Jan 15, 2019

nilebox commented Jan 15, 2019

mattmcneeney commented Jan 17, 2019

nilebox commented Jan 17, 2019

fmui commented Feb 18, 2019

duglin commented Aug 17, 2018 •

edited by avade

Loading

fmui commented Nov 6, 2018 •

edited

Loading

Samze Nov 13, 2018 •

edited

Loading