-
Notifications
You must be signed in to change notification settings - Fork 512
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Updating GEP-1742 to clarify timeout formatting #2155
Conversation
@robscott: GitHub didn't allow me to request PR reviews from the following users: frankbu, srodi, candita. Note that only kubernetes-sigs members and repo collaborators can review this PR, and authors cannot review their own PRs. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: robscott The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Adding a hold for consensus. /hold |
/retest |
@@ -364,7 +364,7 @@ sequenceDiagram | |||
|
|||
Both timeout fields are string duration values as specified by | |||
[Golang time.ParseDuration](https://pkg.go.dev/time#ParseDuration) and MUST be >= 1ms | |||
or 0 to disable (no timeout). | |||
and <= 100y. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added some comments related to this in the API PR (https://github.com/kubernetes-sigs/gateway-api/pull/2013/files#r1245498559), but thinking about this more, I really don't see the value of specifying an upper limit.
The more I think about this, it seems that any value greater than a day or so, basically means no timeout. Certainly nobody would want to set a timeout to something like 5 years and really expect that the client and service will both be up and running in a paused state until it returns a value before the 5y deadline.
Unless I'm missing something, a user will want to configure the timeout to something relatively short, and any larger value really just mean "don't timeout", i.e., just hang until the client connection is terminated. So that's why I think the current design is right. User can set the timeout to some small value or 0 to disable timeouts. A user can also chose to set a pretty huge value which the implementation will likely support (up to the implementation limit of probably an int64 max value ns of ~300 years), but any specific number they chose that is large will work the same as 0 for all practical purposes. So documenting some arbitrary large max value is really meaningless IMO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One more related thought. If we want to say anything about max value it could be something like the maximum value is implementation dependent, but MUST BE at least 1y
, or something like that.
That way the user can set it to any sensible value they'd like (>= 1ms) or 0 for infinite (don't timeout). Note that an implementation can implement this infinite (0) timeout by setting the timeout to it's internal max value which will have the same effect given that it's 1y or more.
@robscott btw, you said:
In Kubernetes APIs we always try to have a clear upper and lower bound for fields. This enables implementations to write tests that ensure they can support the min and max values for any field in the API.
How would an implementation test that a 10 year timeout works?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How would an implementation test that a 10 year timeout works?
Yeah, you're right, it's not really possible to write a test that waits this long, and any value that we can feasibly test is likely lower than at least some users would want. I think the real win here is that implementations can know that they support the full range of possible values allowed by the API just by understanding their dataplane.
If we want to say anything about max value it could be something like the maximum value is implementation dependent, but MUST BE at least 1y, or something like that.
I think that's reasonable, but I'm struggling to think of any use case that would benefit from > 1y, so distinguishing beyond that is likely not worthwhile (again I could be wrong here). My strong bias is to start with tighter limits and only loosen them if a clearly defined need is presented.
The more I think about this, it seems that any value greater than a day or so, basically means no timeout. Certainly nobody would want to set a timeout to something like 5 years and really expect that the client and service will both be up and running in a paused state until it returns a value before the 5y deadline.
Yeah I think that's probably accurate, I'm not sure what the right upper limit is here, but lower is probably better as a starting point.
So documenting some arbitrary large max value is really meaningless IMO.
There's a very strong precedent in Kubernetes API review to prevent unbounded values. Even if we were to push forward without an upper limit here, I can't imagine this passing the final level of API review before v0.8.0. Both @thockin and @khenidak have strongly pushed to avoid any unbounded values in this API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two things:
-
String-encoded durations is not the norm in kube APIs, and I think we should question whether that's the right way to represent durations. The norm is
int32
with the units encoded in the name - e.g.timeoutSeconds int32
, and the unit is almost always seconds. -
It's not really unbounded if it is a float32 - there is a natural bound.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good point @thockin. I had thought of something like timeoutSeconds int32
as well, it would certainly simplify the validation initially. A concern I'd have is that some users will want to be able to express more precise timeouts, for example GCP and Envoy allow you to express timeouts in terms of nanoseconds, and most implementations support at least milliseconds as an input. That leaves us with a few options:
- Only support one unit (probably seconds or milliseconds) - this matches the approach of ingress-nginx (seconds), GCP BackendServices (seconds), Kong Ingress (ms), and HAProxy Ingress (ms)
- Support multiple units, combining the values (
timeout.seconds: 1
+timeout.nanoseconds: 5000000
==1.005s
). This roughly matches Envoy and GCP URLMaps which both appear to use proto duration. - Support something like
metav1.Duration
as proposed by this GEP. This matches Istio, Linkerd, Contour, and many of the underlying dataplanes like NGINX and HAProxy
So with all of that said, we could start with 1 and leave room for 2. For example, that future could look like this:
GEP Today:
timeouts:
request: 10s
GEP Tomorrow:
timeouts:
request:
seconds: 10
GEP Next Year:
timeouts:
request:
seconds: 10
nanoseconds: 500000
I'm not entirely sure that I prefer that over the more open-ended duration format, but it would be simpler to validate and more in line with other Kubernetes APIs. Does this make sense to you @thockin?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't mind allowing 0, but in YAML it will have to be "0"
because the field will be transported as a string. 0s
does not suffer that problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can leave out unary + but keep unary -.
I would strongly prefer to leave that out from valid Gateway API values. Although we can define a broader type that allows negative values,
I'm talking about the difference between "Invalid value: timeout: must match regex ...." and "Invalid value: timeout: must be greater than or equal to 0ms"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't mind allowing 0, but in YAML it will have to be "0" because the field will be transported as a string. 0s does not suffer that problem.
Also not that comparing 0s
to 0ms
to 0h
must succeed, as must "1000ms" and 1s
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm talking about the difference between "Invalid value: timeout: must match regex ...." and "Invalid value: timeout: must be greater than or equal to 0ms"
Unfortunately I don't know of any way to do that with CRD validation. We could do that with webhook validation, but that is far less reliable so I'd rather cover as much as possible directly with validation defined on the CRD.
Also not that comparing 0s to 0ms to 0h must succeed, as must "1000ms" and 1s
Agreed, we should likely have some conformance tests to confirm that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Catching up here, this is quite the rock to turn over!
I'm supportive of using a string based definition that's a subset of metav1.Duration
, however we do it, and also agree with ensuring that we have conformance tests that validate as much behavior as possible.
@@ -413,8 +417,14 @@ type HTTPRouteTimeouts struct { | |||
// may result in more than one call from the gateway to the destination backend service, | |||
// for example, if automatic retries are supported. | |||
// | |||
// Because the Request timeout encompasses the BackendRequest timeout, | |||
// the value of BackendRequest defaults to and must be <= the value of Request timeout. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think defaulting BankendRequest to the value or Request is both easily implementable and a sensible approach. See my related comment here: https://github.com/kubernetes-sigs/gateway-api/pull/2013/files#r1245517080
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Commented in that thread first, copying over here for additional visibility:
Unfortunately we don't have a means to implement variable defaulting unless we want to create and maintain a defaulting webhook, and I'd strongly prefer to avoid that maintenance cost.
We could suggest that if a value is not set here, implementations should consider this the same value as the overall request timeout, but I can't think of any way to validate that, and I'm also not sure that's what we want. Let's say that an implementation has a default of 2 retries with a 10s per-try timeout. If a user configures an overall request timeout of 20s, I think they'd probably prefer the implementation default behavior here instead of that 1 value actually affecting the per-try timeout value as well. In general I'd like to suggest working towards the principle of least surprise, and I think that means avoiding a variable default like this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't you set static defaults through CRD, like +default=
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, static would be easier here, but what had been suggested here in the original GEP was that if this field was not specified but the overall request timeout was specified, this field should default to the value of the overall request timeout. I don't think we can/should do that. Also since this is a new field and many implementations already have default timeouts, I don't think we can add any static timeout in a backwards compatible way.
I am sure you can make a regexp that matches the Go duration syntax? We
just happen to use a webhook, there are likely better options.
…On Mon, Jul 3, 2023 at 6:58 AM Frank Budinsky ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In geps/gep-1742.md
<#2155 (comment)>
:
> @@ -364,7 +364,7 @@ sequenceDiagram
Both timeout fields are string duration values as specified by
[Golang time.ParseDuration](https://pkg.go.dev/time#ParseDuration) and MUST be >= 1ms
-or 0 to disable (no timeout).
+and <= 100y.
Istio uses protobuf to define all its APIs. Here's a duration field
<https://github.com/istio/api/blob/f930ebfda37a76446f229bf7e465bbbf305501af/networking/v1beta1/virtual_service.proto#L618>
example. In the generated GO binding that translates to
durationpb.Duration
<https://pkg.go.dev/google.golang.org/protobuf/types/known/durationpb>,
similar API to metav1.Duration.
—
Reply to this email directly, view it on GitHub
<#2155 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAEYGXKXPJ2SBBWVOTUJEGLXOLFZJANCNFSM6AAAAAAZXOR5BE>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
@robscott: GitHub didn't allow me to request PR reviews from the following users: srodi. Note that only kubernetes-sigs members and repo collaborators can review this PR, and authors cannot review their own PRs. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
@@ -403,13 +401,22 @@ type HTTPRouteTimeouts struct { | |||
// request stream has been received instead of immediately after the transaction is | |||
// initiated by the client. | |||
// | |||
// This value follows a subset of Golang and XML Duration formatting, for example 1m, 1s, | |||
// or 1ms. This value MUST BE >= 0 and <= 9999h. When "0" is specified, it indicates that |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hour is not an accepted value in your kubebuilder validation pattern. 9999h is 599940m, so if you want to allow that amount, you would need to change the pattern to allow 6 digits.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, thanks! This is an artifact of when I originally had hours as an allowed unit, will update pending the conversation above about whether or not we should add that back.
// When this field is unspecified, request timeout behavior is implementation-dependent. | ||
// | ||
// Support: Extended | ||
// | ||
// +optional | ||
// +kubebuilder:validation:Format=duration | ||
Request *metav1.Duration `json:"request,omitempty"` | ||
// +kubebuilder:validation:Pattern=^(0|[1-9][0-9]{0,3}(m|s|ms))$ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
9999m is not long enough if you were talking about a max of 1 year. Also, I suppose there could be websockets that require longer than 9999m. If not now, then in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you are using websockets I would imagine you want infinite?
I cannot imagine a scenario where you want a timeout over 9999m but not infinite? Is there a use case I am not aware of?
I suppose if you don't want to allow super-long streams..
// +kubebuilder:validation:Format=duration | ||
Request *metav1.Duration `json:"request,omitempty"` | ||
// +kubebuilder:validation:Pattern=^(0|[1-9][0-9]{0,3}(m|s|ms))$ | ||
Request *metav1.Duration `json:"request,omitempty"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Request *metav1.Duration `json:"request,omitempty"` | |
Request *metav1.Duration `json:"request,omitempty"` |
@@ -419,14 +426,22 @@ type HTTPRouteTimeouts struct { | |||
// may result in more than one call from the gateway to the destination backend service, | |||
// for example, if automatic retries are supported. | |||
// | |||
// Because the Request timeout encompasses the BackendRequest timeout, | |||
// the value of BackendRequest defaults to and must be <= the value of Request timeout. | |||
// This value follows a subset of Golang and XML Duration formatting, for example 1m, 1s, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This applies to text above this line. Can you add an explanation about whether the Request time encompasses/includes the BackendRequest time? Is Request = BackendRequest + other processing time, or is total time = Request + BackendRequest, implying that Request does NOT include the BackendRequest time?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @candita. This change should only remove the value <= part. The doc still needs some words to be clear that the the Request timeout encoumpasses the BackendRequest timeout.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another thought. I understand that we can't enforce that BackendRequest timeout must be <= Request timeout
, but maybe we could/should still say something like this?
// This value follows a subset of Golang and XML Duration formatting, for example 1m, 1s, | |
// Because the Request timeout encompasses the BackendRequest timeout, | |
// the value of BackendRequest should be <= the value of Request timeout to have any effect. | |
// This value follows a subset of Golang Duration formatting, for example 1m, 1s, |
If we open this can of worms we will end up with a "standard" type, which
can eventually use all the declarative validation. Before we lock this
down, we should think about how this duration-as-string can be used in CEL.
…On Tue, Jul 18, 2023, 12:08 PM Rob Scott ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In geps/gep-1742.md
<#2155 (comment)>
:
> @@ -364,7 +364,7 @@ sequenceDiagram
Both timeout fields are string duration values as specified by
[Golang time.ParseDuration](https://pkg.go.dev/time#ParseDuration) and MUST be >= 1ms
-or 0 to disable (no timeout).
+and <= 100y.
I'm talking about the difference between "Invalid value: timeout: must
match regex ...." and "Invalid value: timeout: must be greater than or
equal to 0ms"
Unfortunately I don't know of any way to do that with CRD validation. We
could do that with webhook validation, but that is far less reliable so I'd
rather cover as much as possible directly with validation defined on the
CRD.
Also not that comparing 0s to 0ms to 0h must succeed, as must "1000ms" and
1s
Agreed, we should likely have some conformance tests to confirm that.
—
Reply to this email directly, view it on GitHub
<#2155 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ABKWAVD3GI6A6UIQBY4AKYTXQ3NKTANCNFSM6AAAAAAZXOR5BE>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
Any field in the openAPI schema of a CRD can |
You can also cast a string to a duration in CEL: |
Is there a spec for strings that can be casted to duration? |
Soo, I'm not thrilled with the current situation, but here it is: From the CEL spec: "Duration strings should support the following suffixes: "h" (hour), "m" (minute), "s" (second), "ms" (millisecond), "us" (microsecond), and "ns" (nanosecond). Duration strings may be zero, negative, fractional, and/or compound. Examples: "0", "-1.5h", "1m6s"" (I believe "µs" supported by the implementation since I believe it uses the golang impl under the hood) For comparison: kube-openapi: https://github.com/kubernetes/kube-openapi/blob/master/pkg/validation/strfmt/duration.go#L33-L42 (supports "d" (day) and "w" or "wk" (week) in addition to what is supported by golang impl-- it calls down to the golang impl except for these cases) Golang: https://pkg.go.dev/time#ParseDuration (supports "µs" as an alternative to "us") I'm not aware of a recognized standard for any of this.. |
This issue is proposing a kube standard for fields which represent duration.
We're trying to keep it small and easy, since it will require client-side
parsing to use (meaning we can't ever change it, really).
We described a subset of CEL and Go - no fractional or compound, and noting
smaller than ms or larger than h.
No literal 0, because YAML is too pervasive and will coerce to int.
Fractional would mean allowing things like "0.001us" and defining
floating-point precision.
Compound is just harder to spec, parse, and test.
All of those are potentially debatable.
As far as I can see, we don't have any actual duration-string fields in GA
APIs, yet. Tell me if I am wrong there?
I don't care if a CEL expression allows a broader set for casting, as long
as none of our APIs allow that to be saved. We would need to be careful
that CEL emits values in our format (e.g. always 90s and never 1m30s),
which could be able argument for adopting CEL's spec directly.
I think we need to either own our own fate or else ban these strings.
…On Tue, Jul 18, 2023, 3:47 PM Joe Betz ***@***.***> wrote:
Is there a spec for strings that can be casted to duration?
Soo, I'm not thrilled with the current situation, but here it is:
From the CEL spec
<https://github.com/google/cel-spec/blob/master/doc/langdef.md>:
"Duration strings should support the following suffixes: "h" (hour), "m"
(minute), "s" (second), "ms" (millisecond), "us" (microsecond), and "ns"
(nanosecond). Duration strings may be zero, negative, fractional, and/or
compound. Examples: "0", "-1.5h", "1m6s""
For comparison:
openapi:
https://github.com/kubernetes/kube-openapi/blob/master/pkg/validation/strfmt/duration.go#L33-L42
(also supports "d" (day), "w" or "wk" (week)
Golang: https://pkg.go.dev/time#ParseDuration (almost identical but also
supports "µs" as an alternative to "us")
I'm not aware of a recognized standard for any of this..
—
Reply to this email directly, view it on GitHub
<#2155 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ABKWAVGUPBQUBZYQDK5PPRLXQ4HA3ANCNFSM6AAAAAAZXOR5BE>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
@robscott What's the latest status on this duration format decision? I've added conformance tests for the new timeout features assuming we're going with the restricted string format: #2254 If this is going to continue to block the timeout features, maybe we should just switch the fields to uint32 millisecond values and live with the less nice API? |
@frankbu sorry this has taken so long! I think there's a rough consensus that we should aim for a format that is a subset of both OpenAPI/CEL and Go duration formats. That would include the following limitations:
The remaining problem is if we can separate validation of formatting from min/max values. CEL should enable us to do that, but first we need to add CEL (x-ref #2253 and others) |
This one doesn't really sound necessary to me. I understand that |
I think a good first step would be to try to convince the openapi layer to validate fields of a specific type against the format we want. I don't know where to do that, but I think we should shoot for a primitive field type. E.g. And if we think CEL is the path forward, maybe just adopting CEL's format is best? |
@thockin, your suggestion of maybe starting with CEL's format sounded pretty sane to me. I spent some time yesterday looking over code for CEL and other extant formats, then spent some time tl;dr: Let's use the CEL duration type, restricted by a regex of Keep reading for all the details. I found four different extant formats that seem obviously worth considering. (This starts out very Go-centric because Go is the implementation language for Kubernetes itself. Bear with me.)
|
@kflynn your suggestion SGTM! Just wondering why you think we need compound units (i.e., something like |
@frankbu There are two reasons:
|
Personally, I think not supporting floating point is the much bigger chance of people getting burned with cut-and-paste. I think values like
True, but once your talking more than a couple of hours (easy math), what's the chance that you still need finer granularity than an hour, e.g., Anyway, I'd be inclined to go with the more restrictive format, but it's not that big a deal either way IMO. |
@kflynn Maybe allowing 1 digit of decimal precision would be a good compromise, i.e. |
@frankbu Yeah, that is definitely my worry about floats. 😐 I'm more prepared to defend that, because floats are pretty awful under the hood, so I'm more invested in ditching it. (Even So: the proposal in GEP-2257 doesn't support floating-point at all, and I'd be delighted to land that and update GEP-1742 to use it... and if the screams about floating-point being gone are load, well, it's Experimental, we can figure out how to handle it. 🙂 |
Hey @robscott, now that #2258 has merged, are you planning to update this PR to align with the new duration format? If you're too busy and would like to close this one, I can take a pass at a new PR to do it and fix the other GEP-1742 issues. As soon as we get this done, we will also update #2013 so that it can hopefully get merged. |
@robscott: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
What type of PR is this?
/kind cleanup
/kind gep
What this PR does / why we need it:
While reviewing the implementation of this GEP, I realized that the upper and lower bounds of these timeout values were not what I expected. This makes the following changes:
1. Introduces an upper bound of 10 years
In Kubernetes APIs we always try to have a clear upper and lower bound for fields. This enables implementations to write tests that ensure they can support the min and max values for any field in the API. As far as I can tell, 10 years is lower than all implementations support, but still likely sufficiently high to not be a limiting factor for anyone. Note that our API versioning guidelines allow us to loosen validation in future releases, but we try to avoid tightening validation wherever possible. This means that we should aim to start with the tightest validation that is reasonable for a feature.
2. Removes the possibility of setting 0 to mean "no timeout"
Although both HAProxy and Envoy support 0 as "no timeout", I'm not sure how broadly supportable this concept of an "infinite" timeout would be beyond those two dataplanes. If 0 is really just intended to represent "max timeout", why not define a max timeout supported by the API and suggest users set that if that's really what they want? That feels both more specific and less likely to surprise users. If we ever determine that we really do need a "0" value here, we can loosen validation in the future to support it, but once we've allowed that value it's nearly impossible to undo that.
3. Clarifies that values should be rounded up to the closest value an implementation can support
There are likely some implementations that are not going to be able to support the level of granularity supported by Golang's concept of duration. This provides guidance to those implementations for how they should handle this. We likely also will want to add something to status for this in the future, but there's not an obvious condition to attach that to yet.
Does this PR introduce a user-facing change?:
/cc @frankbu @SRodi @candita @shaneutt @howardjohn