Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Updating GEP-1742 to clarify timeout formatting #2155

Closed
wants to merge 1 commit into from
Closed
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
37 changes: 26 additions & 11 deletions geps/gep-1742.md
Original file line number Diff line number Diff line change
Expand Up @@ -206,7 +206,7 @@ sequenceDiagram
participant C as Client
participant P as Proxy
participant U as Upstream

C->>P: Connection Started
activate U
activate C
Expand All @@ -224,7 +224,7 @@ sequenceDiagram
activate U
note left of U: timeout queue<br/>(wait for available server)
deactivate U

P->>U: Connection Started
activate U
P->>U: Starts sending Request
Expand Down Expand Up @@ -370,7 +370,7 @@ sequenceDiagram

Both timeout fields are string duration values as specified by
[Golang time.ParseDuration](https://pkg.go.dev/time#ParseDuration) and MUST be >= 1ms
or 0 to disable (no timeout).
and <= 100y.
Copy link
Contributor

@frankbu frankbu Jun 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added some comments related to this in the API PR (https://github.com/kubernetes-sigs/gateway-api/pull/2013/files#r1245498559), but thinking about this more, I really don't see the value of specifying an upper limit.

The more I think about this, it seems that any value greater than a day or so, basically means no timeout. Certainly nobody would want to set a timeout to something like 5 years and really expect that the client and service will both be up and running in a paused state until it returns a value before the 5y deadline.

Unless I'm missing something, a user will want to configure the timeout to something relatively short, and any larger value really just mean "don't timeout", i.e., just hang until the client connection is terminated. So that's why I think the current design is right. User can set the timeout to some small value or 0 to disable timeouts. A user can also chose to set a pretty huge value which the implementation will likely support (up to the implementation limit of probably an int64 max value ns of ~300 years), but any specific number they chose that is large will work the same as 0 for all practical purposes. So documenting some arbitrary large max value is really meaningless IMO.

Copy link
Contributor

@frankbu frankbu Jun 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One more related thought. If we want to say anything about max value it could be something like the maximum value is implementation dependent, but MUST BE at least 1y, or something like that.

That way the user can set it to any sensible value they'd like (>= 1ms) or 0 for infinite (don't timeout). Note that an implementation can implement this infinite (0) timeout by setting the timeout to it's internal max value which will have the same effect given that it's 1y or more.

@robscott btw, you said:

In Kubernetes APIs we always try to have a clear upper and lower bound for fields. This enables implementations to write tests that ensure they can support the min and max values for any field in the API.

How would an implementation test that a 10 year timeout works?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How would an implementation test that a 10 year timeout works?

Yeah, you're right, it's not really possible to write a test that waits this long, and any value that we can feasibly test is likely lower than at least some users would want. I think the real win here is that implementations can know that they support the full range of possible values allowed by the API just by understanding their dataplane.

If we want to say anything about max value it could be something like the maximum value is implementation dependent, but MUST BE at least 1y, or something like that.

I think that's reasonable, but I'm struggling to think of any use case that would benefit from > 1y, so distinguishing beyond that is likely not worthwhile (again I could be wrong here). My strong bias is to start with tighter limits and only loosen them if a clearly defined need is presented.

The more I think about this, it seems that any value greater than a day or so, basically means no timeout. Certainly nobody would want to set a timeout to something like 5 years and really expect that the client and service will both be up and running in a paused state until it returns a value before the 5y deadline.

Yeah I think that's probably accurate, I'm not sure what the right upper limit is here, but lower is probably better as a starting point.

So documenting some arbitrary large max value is really meaningless IMO.

There's a very strong precedent in Kubernetes API review to prevent unbounded values. Even if we were to push forward without an upper limit here, I can't imagine this passing the final level of API review before v0.8.0. Both @thockin and @khenidak have strongly pushed to avoid any unbounded values in this API.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two things:

  1. String-encoded durations is not the norm in kube APIs, and I think we should question whether that's the right way to represent durations. The norm is int32 with the units encoded in the name - e.g. timeoutSeconds int32, and the unit is almost always seconds.

  2. It's not really unbounded if it is a float32 - there is a natural bound.

Copy link
Member Author

@robscott robscott Jun 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point @thockin. I had thought of something like timeoutSeconds int32 as well, it would certainly simplify the validation initially. A concern I'd have is that some users will want to be able to express more precise timeouts, for example GCP and Envoy allow you to express timeouts in terms of nanoseconds, and most implementations support at least milliseconds as an input. That leaves us with a few options:

  1. Only support one unit (probably seconds or milliseconds) - this matches the approach of ingress-nginx (seconds), GCP BackendServices (seconds), Kong Ingress (ms), and HAProxy Ingress (ms)
  2. Support multiple units, combining the values (timeout.seconds: 1 + timeout.nanoseconds: 5000000 == 1.005s). This roughly matches Envoy and GCP URLMaps which both appear to use proto duration.
  3. Support something like metav1.Duration as proposed by this GEP. This matches Istio, Linkerd, Contour, and many of the underlying dataplanes like NGINX and HAProxy

So with all of that said, we could start with 1 and leave room for 2. For example, that future could look like this:

GEP Today:

timeouts:
  request: 10s

GEP Tomorrow:

timeouts:
  request:
    seconds: 10

GEP Next Year:

timeouts:
  request:
    seconds: 10
    nanoseconds: 500000

I'm not entirely sure that I prefer that over the more open-ended duration format, but it would be simpler to validate and more in line with other Kubernetes APIs. Does this make sense to you @thockin?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't mind allowing 0, but in YAML it will have to be "0" because the field will be transported as a string. 0s does not suffer that problem.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can leave out unary + but keep unary -.

I would strongly prefer to leave that out from valid Gateway API values. Although we can define a broader type that allows negative values,

I'm talking about the difference between "Invalid value: timeout: must match regex ...." and "Invalid value: timeout: must be greater than or equal to 0ms"

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't mind allowing 0, but in YAML it will have to be "0" because the field will be transported as a string. 0s does not suffer that problem.

Also not that comparing 0s to 0ms to 0h must succeed, as must "1000ms" and 1s

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm talking about the difference between "Invalid value: timeout: must match regex ...." and "Invalid value: timeout: must be greater than or equal to 0ms"

Unfortunately I don't know of any way to do that with CRD validation. We could do that with webhook validation, but that is far less reliable so I'd rather cover as much as possible directly with validation defined on the CRD.

Also not that comparing 0s to 0ms to 0h must succeed, as must "1000ms" and 1s

Agreed, we should likely have some conformance tests to confirm that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Catching up here, this is quite the rock to turn over!

I'm supportive of using a string based definition that's a subset of metav1.Duration, however we do it, and also agree with ensuring that we have conformance tests that validate as much behavior as possible.


### GO

Expand All @@ -388,8 +388,6 @@ type HTTPRouteRule struct {
}

// HTTPRouteTimeouts defines timeouts that can be configured for an HTTPRoute.
// Timeout values are formatted like 1h/1m/1s/1ms as parsed by Golang time.ParseDuration
// and MUST BE >= 1ms or 0 to disable (no timeout).
type HTTPRouteTimeouts struct {
// Request specifies the duration for processing an HTTP client request after which the
// gateway will time out if unable to send a response.
Expand All @@ -403,13 +401,22 @@ type HTTPRouteTimeouts struct {
// request stream has been received instead of immediately after the transaction is
// initiated by the client.
//
// This value follows a subset of Golang and XML Duration formatting, for example 1m, 1s,
// or 1ms. This value MUST BE >= 0 and <= 9999h. When "0" is specified, it indicates that
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hour is not an accepted value in your kubebuilder validation pattern. 9999h is 599940m, so if you want to allow that amount, you would need to change the pattern to allow 6 digits.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, thanks! This is an artifact of when I originally had hours as an allowed unit, will update pending the conversation above about whether or not we should add that back.

// no timeout should be implemented. If an implementation cannot support the granularity
// specified, it MUST round up to the closest value it supports. For example, if 1ms is
// specified, and an implementation can only configure timeouts in seconds, it MUST configure
// 1 second as the timeout. In the special case of a "0" timeout, the implementation SHOULD
// NOT implement a timeout. If that is not possible it MUST configure the maximum timeout
// value it supports.
//
// When this field is unspecified, request timeout behavior is implementation-dependent.
//
// Support: Extended
//
// +optional
// +kubebuilder:validation:Format=duration
Request *metav1.Duration `json:"request,omitempty"`
// +kubebuilder:validation:Pattern=^(0|[1-9][0-9]{0,3}(m|s|ms))$
Copy link
Contributor

@candita candita Jul 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

9999m is not long enough if you were talking about a max of 1 year. Also, I suppose there could be websockets that require longer than 9999m. If not now, then in the future.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you are using websockets I would imagine you want infinite?

I cannot imagine a scenario where you want a timeout over 9999m but not infinite? Is there a use case I am not aware of?

I suppose if you don't want to allow super-long streams..

Request *metav1.Duration `json:"request,omitempty"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Request *metav1.Duration `json:"request,omitempty"`
Request *metav1.Duration `json:"request,omitempty"`


// BackendRequest specifies a timeout for an individual request from the gateway
// to a backend service. This covers the time from when the request first starts being
Expand All @@ -419,14 +426,22 @@ type HTTPRouteTimeouts struct {
// may result in more than one call from the gateway to the destination backend service,
// for example, if automatic retries are supported.
//
// Because the Request timeout encompasses the BackendRequest timeout,
// the value of BackendRequest defaults to and must be <= the value of Request timeout.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think defaulting BankendRequest to the value or Request is both easily implementable and a sensible approach. See my related comment here: https://github.com/kubernetes-sigs/gateway-api/pull/2013/files#r1245517080

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commented in that thread first, copying over here for additional visibility:

Unfortunately we don't have a means to implement variable defaulting unless we want to create and maintain a defaulting webhook, and I'd strongly prefer to avoid that maintenance cost.

We could suggest that if a value is not set here, implementations should consider this the same value as the overall request timeout, but I can't think of any way to validate that, and I'm also not sure that's what we want. Let's say that an implementation has a default of 2 retries with a 10s per-try timeout. If a user configures an overall request timeout of 20s, I think they'd probably prefer the implementation default behavior here instead of that 1 value actually affecting the per-try timeout value as well. In general I'd like to suggest working towards the principle of least surprise, and I think that means avoiding a variable default like this.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't you set static defaults through CRD, like +default= ?

Copy link
Member Author

@robscott robscott Jun 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, static would be easier here, but what had been suggested here in the original GEP was that if this field was not specified but the overall request timeout was specified, this field should default to the value of the overall request timeout. I don't think we can/should do that. Also since this is a new field and many implementations already have default timeouts, I don't think we can add any static timeout in a backwards compatible way.

// This value follows a subset of Golang and XML Duration formatting, for example 1m, 1s,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This applies to text above this line. Can you add an explanation about whether the Request time encompasses/includes the BackendRequest time? Is Request = BackendRequest + other processing time, or is total time = Request + BackendRequest, implying that Request does NOT include the BackendRequest time?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @candita. This change should only remove the value <= part. The doc still needs some words to be clear that the the Request timeout encoumpasses the BackendRequest timeout.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another thought. I understand that we can't enforce that BackendRequest timeout must be <= Request timeout, but maybe we could/should still say something like this?

Suggested change
// This value follows a subset of Golang and XML Duration formatting, for example 1m, 1s,
// Because the Request timeout encompasses the BackendRequest timeout,
// the value of BackendRequest should be <= the value of Request timeout to have any effect.
// This value follows a subset of Golang Duration formatting, for example 1m, 1s,

// or 1ms. This value MUST BE >= 0 and <= 9999h. When "0" is specified, it indicates that
// no timeout should be implemented. If an implementation cannot support the granularity
// specified, it MUST round up to the closest value it supports. For example, if 1ms is
// specified, and an implementation can only configure timeouts in seconds, it MUST configure
// 1 second as the timeout. In the special case of a "0" timeout, the implementation SHOULD
// NOT implement a timeout. If that is not possible it MUST configure the maximum timeout
// value it supports.
//
// When this field is unspecified, request timeout behavior is implementation-dependent.
//
// Support: Extended
//
// +optional
// +kubebuilder:validation:Format=duration
BackendRequest *metav1.Duration `json:"backendRequest,omitempty"`
// +kubebuilder:validation:Pattern=^(0|[1-9][0-9]{0,3}(m|s|ms))$
BackendRequest *string `json:"backendRequest,omitempty"`
}
```

Expand Down