Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Updating GEP-1742 to clarify timeout formatting #2155

Closed
wants to merge 1 commit into from

Conversation

robscott
Copy link
Member

What type of PR is this?
/kind cleanup
/kind gep

What this PR does / why we need it:
While reviewing the implementation of this GEP, I realized that the upper and lower bounds of these timeout values were not what I expected. This makes the following changes:

1. Introduces an upper bound of 10 years
In Kubernetes APIs we always try to have a clear upper and lower bound for fields. This enables implementations to write tests that ensure they can support the min and max values for any field in the API. As far as I can tell, 10 years is lower than all implementations support, but still likely sufficiently high to not be a limiting factor for anyone. Note that our API versioning guidelines allow us to loosen validation in future releases, but we try to avoid tightening validation wherever possible. This means that we should aim to start with the tightest validation that is reasonable for a feature.

2. Removes the possibility of setting 0 to mean "no timeout"
Although both HAProxy and Envoy support 0 as "no timeout", I'm not sure how broadly supportable this concept of an "infinite" timeout would be beyond those two dataplanes. If 0 is really just intended to represent "max timeout", why not define a max timeout supported by the API and suggest users set that if that's really what they want? That feels both more specific and less likely to surprise users. If we ever determine that we really do need a "0" value here, we can loosen validation in the future to support it, but once we've allowed that value it's nearly impossible to undo that.

3. Clarifies that values should be rounded up to the closest value an implementation can support
There are likely some implementations that are not going to be able to support the level of granularity supported by Golang's concept of duration. This provides guidance to those implementations for how they should handle this. We likely also will want to add something to status for this in the future, but there's not an obvious condition to attach that to yet.

Does this PR introduce a user-facing change?:

NONE

/cc @frankbu @SRodi @candita @shaneutt @howardjohn

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. labels Jun 28, 2023
@k8s-ci-robot k8s-ci-robot requested a review from shaneutt June 28, 2023 17:47
@k8s-ci-robot k8s-ci-robot added the kind/gep PRs related to Gateway Enhancement Proposal(GEP) label Jun 28, 2023
@k8s-ci-robot k8s-ci-robot requested a review from howardjohn June 28, 2023 17:47
@k8s-ci-robot
Copy link
Contributor

@robscott: GitHub didn't allow me to request PR reviews from the following users: frankbu, srodi, candita.

Note that only kubernetes-sigs members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

What type of PR is this?
/kind cleanup
/kind gep

What this PR does / why we need it:
While reviewing the implementation of this GEP, I realized that the upper and lower bounds of these timeout values were not what I expected. This makes the following changes:

1. Introduces an upper bound of 10 years
In Kubernetes APIs we always try to have a clear upper and lower bound for fields. This enables implementations to write tests that ensure they can support the min and max values for any field in the API. As far as I can tell, 10 years is lower than all implementations support, but still likely sufficiently high to not be a limiting factor for anyone. Note that our API versioning guidelines allow us to loosen validation in future releases, but we try to avoid tightening validation wherever possible. This means that we should aim to start with the tightest validation that is reasonable for a feature.

2. Removes the possibility of setting 0 to mean "no timeout"
Although both HAProxy and Envoy support 0 as "no timeout", I'm not sure how broadly supportable this concept of an "infinite" timeout would be beyond those two dataplanes. If 0 is really just intended to represent "max timeout", why not define a max timeout supported by the API and suggest users set that if that's really what they want? That feels both more specific and less likely to surprise users. If we ever determine that we really do need a "0" value here, we can loosen validation in the future to support it, but once we've allowed that value it's nearly impossible to undo that.

3. Clarifies that values should be rounded up to the closest value an implementation can support
There are likely some implementations that are not going to be able to support the level of granularity supported by Golang's concept of duration. This provides guidance to those implementations for how they should handle this. We likely also will want to add something to status for this in the future, but there's not an obvious condition to attach that to yet.

Does this PR introduce a user-facing change?:

NONE

/cc @frankbu @SRodi @candita @shaneutt @howardjohn

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jun 28, 2023
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: robscott

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jun 28, 2023
@robscott
Copy link
Member Author

Adding a hold for consensus.

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 28, 2023
@robscott robscott mentioned this pull request Jun 28, 2023
@robscott
Copy link
Member Author

/retest

@@ -364,7 +364,7 @@ sequenceDiagram

Both timeout fields are string duration values as specified by
[Golang time.ParseDuration](https://pkg.go.dev/time#ParseDuration) and MUST be >= 1ms
or 0 to disable (no timeout).
and <= 100y.
Copy link
Contributor

@frankbu frankbu Jun 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added some comments related to this in the API PR (https://github.com/kubernetes-sigs/gateway-api/pull/2013/files#r1245498559), but thinking about this more, I really don't see the value of specifying an upper limit.

The more I think about this, it seems that any value greater than a day or so, basically means no timeout. Certainly nobody would want to set a timeout to something like 5 years and really expect that the client and service will both be up and running in a paused state until it returns a value before the 5y deadline.

Unless I'm missing something, a user will want to configure the timeout to something relatively short, and any larger value really just mean "don't timeout", i.e., just hang until the client connection is terminated. So that's why I think the current design is right. User can set the timeout to some small value or 0 to disable timeouts. A user can also chose to set a pretty huge value which the implementation will likely support (up to the implementation limit of probably an int64 max value ns of ~300 years), but any specific number they chose that is large will work the same as 0 for all practical purposes. So documenting some arbitrary large max value is really meaningless IMO.

Copy link
Contributor

@frankbu frankbu Jun 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One more related thought. If we want to say anything about max value it could be something like the maximum value is implementation dependent, but MUST BE at least 1y, or something like that.

That way the user can set it to any sensible value they'd like (>= 1ms) or 0 for infinite (don't timeout). Note that an implementation can implement this infinite (0) timeout by setting the timeout to it's internal max value which will have the same effect given that it's 1y or more.

@robscott btw, you said:

In Kubernetes APIs we always try to have a clear upper and lower bound for fields. This enables implementations to write tests that ensure they can support the min and max values for any field in the API.

How would an implementation test that a 10 year timeout works?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How would an implementation test that a 10 year timeout works?

Yeah, you're right, it's not really possible to write a test that waits this long, and any value that we can feasibly test is likely lower than at least some users would want. I think the real win here is that implementations can know that they support the full range of possible values allowed by the API just by understanding their dataplane.

If we want to say anything about max value it could be something like the maximum value is implementation dependent, but MUST BE at least 1y, or something like that.

I think that's reasonable, but I'm struggling to think of any use case that would benefit from > 1y, so distinguishing beyond that is likely not worthwhile (again I could be wrong here). My strong bias is to start with tighter limits and only loosen them if a clearly defined need is presented.

The more I think about this, it seems that any value greater than a day or so, basically means no timeout. Certainly nobody would want to set a timeout to something like 5 years and really expect that the client and service will both be up and running in a paused state until it returns a value before the 5y deadline.

Yeah I think that's probably accurate, I'm not sure what the right upper limit is here, but lower is probably better as a starting point.

So documenting some arbitrary large max value is really meaningless IMO.

There's a very strong precedent in Kubernetes API review to prevent unbounded values. Even if we were to push forward without an upper limit here, I can't imagine this passing the final level of API review before v0.8.0. Both @thockin and @khenidak have strongly pushed to avoid any unbounded values in this API.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two things:

  1. String-encoded durations is not the norm in kube APIs, and I think we should question whether that's the right way to represent durations. The norm is int32 with the units encoded in the name - e.g. timeoutSeconds int32, and the unit is almost always seconds.

  2. It's not really unbounded if it is a float32 - there is a natural bound.

Copy link
Member Author

@robscott robscott Jun 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point @thockin. I had thought of something like timeoutSeconds int32 as well, it would certainly simplify the validation initially. A concern I'd have is that some users will want to be able to express more precise timeouts, for example GCP and Envoy allow you to express timeouts in terms of nanoseconds, and most implementations support at least milliseconds as an input. That leaves us with a few options:

  1. Only support one unit (probably seconds or milliseconds) - this matches the approach of ingress-nginx (seconds), GCP BackendServices (seconds), Kong Ingress (ms), and HAProxy Ingress (ms)
  2. Support multiple units, combining the values (timeout.seconds: 1 + timeout.nanoseconds: 5000000 == 1.005s). This roughly matches Envoy and GCP URLMaps which both appear to use proto duration.
  3. Support something like metav1.Duration as proposed by this GEP. This matches Istio, Linkerd, Contour, and many of the underlying dataplanes like NGINX and HAProxy

So with all of that said, we could start with 1 and leave room for 2. For example, that future could look like this:

GEP Today:

timeouts:
  request: 10s

GEP Tomorrow:

timeouts:
  request:
    seconds: 10

GEP Next Year:

timeouts:
  request:
    seconds: 10
    nanoseconds: 500000

I'm not entirely sure that I prefer that over the more open-ended duration format, but it would be simpler to validate and more in line with other Kubernetes APIs. Does this make sense to you @thockin?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't mind allowing 0, but in YAML it will have to be "0" because the field will be transported as a string. 0s does not suffer that problem.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can leave out unary + but keep unary -.

I would strongly prefer to leave that out from valid Gateway API values. Although we can define a broader type that allows negative values,

I'm talking about the difference between "Invalid value: timeout: must match regex ...." and "Invalid value: timeout: must be greater than or equal to 0ms"

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't mind allowing 0, but in YAML it will have to be "0" because the field will be transported as a string. 0s does not suffer that problem.

Also not that comparing 0s to 0ms to 0h must succeed, as must "1000ms" and 1s

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm talking about the difference between "Invalid value: timeout: must match regex ...." and "Invalid value: timeout: must be greater than or equal to 0ms"

Unfortunately I don't know of any way to do that with CRD validation. We could do that with webhook validation, but that is far less reliable so I'd rather cover as much as possible directly with validation defined on the CRD.

Also not that comparing 0s to 0ms to 0h must succeed, as must "1000ms" and 1s

Agreed, we should likely have some conformance tests to confirm that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Catching up here, this is quite the rock to turn over!

I'm supportive of using a string based definition that's a subset of metav1.Duration, however we do it, and also agree with ensuring that we have conformance tests that validate as much behavior as possible.

@@ -413,8 +417,14 @@ type HTTPRouteTimeouts struct {
// may result in more than one call from the gateway to the destination backend service,
// for example, if automatic retries are supported.
//
// Because the Request timeout encompasses the BackendRequest timeout,
// the value of BackendRequest defaults to and must be <= the value of Request timeout.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think defaulting BankendRequest to the value or Request is both easily implementable and a sensible approach. See my related comment here: https://github.com/kubernetes-sigs/gateway-api/pull/2013/files#r1245517080

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commented in that thread first, copying over here for additional visibility:

Unfortunately we don't have a means to implement variable defaulting unless we want to create and maintain a defaulting webhook, and I'd strongly prefer to avoid that maintenance cost.

We could suggest that if a value is not set here, implementations should consider this the same value as the overall request timeout, but I can't think of any way to validate that, and I'm also not sure that's what we want. Let's say that an implementation has a default of 2 retries with a 10s per-try timeout. If a user configures an overall request timeout of 20s, I think they'd probably prefer the implementation default behavior here instead of that 1 value actually affecting the per-try timeout value as well. In general I'd like to suggest working towards the principle of least surprise, and I think that means avoiding a variable default like this.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't you set static defaults through CRD, like +default= ?

Copy link
Member Author

@robscott robscott Jun 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, static would be easier here, but what had been suggested here in the original GEP was that if this field was not specified but the overall request timeout was specified, this field should default to the value of the overall request timeout. I don't think we can/should do that. Also since this is a new field and many implementations already have default timeouts, I don't think we can add any static timeout in a backwards compatible way.

@howardjohn
Copy link
Contributor

howardjohn commented Jul 3, 2023 via email

@robscott robscott changed the title Updating GEP-1742 to clarify min and max values Updating GEP-1742 to clarify timeout formatting Jul 10, 2023
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jul 10, 2023
@k8s-ci-robot k8s-ci-robot requested review from thockin and frankbu July 10, 2023 20:02
@k8s-ci-robot
Copy link
Contributor

@robscott: GitHub didn't allow me to request PR reviews from the following users: srodi.

Note that only kubernetes-sigs members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

I've updated this PR to use the approach I've described in the comment above, PTAL.

/cc @frankbu @SRodi @shaneutt @thockin @howardjohn

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@@ -403,13 +401,22 @@ type HTTPRouteTimeouts struct {
// request stream has been received instead of immediately after the transaction is
// initiated by the client.
//
// This value follows a subset of Golang and XML Duration formatting, for example 1m, 1s,
// or 1ms. This value MUST BE >= 0 and <= 9999h. When "0" is specified, it indicates that
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hour is not an accepted value in your kubebuilder validation pattern. 9999h is 599940m, so if you want to allow that amount, you would need to change the pattern to allow 6 digits.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, thanks! This is an artifact of when I originally had hours as an allowed unit, will update pending the conversation above about whether or not we should add that back.

// When this field is unspecified, request timeout behavior is implementation-dependent.
//
// Support: Extended
//
// +optional
// +kubebuilder:validation:Format=duration
Request *metav1.Duration `json:"request,omitempty"`
// +kubebuilder:validation:Pattern=^(0|[1-9][0-9]{0,3}(m|s|ms))$
Copy link
Contributor

@candita candita Jul 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

9999m is not long enough if you were talking about a max of 1 year. Also, I suppose there could be websockets that require longer than 9999m. If not now, then in the future.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you are using websockets I would imagine you want infinite?

I cannot imagine a scenario where you want a timeout over 9999m but not infinite? Is there a use case I am not aware of?

I suppose if you don't want to allow super-long streams..

// +kubebuilder:validation:Format=duration
Request *metav1.Duration `json:"request,omitempty"`
// +kubebuilder:validation:Pattern=^(0|[1-9][0-9]{0,3}(m|s|ms))$
Request *metav1.Duration `json:"request,omitempty"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Request *metav1.Duration `json:"request,omitempty"`
Request *metav1.Duration `json:"request,omitempty"`

@@ -419,14 +426,22 @@ type HTTPRouteTimeouts struct {
// may result in more than one call from the gateway to the destination backend service,
// for example, if automatic retries are supported.
//
// Because the Request timeout encompasses the BackendRequest timeout,
// the value of BackendRequest defaults to and must be <= the value of Request timeout.
// This value follows a subset of Golang and XML Duration formatting, for example 1m, 1s,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This applies to text above this line. Can you add an explanation about whether the Request time encompasses/includes the BackendRequest time? Is Request = BackendRequest + other processing time, or is total time = Request + BackendRequest, implying that Request does NOT include the BackendRequest time?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @candita. This change should only remove the value <= part. The doc still needs some words to be clear that the the Request timeout encoumpasses the BackendRequest timeout.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another thought. I understand that we can't enforce that BackendRequest timeout must be <= Request timeout, but maybe we could/should still say something like this?

Suggested change
// This value follows a subset of Golang and XML Duration formatting, for example 1m, 1s,
// Because the Request timeout encompasses the BackendRequest timeout,
// the value of BackendRequest should be <= the value of Request timeout to have any effect.
// This value follows a subset of Golang Duration formatting, for example 1m, 1s,

@thockin
Copy link

thockin commented Jul 18, 2023 via email

@jpbetz
Copy link

jpbetz commented Jul 18, 2023

If we open this can of worms we will end up with a "standard" type, which can eventually use all the declarative validation. Before we lock this down, we should think about how this duration-as-string can be used in CEL.

Any field in the openAPI schema of a CRD can type: string, format:duration today. In any x-kubernetes-validation strings, the duration is understood by CEL as a duration type making it trivial to validate against min/max limits, and can be used in conjunction with datetime fields to do time math.

xref: https://kubernetes.io/docs/reference/using-api/cel/

@jpbetz
Copy link

jpbetz commented Jul 18, 2023

You can also cast a string to a duration in CEL: duration(self) < duration("1h") for type: string (but without format: duration set) if needed.

@thockin
Copy link

thockin commented Jul 18, 2023

Is there a spec for strings that can be casted to duration?

@jpbetz
Copy link

jpbetz commented Jul 18, 2023

Is there a spec for strings that can be casted to duration?

Soo, I'm not thrilled with the current situation, but here it is:

From the CEL spec: "Duration strings should support the following suffixes: "h" (hour), "m" (minute), "s" (second), "ms" (millisecond), "us" (microsecond), and "ns" (nanosecond). Duration strings may be zero, negative, fractional, and/or compound. Examples: "0", "-1.5h", "1m6s"" (I believe "µs" supported by the implementation since I believe it uses the golang impl under the hood)

For comparison:

kube-openapi: https://github.com/kubernetes/kube-openapi/blob/master/pkg/validation/strfmt/duration.go#L33-L42 (supports "d" (day) and "w" or "wk" (week) in addition to what is supported by golang impl-- it calls down to the golang impl except for these cases)

Golang: https://pkg.go.dev/time#ParseDuration (supports "µs" as an alternative to "us")

I'm not aware of a recognized standard for any of this..

@thockin
Copy link

thockin commented Jul 18, 2023 via email

@frankbu
Copy link
Contributor

frankbu commented Jul 31, 2023

@robscott What's the latest status on this duration format decision? I've added conformance tests for the new timeout features assuming we're going with the restricted string format: #2254

If this is going to continue to block the timeout features, maybe we should just switch the fields to uint32 millisecond values and live with the less nice API?

@robscott
Copy link
Member Author

@frankbu sorry this has taken so long! I think there's a rough consensus that we should aim for a format that is a subset of both OpenAPI/CEL and Go duration formats. That would include the following limitations:

  1. No literal 0 ("0s" would be supported)
  2. No fractional values
  3. No compound values
  4. Limited units (avoid day, week, and µs)

The remaining problem is if we can separate validation of formatting from min/max values. CEL should enable us to do that, but first we need to add CEL (x-ref #2253 and others)

@frankbu
Copy link
Contributor

frankbu commented Aug 1, 2023

No literal 0 ("0s" would be supported)

This one doesn't really sound necessary to me. I understand that 0 in yaml will be assumed int (not string), but that just means you need to quote the value: "0", which I'd rather do than add the s: 0s. We could also do what Contour does and allow the value of "infinity" for no timeout, but personally, I prefer 0.

@thockin
Copy link

thockin commented Aug 1, 2023

I think a good first step would be to try to convince the openapi layer to validate fields of a specific type against the format we want. I don't know where to do that, but I think we should shoot for a primitive field type. E.g. Timeout kubeapi.DurationString such that it doesn't require any explicit validation in CEL (assuming any parseable value is OK). Then you could use it with CEL validation to check the bounds, but not the format.

And if we think CEL is the path forward, maybe just adopting CEL's format is best?

@kflynn
Copy link
Contributor

kflynn commented Aug 3, 2023

@thockin, your suggestion of maybe starting with CEL's format sounded pretty sane to me. I spent some time yesterday looking over code for CEL and other extant formats, then spent some time argudiscussing things with @robscott (😇), and ultimately ended up wanting to recommend a slightly different answer for v0.8.0.

tl;dr: Let's use the CEL duration type, restricted by a regex of ^([0-9]{1,5}(h|m|s|ms)){1,4}$. That'll let us put some bounds on the input, while being compatible with a wide range of extant code and allowing us to use CEL for min/max validation.

Keep reading for all the details.


I found four different extant formats that seem obviously worth considering. (This starts out very Go-centric because Go is the implementation language for Kubernetes itself. Bear with me.)

metav1.Duration (aka time.ParseDuration)

k8s.io/apimachinery/pkg/apis/meta/v1 defines a Duration type. Its Unmarshal method calls time.ParseDuration. Per its docs:

A duration string is a possibly signed sequence of decimal numbers, each with
optional fraction and a unit suffix, such as "300ms", "-1.5h" or "2h45m". Valid
time units are "ns", "us" (or "µs"), "ms", "s", "m", "h".

So: negatives, multiple components, and floating-point are OK, but no units larger than hours are supported. Also note that where the doc says “µs”, both U+00B5 (Micro Sign) and U+03BC (Greek Small Letter Mu) are accepted, beautifully demonstrating the complexities of Unicode-enabled source code.

(Source: https://github.com/golang/go/blob/master/src/time/format.go)

CEL (aka, again, time.ParseDuration)

CEL in the form of google/cel-go defines a Duration type that can be converted from a string using the ConvertToType function in common/types/string.go. That function simply calls time.ParseDuration and uses the result.

(Source: https://github.com/google/cel-go/blob/master/common/types/string.go)

APIServer “OpenAPI” code (aka strfmt.ParseDuration)

Though the README for k8s.io/apimachinery says that the APIServer is one of its primary clients, k8s.io/apiserver/pkg/cel/common/values.go defines its own UnstructuredToVal that uses strfmt.ParseDuration to parse string values marked with the duration format.

strfmt.ParseDuration starts by calling time.ParseDuration on its input, and if that works, it uses the result – so everything supported by time.ParseDuration is also supported by strfmt.ParseDuration. If time.ParseDuration doesn’t work, strfmt.ParseDuration will separately parse multiple components, but it does not support floating-point, and it does support “d” and “w” for days and weeks (neither of which take leap seconds or leap years into account). It also supports more spelled-out units (e.g. “milli” for “ms” and “hour” for “h”).

(Why, if the file has cel in its name, do I call this “OpenAPI” code? Because this split between the string type and the duration format is an OpenAPI thing, not a CEL thing, and because the APIServer is described as using OpenAPI for CRD validation.)

(Source: https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/cel/common/values.go)

ISO8601/RFC3339

We would be remiss not to include ISO8601 here. It’s much less user-friendly than the Go duration types above, but it is the standard representation of time for many use cases. An example ISO8601 duration is “P1DT1H1M1.1S” for 1 day, 1 hour, 1 minute, and 1.1 seconds: the leading “P” is required, no units smaller than seconds are supported (although fractional seconds are supported), and all the units are case-sensitive.

RFC3339 is mostly equivalent to ISO8601, but is case-insensitive. The leading “P” or “p” is still required.

Recommendation

Given that everything above except ISO8601 has a common denominator of Go's time.ParseDuration, I think we should start with that format. It's already used in apimachinery, CEL, and the APIServer. It’s (obviously) supported in Go. It’s also supported in Rust in the kube-rs crate (as of kube-rs pull 1224, merged on 5 June 2023), and in Python in the durationpy package. These three languages should encompass the vast majority of the development in the Kubernetes ecosystem.

That being said, it definitely has downsides. Summarizing a lot of earlier discussion:

  • Supporting floating-point values opens its own can of worms around precision, corner cases, normalization, etc., and time.ParseDuration doesn’t discuss these. Looking at the source, it uses float64 (for everything, in fact, even if the number is actually written without a fractional part), and Go seems to just assume that saying “IEEE754” will cover the rest.

  • Negative durations can easily come up when you’re doing calculations with time values, but it’s tough to think of a use case for specifying a negative duration. time.ParseDuration’s ability to handle a negative duration may never be relevant for us.

  • time.ParseDuration doesn't support anything larger than hours. While timeouts are very unlikely to need days or weeks, the type is a Duration, not a Timeout, and days and weeks can be quite convenient when describing e.g. the lifetime of a TLS certificate (even though the disclaimer about leap seconds and days is critical).

  • time.ParseDuration is extremely open-ended in what it accepts. For example, it doesn't limit the maximum length of the input or the number of individual components. It does have a maximum numeric value of 1<<63 for numeric values (both for individual components and for the overall sum), but that's not really a meaningful limit.

Given those downsides, some further restriction of time.ParseDuration seems in order. @robscott and I think that it would work well to only allow durations that match

^([0-9]{1,5}(h|m|s|ms)){1,4}$

which is to say:

  • Negative values and floating-point numbers are not allowed.
  • Each component must have a numeric value with a maximum of five digits.
  • At most four components are allowed.
  • Hours, minutes, seconds, and milliseconds are supported.
  • Microseconds, days, and weeks are considered out of scope for, but could be added later.

That will allow us to put some bounds on the inputs while also just using the native CEL duration type (since anything that the regex matches can be parsed by time.ParseDuration, and that's what the CEL duration type uses). It should play nicely with using CEL for min/max validation, and it still allows "0s" for "no timeout".

Critically, it restricts the scope for v0.8.0 to something manageable. We can leave discussion of whether we want to support "d" and "w", for example, for v1.0.0; the only change we're talking about for v0.8.0 is adding a regex.

(For the curious, the maximum duration of 99999h99999m99999s99999ms is a rather unwieldy way to express about 11.6 years.)

@frankbu
Copy link
Contributor

frankbu commented Aug 4, 2023

@kflynn your suggestion SGTM! Just wondering why you think we need compound units (i.e., something like 1s500ms instead of just 1500ms)? Seems like extra complication for no real value add.

@kflynn
Copy link
Contributor

kflynn commented Aug 4, 2023

@frankbu There are two reasons:

  1. People are already accustomed to it in multiple other contexts, which means they'll try to use it (possibly just by copying and pasting without thinking about it from e.g. some other CRD that allows compound durations). When it doesn't work, they'll be confused and upset and they'll blame us. Since the capability to parse compound durations is already in the code, I'd prefer to just avoid making them confused and upset. 🙂 (Dropping floating-point support already concerns me on this front, FTR.)

  2. Being forced to drop to the smaller unit instead of support compound durations raises some usability concerns. Going from 1s500ms to 1500ms isn't really that bad, but going from 3h45m to 225m means I had to actually think about the math (🙂) and it also means that a casual reader is likely to be thinking "oh, that's a minutes-long duration" before realizing that it's actually hours. (I realize that it's very unlikely that we'll need to set timeouts of nearly four hours; this is mostly a concern for other uses of durations.)

@frankbu
Copy link
Contributor

frankbu commented Aug 4, 2023

(Dropping floating-point support already concerns me on this front, FTR.)

Personally, I think not supporting floating point is the much bigger chance of people getting burned with cut-and-paste. I think values like 1.5s or 3.5h are more common than 1s500ms or 3h30m.

but going from 3h45m to 225m means I had to actually think about the math (🙂)

True, but once your talking more than a couple of hours (easy math), what's the chance that you still need finer granularity than an hour, e.g., 48h2m 😃

Anyway, I'd be inclined to go with the more restrictive format, but it's not that big a deal either way IMO.

@frankbu
Copy link
Contributor

frankbu commented Aug 4, 2023

Personally, I think not supporting floating point is the much bigger chance of people getting burned with cut-and-paste.

@kflynn Maybe allowing 1 digit of decimal precision would be a good compromise, i.e. 0.5sbut not 0.333s?

@kflynn
Copy link
Contributor

kflynn commented Aug 4, 2023

@frankbu Yeah, that is definitely my worry about floats. 😐 I'm more prepared to defend that, because floats are pretty awful under the hood, so I'm more invested in ditching it. (Even 0.1 can't actually be represented exactly in an IEEE754 float.)

So: the proposal in GEP-2257 doesn't support floating-point at all, and I'd be delighted to land that and update GEP-1742 to use it... and if the screams about floating-point being gone are load, well, it's Experimental, we can figure out how to handle it. 🙂

@frankbu
Copy link
Contributor

frankbu commented Aug 14, 2023

Hey @robscott, now that #2258 has merged, are you planning to update this PR to align with the new duration format? If you're too busy and would like to close this one, I can take a pass at a new PR to do it and fix the other GEP-1742 issues. As soon as we get this done, we will also update #2013 so that it can hopefully get merged.

@robscott
Copy link
Member Author

Thanks to everyone for the feedback here and to @kflynn for finalizing this with GEP #2257 and then replacing this with #2316. Going to close this one out.

/close

@k8s-ci-robot
Copy link
Contributor

@robscott: Closed this PR.

In response to this:

Thanks to everyone for the feedback here and to @kflynn for finalizing this with GEP #2257 and then replacing this with #2316. Going to close this one out.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. kind/gep PRs related to Gateway Enhancement Proposal(GEP) release-note-none Denotes a PR that doesn't merit a release note. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants