-
Notifications
You must be signed in to change notification settings - Fork 512
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add GEP-1364 proposal #1383
Add GEP-1364 proposal #1383
Conversation
Signed-off-by: Nick Young <[email protected]>
Signed-off-by: Nick Young <[email protected]>
Signed-off-by: Nick Young <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Definitely a good direction
is no missing information, including but not limited to: | ||
* Any mandatory references resolve to existing resources (examples here are the | ||
Gateway's gatewayClass field, or the `parentRefs` field in Route resources) | ||
* Any specified TLS secrets exist |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently the API supports partially-valid resources. I believe this is proposing a partially-valid resource would be Attached=false
, but the config is actually 'active' (or, I would describe it, "attached").
To me this feels a bit odd. The old API sort of had this issue as well.
Thinking purely from a pedantic-correctness POV (which may not be ideal - sometimes simpler is better), I would expect there to be two conditions (I use bad names intentionally to avoid confusion): "ThisResourceWasApplied" - any part of the rule had some effect on the system, and "ThisRuleIsEntirelyValid" - no errors found in the rule at all.
Its possible my concern would be addressed with a different word than "Attached", rather than more conditions, as well
WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess this is somewhat addressed below.. but I am not sure its accurate? It seems we are defining a very small set of "failures that do not become unattached", but there are a lot more partially-accepted resources I think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The guideline I've been basing the "does something become unattached?" on is "Does this produce at least some configuration in the underlying data plane?". So I guess it sounds like I need to be more explicit with that guideline, but that aside from that, we're largely in agreement?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it makes sense to have "Is this attached" and "is this 100% valid" as two separate things.
Part of my confusion, I think, is I am not actually sure which fields result in "not attached". It seems most issues are partially accepted. Which is fine - Attached
just becomes basically a pretty high level acknowledgement that the controller is handling the resource
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I think I agree with @howardjohn's point here: "does this resource produce config" is distinct from "does this resource produce valid config". I'm sure some implementations don't allow invalid (e.g. duplicate route names or something) to be sent to the data plane, but it doesn't feel like that's a requirement for the Gateway API at this juncture.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess I wrote this pooly - by "Does this resource produce config" I meant to read "Is this resource valid enough to produce some config when combined with the rest of the resource set".
The thing I'm thinking of particularly here is that for HTTPRoute, invalid references still produce data path config because the match will respond with 500s. Whereas for TCPRoute etc, if you don't have a service to route to, there's no way you can build a forwarding path - there's nothing to forward to! So that won't produce any data path config at all, and so, in the current design, would cause the resource to fail to be attached.
I don't think anything should be allowed to produce invalid config in the underlying data plane. If so, that should mean that the resource becomes unattached.
I guess this means that we need some more clarity about what attached means. I've been assuming that if we consider a resource to be "attached" to another, that:
- the resource is valid in itself
- the resource will produce at least some valid config in the underlying data plane. (note that "return a 500 for this route" is valid config).
And so, that having a "valid" condition is not really useful because if something is not valid, it won't be attached either.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible for something like an HTTPRoute to define two paths, one of which is "valid enough" config, while the other is garbage?
If so, are there expectations or requirements from a conformance PoV ("always apply all or none") that we need to specify?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible for something like an HTTPRoute to define two paths, one of which is "valid enough" config, while the other is garbage?
Most cases of an invalid configuration in one route rule (with others being completely valid within the same HTTPRoute) seem to be misconfigurable only in the context of a reference to another resource and fall into the ResolvedRefs
bucket (which seems stated below and possibly agreed upon that Attached: true
be set with ResolvedRefs: false
and return HTTP code 500 for the invalid rule).
One such example not covered is if you were to use a HTTPRequestRedirectFilter
and HTTPURLRewrite
in the list of filters for an individual HTTPRouteRule
(ref). At the moment it doesn't appear we have prescribed any status code or other error handling for this error. This seems to fall into the bucket of things that cannot produce data plane configuration.
Going by the statements below, if this rule were the only one present Attached: false
should be set, since it cannot result in any configuration (with another more specific error condition). If present with other valid rules, Attached: true
should be set (with another more specific error condition again). I can definitely see the argument for setting Attached: false
however in the latter case, since stated above, the Attached
condition means The resource is syntatically and semantically valid, and internally consistent
.
It seems that the Valid: true
condition mentioned above ends up being somewhat equivalent to having for example for HTTPRoute ResolvedRefs: true
and no error conditions (and vice versa, Valid: false
equivalent to ResolvedRefs: false
or some more specific error condition). Maybe this framing helps a little with determining possible merits of having another condition?
If so, are there expectations or requirements from a conformance PoV ("always apply all or none") that we need to specify?
Just doing a quick pass, I don't think I saw any existing conformance tests that cover this kind of thing so I agree that would be good to add.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Avoiding "always apply all or none" is the reason this is tricky. There are mixed expectations about what happens if some config is present but bad, and I'm attempting to thread the needle here and make this as user-friendly as possible while not becoming really difficult to implement or understand.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a section on partial validity to hopefully make this clearer.
site-src/geps/gep-1364.md
Outdated
stuck with some sort of "will be ready in no longer than x seconds" standard | ||
(which we'll also need to write conformance tests for, probably including a | ||
stress test of adding a *lot* of records as well.) As an initial ballpark, | ||
I propose 5 seconds. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This probably won't work for cloud providers (at least GKE is slower, not sure about others).
I am also not sure it provides value. What do we lose if we don't define anything here?
If we consider something like Knative, they certainly aren't going to be able to just wait X seconds; I'd expect the same with other users
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And if we can't provide any guarantees about Readiness, then it seems like Ready becomes no different than Attached...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We certainly can mandate Ready
to be an extended condition, and say "If you supply a Ready condition, it has to work like this". Maybe doing that is a better place to start with Ready
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have a feel for which implementations can and cannot support Ready
properly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree that cloud providers will likely have trouble implementing a Ready
condition. Could we introduce a Reconciled
condition that would be more broadly implementable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 any distributed (e.g. geo distributed) implementation has a hard time with this (think: config was pushed to 95% of the sites around the world, is it ready?)
site-src/geps/gep-1364.md
Outdated
|
||
Because this GEP is mainly concerned with updating the Conditions we are setting in | ||
Gateway API resources' `status`, it's worth reviewing some important points about | ||
Conditions. (Thi information is mainly taken from the [Typical status properties][typstatus] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing href for the link here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's all the way down at the bottom of the doc, with the other references. I thought it was better to have everything in the one place.
site-src/geps/gep-1364.md
Outdated
one entry of each item, using the `type` field as a key. (So, this is effectively | ||
a map that looks like a list in YAML form). | ||
* Each has a number of fields, the most important of which for this discussion | ||
are `type`, `status`, `season`, and `observedGeneration`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are `type`, `status`, `season`, and `observedGeneration`. | |
are `type`, `status`, `reason`, and `observedGeneration`. |
is no missing information, including but not limited to: | ||
* Any mandatory references resolve to existing resources (examples here are the | ||
Gateway's gatewayClass field, or the `parentRefs` field in Route resources) | ||
* Any specified TLS secrets exist |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I think I agree with @howardjohn's point here: "does this resource produce config" is distinct from "does this resource produce valid config". I'm sure some implementations don't allow invalid (e.g. duplicate route names or something) to be sent to the data plane, but it doesn't feel like that's a requirement for the Gateway API at this juncture.
site-src/geps/gep-1364.md
Outdated
|
||
Note that some classes of inter-resource reference failure do _not_ cause a resource | ||
to become unattached (that is, to have the `Attached` condition set to `status: false`). | ||
* Nonexistent Service backends - if the backend does not exist on a HTTPRoute that |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This starts painting an observability problem IMO. Even if conditions are only meant to be human-readable, how does a human look at an HTTPRoute resource and know "oops I have a typo in my service backend"? Yes, 500s indicate something's wrong, but I would expect to find a condition pointing to the specific issue
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should allow room for error conditions in addition to attached in this case so the intformation can be conveyed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I guess I should make this clearer, that in this case, it's absolutely expected that there be some additional, negative-polarity error conditions that tell you that the backendRef doesn't exist. (This is a distinct case from the backend existing, but having zero endpoints, which would not generate a "backendRef doesn't exist" error, but would end up with the same behavior).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You might also get a BackendRefEmpty
condition...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for all the work on this @youngnick! This is a great start
site-src/geps/gep-1364.md
Outdated
field was when the controller last saw a resource. | ||
* Conditions shoud describe the _current state_ of the resource at observation | ||
time, which means that they should be an adjective (like `Ready`), or a past-tense | ||
verb (like `Attached`). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would Attaching
ever be valid as a condition? Maybe just verb or adjective is sufficient here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That sort of usage is specifically called out as not desirable in the API conventions, and I agree. Conditions are supposed to be level-triggered, not edge-triggered, so having a condition that says that something is happening is a bit weird. (Should Attaching
transition to false
once the attaching process is complete?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For posterity, can we cite some sources for that assertion? It would probably actually be good as a part of this status effort to document somewhere in the code that we're avoiding this due to conventions, and then point to those conventions, as this will help future contributors to more quickly avoid pursuing such things (or give them a path to who they need to convince in the greater kubernetes community).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is almost a direct quote from the API conventions as it is, should I put a link in here to the exact line?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The exact quote is as follows, I can just inline if it's better, but I was trying to avoid just pasting in the entire section:
Condition type names should describe the current observed state of the resource, rather than describing the current state transitions. This typically means that the name should be an adjective ("Ready", "OutOfDisk") or a past-tense verb ("Succeeded", "Failed") rather than a present-tense verb ("Deploying"). Intermediate states may be indicated by setting the status of the condition to Unknown.
For state transitions which take a long period of time (e.g. more than 1 minute), it is reasonable to treat the transition itself as an observed state. In these cases, the Condition (such as "Resizing") itself should not be transient, and should instead be signalled using the True/False/Unknown pattern. This allows other observers to determine the last update from the controller, whether successful or failed. In cases where the state transition is unable to complete and continued reconciliation is not feasible, the Reason and Message should be used to indicate that the transition failed.
I don't believe that we generally have any process that we want to handle in the way called out in the exception bullet point. Happy to be guided here on what you'd like to see.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a direct link to the language would be nice, but not a blocker.
really viable: | ||
* All conditions must be negative polarity | ||
* All conditions must be positive polarity | ||
* Some conditions can be positive polarity, but most should be negative. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My own personal preference would be for conditions with a positive polarity to represent the normal healthy state and always be present, and for conditions with a negative polarity to represent very specific error states and to only be present when true. I think we should do everything we can to avoid the presence of double negatives like "Detached == False", but recognize that there are some error states that are best represented with a negative condition. I just don't think those negative conditions should be present when false.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1. As I mentioned above -- as long we document and check for it in conformance, this is the only guarantee that the conditions are meaningful programmatically, which is basically what we are looking for when we build for consistency.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're going to have a mixed-polarity set of conditions, then I think we need to define:
- which Conditions are positive polarity (Probably
Attached
,Programmed
,Ready
, andValid
if we do it) - which Conditions are positive polarity and should always be present (Probably
Attached
,Programmed
, andValid
if we do it). - which Conditions are core (currently, everything other than
Ready
) - which Conditions are extended (only
Ready
). - which Conditions are negative (all the error conditions).
site-src/geps/gep-1364.md
Outdated
stuck with some sort of "will be ready in no longer than x seconds" standard | ||
(which we'll also need to write conformance tests for, probably including a | ||
stress test of adding a *lot* of records as well.) As an initial ballpark, | ||
I propose 5 seconds. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree that cloud providers will likely have trouble implementing a Ready
condition. Could we introduce a Reconciled
condition that would be more broadly implementable?
site-src/geps/gep-1364.md
Outdated
|
||
For many implementations (certainly for Envoy-based ones), getting this information | ||
correctly and avoiding races on applying it is surprisingly difficult. For this | ||
reason, this GEP proposes we give the `Ready` condition a short window |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I'd rather reserve Ready
for "Ready at this moment" and instead use a different term to represent what we're actually describing - this config has been written to the data plane but we're not sure how long that will take to propagate. Previously Reconciled
has been proposed to communicate this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I'll update this section to:
- add
Ready
as an Extended support condition, that, if present, indicates that the data plane is fully configured. But because it's not always present, we'll needReconciled
to indicate what we have here - that everything is okay and the data plane will be ready soon.
site-src/geps/gep-1364.md
Outdated
* In general, resources should be considered `Attached` if their config will | ||
generate some config in the underlying data plane. | ||
* All Conditions will be positive polarity (`GoodState: true`, not `BadState: false`) | ||
* All relevant Conditions for a resource must be added when it's observed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should state that the conditions covered in the core API must be present -- but there are likely other conditions (e.g. implementation specific), that are optional.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, does this mean that all conditions that have Core conformance should always be present? Or does this mean that a set of positive-polarity summary endpoints we name should always be present?
The former could include any negative-polarity conditions we define as having Core conformance (which would clash with the "don't have double negatives" guidance we're looking for in other places.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've guessed "summary conditions" here - meaning just the positive polarity ones.
really viable: | ||
* All conditions must be negative polarity | ||
* All conditions must be positive polarity | ||
* Some conditions can be positive polarity, but most should be negative. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1. As I mentioned above -- as long we document and check for it in conformance, this is the only guarantee that the conditions are meaningful programmatically, which is basically what we are looking for when we build for consistency.
in the healthy case is much better rules out the first option, so we are left to | ||
decide between enforcing that all conditions are positive, or that we have a mix. | ||
|
||
Having an arbitrary mix will make doing machine-based extraction of information |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know if this is necessarily true -- it just needs to be explicitly documented for and tested. I think the previous attempt by K8s to force one polarity just simply didn't work out because it didn't line up with people's intuition. We should really try to keep things intuitive as much as possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It means that anything consuming the Conditions will need to have a hard-coded list of the positive polarity conditions. Not a deal breaker, but it's another obstacle in the way of building a generic condition-parsing library.
site-src/geps/gep-1364.md
Outdated
|
||
Note that some classes of inter-resource reference failure do _not_ cause a resource | ||
to become unattached (that is, to have the `Attached` condition set to `status: false`). | ||
* Nonexistent Service backends - if the backend does not exist on a HTTPRoute that |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should allow room for error conditions in addition to attached in this case so the intformation can be conveyed.
site-src/geps/gep-1364.md
Outdated
stuck with some sort of "will be ready in no longer than x seconds" standard | ||
(which we'll also need to write conformance tests for, probably including a | ||
stress test of adding a *lot* of records as well.) As an initial ballpark, | ||
I propose 5 seconds. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 any distributed (e.g. geo distributed) implementation has a hard time with this (think: config was pushed to 95% of the sites around the world, is it ready?)
Okay, to summarize the comments around condition naming and polarity: We're going to end up with four possible positive polarity summary conditions:
Other error conditions may be added that are negative polarity - the only thing relevant about these Conditions is if they are Three or four summary, positive-polarity conditions seems on the high side to me, but if everyone feels that the additional clarity is worth the overhead of managing more Conditions. I'm kind of inclined to say that anything we would use a |
Signed-off-by: Nick Young <[email protected]>
ValidSpec can have things not feasible to check in a webhook like temporary dependency issues (referenced service doesn't exist) or implementation specific issues (parent has FooGateway, which doesn't support the Bar field), so I think it holds weight. If we do want to collapse we could have a single Also re-iterating my comment on #1383 (comment) since I think I wasn't quite clear - I think making a |
I guess in my mind,
To put this another way, for me, Is it just the naming that's the problem, or is it the pivoting the status around the idea of attachment? |
I think I am confused. I thought ValidSpec meant it was 100% valid. So we would make ValidSpec=false if there is a unknown service. I thought the proposal was to have only the conditions listed above, so how would
Maybe.. unless the field is "fail open" and we just ignore it...? I am not sure we have a clear spec here on unsupported fields. It would be a bummer to add a filter and then suddenly have an outage because it wasn't supported. |
Ah, I guess I've been including |
To make my earlier response clearer (and I'm working on updating the text to include something like this), the idea behind In the case you're talking about using |
Signed-off-by: Nick Young <[email protected]>
|
||
The guidance about Conditions being added as soon as a controller sees a resource | ||
is a bit unclear - as written in the conventions, it seems to imply that _all_ | ||
relevant conditions should always be added, even if their status has to be set to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is reasonable for the controller to do this - as an indication the resource is in use by the controller.
It becomes problematic if a resource may be processed by multiple controllers - for example the discussion
of HttpRoute applying to both mesh and Gateway can only work if the type has a prefix (MeshReady).
In general it would be good to indicate how to handle resources where multiple controllers operate on. While
it's good to know which controller mess with the resource - maybe the controller should skip this step if the
resource is not ignored ( a mesh controller will ignore any route that don't target both mesh and gateway as example)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In most cases, the Conditions slice is in a separate struct that's part of a slice that's namespaced by controller name - so that controllers should be able to update only their own Conditions. I'd expect that mesh controllers should have a distinct controller name string, so this method should continue working.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@thockin has said that getting multiple controllers to cooperate on node health status was very subtle, so I'd be careful and maybe distill his experience if you need multiple controllers writing one status.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think our practice of including controllerName
for each distinct set of conditions should largely allow multiple controllers to cooperate on status for the same resource, but open to other potential improvements.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do controllerName
conditions roll up to Accepted
?
Also, controllers probably need to be sure to use PATCH on the status
sub-resource or do conditional PUTs and avoid extra writes if the conditions haven't changed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For some resources, all conditions are namespaced by controllerName
. I agree we need to ensure that the rollup behavior is specified. I'm going to add a section about partial validity that should help.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For HTTPRoute
as a specific example, the structure of RouteParentStatus
appears well-suited to handle a separate list of conditions for each parentRef
or controller, so a mesh or Service
relationship for a route would have a separate set of conditions and Accepted
status than those used for relaying the status of a concurrent parent relationship to a Gateway
, with no need to collaborate on writing to a single status.
Signed-off-by: Nick Young <[email protected]>
Okay, I think I've captured all the outstanding comments in that update. Please take a look, our timebox for this last pass of reviews is in just under 48 hours, so that I can start working on the implementation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given the importance with timing of this one, I don't see anything in this GEP that's problematic enough that we couldn't merge this and if all else fails sort something out in the follow-up PR. As such I'm approving, but I will leave the LGTM to someone else as this change has a strong impact.
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: keithmattix, shaneutt, youngnick The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @youngnick! These are great updates. A few nits and a question about naming but really like this direction.
site-src/geps/gep-1364.md
Outdated
|
||
* All the current Conditions that indicate that the resource is okay and ready | ||
for processing witll be replaced with `Attached`, except for GatewayClass (since | ||
it is the root of the resource tree, it will stay with `Accepted`). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't feel strongly about this, but any reason not just to use Accepted
everywhere instead of Attached
in most places and Accepted
in this one place?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had the same thought and am +1 to this, Accepted
seems a little bit easier to understand for users IMO and can be consistently used across all resources.
of other specific negative-polarity error conditions. | ||
* All relevant positive-polarity summary Conditions for a resource must be added | ||
when it's observed. | ||
For example, HTTPRoutes must always have `Accepted` and `ResolvedRefs`, regardless |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is Accepted
distinct from Attached
here? Would Accepted
ever be true if ResolvedRefs
was false
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hadn't updated, but with the change to Accepted
everywhere, this is now correct, thanks.
site-src/geps/gep-1364.md
Outdated
* The summary conditions will be added into the CRD definitions | ||
as default values for `status.Conditions`, with a status of `Unknown`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why would we set default values here? What happens when multiple controllers are implementing the same resource like a HTTPRoute? I would prefer relying on conformance tests to require controllers to set these conditions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was intending to make it that the non-controllerName
Conditions will be present, but I'll leave this out for now.
site-src/geps/gep-1364.md
Outdated
Gateway API conditions will be positive for conditions that describe the happy | ||
state of the object, which is currently `Attached` and `ResolvedRefs`, and will | ||
also include the new `Programmed` condition, and the newly-Extended condition | ||
`Ready`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like this paragraph is missing a sentence that says that a separate set of Error conditions will exist that will only be set when they are true.
site-src/geps/gep-1364.md
Outdated
|
||
Positive polarity Conditions that describe the desirable state of the object must | ||
always be set. These are currently `Attached`, `ResolvedRefs`, and `Programmed`. | ||
Implementations that use `Ready` must also add it immediately, set to `false`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tiny nit: This sentence feels a bit too prescriptive, maybe "before programming the Route" instead of saying that it needs to be set to false
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that seems like a good idea.
site-src/geps/gep-1364.md
Outdated
That is, the proposal is to replace: | ||
* `Scheduled` on Gateway | ||
* `Detached` on Listener | ||
* `Accepted` on Route | ||
|
||
with `Attached` in all these locations. | ||
|
||
GatewayClass will maintain the `Accepted` condition. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seeing it laid out this way makes me think that the smaller change would be to use Accepted
everywhere since it would only represent a change to Gateway (and Listener within it).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, agreed. I've migrated the proposal to this.
site-src/geps/gep-1364.md
Outdated
Note that having a correctly-defined set of resources that is empty does not make | ||
these resources unattached. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand this. What would empty resources look like? If they're not unattached what are they?
site-src/geps/gep-1364.md
Outdated
* `observedGeneration` is an optional field that sets what the `metadata.generation` | ||
field was when the controller last saw a resource. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we'd had some conversations about making this required for conformant Gateway implementations. Probably out of scope for this GEP, just wanted to make sure it was intentionally excluded.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, I updated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not expert here things seem reasonable to me
site-src/geps/gep-1364.md
Outdated
* The resource is valid enough to produce some configuration in the underlying | ||
data plane. | ||
|
||
For Gateway, `Attached` also subsumes the functions of `Scheduled`: `Attached` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On some impls, provisioning takes about a minute. That may mean that some errors that are immediately known are masked. Or I suppose they aren't masked, but we need to wait the entire minute to know if its valid. The message field could say "all fields valid, waiting for deployment" perhaps, but that isn't in the machine parse-able section.
That may not warrant change, just a thought
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I worded this carefully, so as not to imply that the provisioning needs to have completed, just that it can be started. The completion of infra provisioning should be handled by Programmed
or Ready
, or both.
[Typical status properties][typstatus] section of the guidelines. | ||
5. Conditions should be applied to a resource the first time the controller sees | ||
the resource. This seems to imply that _all conditions should be present on every | ||
resource owned by a controller_, but the rest of the conventions don't make this |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Conditions which appear and disappear from a resource can make status reporting UIs harder, so I'd recommend keeping the Condition set consistent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that we should have a set of permanent and reliable conditions that are always present, but I personally don't think it makes sense for error conditions to always be present, especially since that tends to lead to confusing double negative states.
|
||
The guidance about Conditions being added as soon as a controller sees a resource | ||
is a bit unclear - as written in the conventions, it seems to imply that _all_ | ||
relevant conditions should always be added, even if their status has to be set to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@thockin has said that getting multiple controllers to cooperate on node health status was very subtle, so I'd be careful and maybe distill his experience if you need multiple controllers writing one status.
* In general, resources should be considered `Attached` if their config is valid | ||
enough to generate some config in the underlying data plane. | ||
* There will be a limited set of positive polarity summary conditions, and a number | ||
of other specific negative-polarity error conditions. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will make UI display somewhat harder (green dots / checkmarks / exclamation points), but I realize it may align with human reading a bit more easily.
I don't have a strong opinion here, just want to highlight the cost elsewhere in the stack.
is no missing information, including but not limited to: | ||
* Any mandatory references resolve to existing resources (examples here are the | ||
Gateway's gatewayClass field, or the `parentRefs` field in Route resources) | ||
* Any specified TLS secrets exist |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible for something like an HTTPRoute to define two paths, one of which is "valid enough" config, while the other is garbage?
If so, are there expectations or requirements from a conformance PoV ("always apply all or none") that we need to specify?
site-src/geps/gep-1364.md
Outdated
|
||
Note that some classes of inter-resource reference failure do _not_ cause a resource | ||
to become unattached (that is, to have the `Attached` condition set to `status: false`). | ||
* Nonexistent Service backends - if the backend does not exist on a HTTPRoute that |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You might also get a BackendRefEmpty
condition...
is no missing information, including but not limited to: | ||
* Any mandatory references resolve to existing resources (examples here are the | ||
Gateway's gatewayClass field, or the `parentRefs` field in Route resources) | ||
* Any specified TLS secrets exist |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible for something like an HTTPRoute to define two paths, one of which is "valid enough" config, while the other is garbage?
Most cases of an invalid configuration in one route rule (with others being completely valid within the same HTTPRoute) seem to be misconfigurable only in the context of a reference to another resource and fall into the ResolvedRefs
bucket (which seems stated below and possibly agreed upon that Attached: true
be set with ResolvedRefs: false
and return HTTP code 500 for the invalid rule).
One such example not covered is if you were to use a HTTPRequestRedirectFilter
and HTTPURLRewrite
in the list of filters for an individual HTTPRouteRule
(ref). At the moment it doesn't appear we have prescribed any status code or other error handling for this error. This seems to fall into the bucket of things that cannot produce data plane configuration.
Going by the statements below, if this rule were the only one present Attached: false
should be set, since it cannot result in any configuration (with another more specific error condition). If present with other valid rules, Attached: true
should be set (with another more specific error condition again). I can definitely see the argument for setting Attached: false
however in the latter case, since stated above, the Attached
condition means The resource is syntatically and semantically valid, and internally consistent
.
It seems that the Valid: true
condition mentioned above ends up being somewhat equivalent to having for example for HTTPRoute ResolvedRefs: true
and no error conditions (and vice versa, Valid: false
equivalent to ResolvedRefs: false
or some more specific error condition). Maybe this framing helps a little with determining possible merits of having another condition?
If so, are there expectations or requirements from a conformance PoV ("always apply all or none") that we need to specify?
Just doing a quick pass, I don't think I saw any existing conformance tests that cover this kind of thing so I agree that would be good to add.
Signed-off-by: Nick Young <[email protected]>
Okay, updated again, this feels like it's getting closer. Changelog this time:
|
Thanks @youngnick! This is a really well written GEP. Since previous generations of this PR have already been approved a couple times, I think we've got pretty clear consensus around this approach. /lgtm |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Quite late in doing a thorough review of this, but really happy with where it ended up - feels like a good step forwards in resolving some of the current ambiguity in status conditions. Thanks @youngnick!
2. They are a listMapType, a list that is enforced by the apiserver to have only | ||
one entry of each item, using the `type` field as a key. (So, this is effectively | ||
a map that looks like a list in YAML form). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually hadn't realized the uniqueness on type
constraint - this should probably be documented more clearly within the Gateway API spec, as it has some bearing on how "multiple error" cases may be expressed, and explains the structure of RouteParentStatus
, which will be increasingly relevant if we start expecting resources like HTTPRoute
to bind to multiple parentRefs (for GAMMA and N/S concurrent usage or route delegation).
|
||
The guidance about Conditions being added as soon as a controller sees a resource | ||
is a bit unclear - as written in the conventions, it seems to imply that _all_ | ||
relevant conditions should always be added, even if their status has to be set to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For HTTPRoute
as a specific example, the structure of RouteParentStatus
appears well-suited to handle a separate list of conditions for each parentRef
or controller, so a mesh or Service
relationship for a route would have a separate set of conditions and Accepted
status than those used for relaying the status of a concurrent parent relationship to a Gateway
, with no need to collaborate on writing to a single status.
This GEP proposes replacing all conditions that indicate syntactic and semantic | ||
validity with one, `Accepted` condition type, with the exception of the | ||
GatewayClass resource (because it's the root of the resource tree and doesn't | ||
attach to anything). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the update to use Accepted
instead of Attached
, GatewayClass is no longer an exception.
* HTTPRoute with one match with one backend that is a non-existent Service backend. | ||
The `Accepted` Condition is true, the `ResolvedRefs` condition is false. `Accepted` | ||
is true in this case because the data path must respond to requests that would be | ||
sent to that backend with a 500 response. | ||
* HTTPRoute with one match with two backends, one of which is a non-existent Service | ||
backend. The `Accepted` Condition is true, the `ResolvedRefs` condition is false. | ||
`Accepted` is true in this case because the data path must respond to a percentage | ||
of the requests matching the rule corresponding to the weighting of the non-existent | ||
backend (which would be fifty percent unless weights are applied). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For both of these cases, the ResolvedRefs
RouteConditionReason should be BackendNotFound
* HTTPRoute with one match with one backend that is in a different namespace, and | ||
does _not_ have a ReferenceGrant permitting that access. The `Accepted` condition | ||
is true, and the `ResolvedRefs` Condition is false. As before, `Accepted` is true | ||
because in this case, the data path must be programmed with 500s for the match. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ResolvedRefs
RouteConditionReason should be RefNotPermitted
* HTTPRoute with one Custom supported filter added that is not supported by the | ||
implementation. Our spec is currently unclear on what happens in this case, but | ||
custom HTTP Filters require the use of the `ExtensionRef` filter type, and the | ||
setting of the ExtensionRef field to the name, group, version, and kind of a | ||
custom resource that describes the filter. If that custom resource is not supported, | ||
it seems reasonable to say that this should be a reference failure, and be treated | ||
like other reference failures (`Accepted` will be set to true, `ResolvedRefs` to | ||
false, and traffic that would have matched the filter should receive a 500 error.) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the ResolvedRefs
RouteConditionReason in this case should be InvalidKind
, because of the GKV of the reference is invalid. We may want to consider adding a condition specifically for unsupported ExtensionRefs, or note that the reference with an invalid kind should be included in the Condition message field.
false, and traffic that would have matched the filter should receive a 500 error.) | ||
* A HTTPRoute with one rule that specifies a HTTPRequestRedirect filter _and_ a | ||
HTTPURLRewrite filter. `Accepted` must be false, because there's only one rule, | ||
and this configuration for the rule is invalid (see [reference][httpreqredirect]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Link appears broken due to a missing footnote reference.
@mikemorris these are great bits of feedback, do you have time to create a follow up PR or issue to address any that are straightforward to fix? Concerned that otherwise we'll lose track of these. |
What type of PR is this?
/kind gep
What this PR does / why we need it:
This PR adds an implementable proposal for GEP1364 (#1364).
It's a big GEP, for a very big change, that will definitely break conformance tests. But I think that it will set us up with clear rules for Conditions going forward.
It's quite opinionated, so I expect to need to defend my decisions here somewhat, don't be shy about objections.
Which issue(s) this PR fixes:
Updates #1364
Does this PR introduce a user-facing change?:
The release note will need something like the tl;dr in the GEP, once we all agree. 😄