KEP-2872 Add keystone containers KEP #2869
Conversation
/cc @mrunalp

There are a few options for annotations:

1. Specifying a list of keystone or sidecar containers in one annotation
I believe this can produce some problems with the usage of mutating webhooks to inject sidecar configurations.
Using `alpha.sidecars.kubernetes.io/<container name>: string` rather than `alpha.node.kubernetes.io/keystone:` or `alpha.node.kubernetes.io/sidecars:` reduces the cost of uptake of this feature for mutating-webhook-based sidecar injection, since the mutators can mark their own container without requiring the workload owners to adopt the feature themselves.
I believe this behaviour aligns with the noted motivation as well.
Sorry, not sure I follow what you're saying. If the list of keystone containers is the value of one annotation, what happens exactly?
@timbyr Do you mean doing this to allow webhooks to mark their own injected containers as keystone containers (or explicitly as not keystones)?
For example, Gatekeeper's AssignMetadata is only able to set an annotation; it can't append to an existing one. So an annotation per container name is easier to implement.
Using the Istio example, the following would be enough:
```yaml
apiVersion: mutations.gatekeeper.sh/v1alpha1
kind: AssignMetadata
metadata:
  name: istio-proxy-is-not-keystone
spec:
  applyTo:
  - groups:
    - ""
    kinds:
    - Pod
    versions:
    - v1
  location: metadata.annotations."sidecars.alpha.kubernetes.io/istio-proxy"
  parameters:
    assign:
      value: sidecar # other possible value: keystone
```
(NB: the name and value of the annotation is an example)
Definitely agree with the above. Since many sidecars are injected, the only reliable way to implement this from a webhook would be to add an annotation per sidecar container. Anything else has ordering issues in the face of multiple webhooks, I think.
We will avoid using `alpha.sidecars.kubernetes.io/<container name>: string` and prefer `alpha.node.kubernetes.io/keystone` or `alpha.node.kubernetes.io/sidecars`.
@howardjohn sorry, I did not understand the single-annotation-vs-list problem.
Say a user creates a pod with 1 container, and there are 2 webhooks injecting 2 sidecars.
- Using the keystone annotation is not viable, because no webhook can safely inject it: they don't know whether something is a keystone or not. I would NOT expect users to set that something is a keystone manually when using auto-injected sidecars. For example, the kubectl default-container annotation is injected by Istio.
- So that leaves the sidecar annotation. I suppose it's possible to use it, but you would need some logic to append to the list if it exists and create it if not. For example, the Gatekeeper example above probably wouldn't work. Note that a real API field probably would, since a merge would append to the list.

It also leaves room for misconfiguration if both annotations are set (possibly by 2 different webhooks).
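To make the difference concrete, here is a minimal sketch of the per-container shape after both webhooks have run; the annotation key and values are illustrative placeholders, not a settled API. Each webhook writes its own key, so neither has to read or merge an existing value:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example
  annotations:
    # written by webhook A (key name is hypothetical)
    sidecars.alpha.kubernetes.io/istio-proxy: sidecar
    # written by webhook B, independently of webhook A
    sidecars.alpha.kubernetes.io/log-shipper: sidecar
spec:
  containers:
  - name: app                               # the user's own container
    image: registry.example.com/app:1.0
  - name: istio-proxy                       # injected by webhook A
    image: registry.example.com/proxy:1.0
  - name: log-shipper                       # injected by webhook B
    image: registry.example.com/shipper:1.0
```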
Ok, let's go for the other option listed (one annotation per container). Thanks for the input!
Taking into account Tim's input here, the way to go for now is an annotation for the keystone container, with an artificial limit of 1 keystone container that we can later lift. See also this and this.
This will only be to gather data; we will move to a `pod.spec` field in the future, probably when we implement one of the pre-proposals (we need to decide which one first).
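A hedged sketch of what that interim shape could look like — the annotation key below is hypothetical, and only one container may be marked while the limit is in place:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example
  annotations:
    # hypothetical key: marks exactly one container as the keystone;
    # marking more than one would be rejected while the limit is in place
    keystone.alpha.kubernetes.io/container: main
spec:
  containers:
  - name: main          # the keystone container
    image: registry.example.com/batch:1.0
  - name: istio-proxy   # treated as a sidecar once "main" exits
    image: registry.example.com/proxy:1.0
```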
/cc
@thockin We would like to have your early input on this idea, since you looked into the sidecar problems in depth and also authored one sidecar proposal. /cc @thockin
Why is this annotation based instead of a proper (alpha feature gated) field on the pod? What value does starting off as an annotation provide?
The idea here is to not introduce a new API change before we move to beta...
The annotation is a new API, just one that is harder to discover and reason about because it lacks proper typing through a field. You can also do proper conversions from an alpha field to a beta field.
John's argument about injection and the "polarity" of this flag stands out to me. There was a reason we had it in the other direction previously.
When I originally talked about sidecars, it was a slightly different semantic. In particular, what I meant was "if this container dies, the whole pod falls down". In particular, all other peer-containers would be torn down, emptyDir volumes would be wiped, namespaces would be torn down and re-established, etc. It was the "have you tried rebooting it" model. I'm not actually convinced that is a useful semantic, but it is notable that there's no way to express it today.
@derekwaynecarr @dchen1107 Do you have thoughts? Do you acknowledge the problem described here? Do you think it is worth solving at all?
This is particularly helpful in solving the blocking issue of sidecar containers with Jobs. It will also help us take a step forward with the sidecar KEP.

We are calling `application/essential/main/critical` containers `keystone` containers in this KEP to distinguish them from sidecars.
I think I am partly to blame for saying "keystone" - I'm not sure it's the best word, so we should be open to alternative naming, e.g. "primary" or something
This identification of application containers is used by the kubelet to make a decision on pod termination based on the status of the application containers and the pod restart policy.

- if all application containers exited successfully and the restart policy of the pod is `Never` or `OnFailure`, then terminate the pod.
"terminate the pod, respecting the value of terminationGracePeriodSeconds
."
Right? This still gives non-keystone containers a chance to finish their jobs, push logs, whatever.
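As a hedged illustration of that termination flow, reusing the hypothetical keystone annotation sketched earlier (none of these key names are settled in this thread):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: batch-with-sidecar
  annotations:
    # hypothetical key: "main" is the keystone container
    keystone.alpha.kubernetes.io/container: main
spec:
  restartPolicy: Never
  # once "main" exits successfully, the kubelet would start terminating the
  # pod, giving "log-shipper" up to 30 seconds to push its remaining logs
  terminationGracePeriodSeconds: 30
  containers:
  - name: main
    image: registry.example.com/batch:1.0
  - name: log-shipper
    image: registry.example.com/shipper:1.0
```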
### Non-Goals

- Handling container startup/shutdown ordering for application and sidecar containers.
In re-familiarizing myself with this, it really is mostly the inverse of the previous sidecar KEP, isn't it? It's simpler in that it doesn't ascribe startup or shutdown ordering, which does limit its utility. This KEP alone does not really solve the network-proxy sidecar problems (not completely, anyway). The question is - does it provide enough value to justify itself for now?
Let's assume that the fundamental problems that have stalled the bigger solution will not go away any time soon. This KEP, if implemented, could be around for years. If we leave room, maybe it could grow startup/shutdown ordering semantics. If we add "real" sequencing (whether that is a DAG or phases or ...), it would obsolete this, but I think we could still define it compatibly - you get to pick one or the other model, but not both. It's tech debt that we could probably never drop, but it's not short-term debt. I would not take this on to accelerate by 1 or 2 releases, but I might take it on to save O(years).
Update KEP template and add PRR YAML file
Commenting here just for visibility. Please take a look at @SergeyKanzhelev's proposal to add a per-container restart policy: #2869 (comment). This is a simpler way to mimic the same behavior proposed in this KEP. Please note that both sidecar proposals to solve the dependency issue in a general way (phases and explicit dependencies) also suggested the creation of a per-container restart policy, so this clearly fits nicely with any of the proposals we currently have and might implement in the future.
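For readers unfamiliar with that proposal, here is a minimal, hedged sketch of what a per-container restart policy could look like on a Job-style pod. The field shape shown is an assumption for illustration only, not the API agreed on in this thread:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: job-with-helper
spec:
  restartPolicy: Never        # pod-level policy governs the main container
  containers:
  - name: main
    image: registry.example.com/batch:1.0
  - name: istio-proxy
    image: registry.example.com/proxy:1.0
    restartPolicy: Always     # proposed per-container override (hypothetical here):
                              # keep the helper restartable for as long as the pod runs
```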
Rephrased: What is the concrete use case being solved for, and what are the requirements? I honestly don't know what use case this proposal solves for, so I'll sketch out a straw-man for the ones I do understand. Feel free to identify a different set.
Use case:
Context:
Requirements:
Today...
So if you solve requirement 2, and the primary workload exiting causes the sidecar service to be terminated, then the primary workload can no longer depend on being restarted until after the sidecar service is ready, which creates the need for a solution to requirement 1. Are there any other concrete use cases which have requirement 2 but not requirement 1? If yes, then maybe this proposal to solve requirement 2 IS enough, provided it can be extended to later solve, or be compatible with, a different solution to requirement 1.
Per-container restart fleshes out the model, but is not sufficient to describe "this is a helper".
Exit != success. Most jobs are restart-on-failure, failure being a non-zero exit code.
Any case where a "helper" (just avoiding the word sidecar, to break assumptions about semantics) provides some functionality which the app container can retry-loop around. I have argued many times that if your app exits or crashes as soon as a dependency fails, you are doing it wrong. Things crash. You have to assume that anything you depend on will, occasionally, crash. As such, the startup-ordering problem is, IMO, less critical than the shutdown problem. That said, there are some specific cases where you really do need sequencing, but I think that's more of an "X must happen before Y" problem than an "X must be up for Y" one. For example, a network proxy wants to install a hook to intercept traffic before the app ever starts. I DO NOT think we are attacking that problem here.
I think the proposal I made satisfies that, personally, but I feel like I am not convincing anyone.
Real discussion: #2869 (review)
PRR looks good for alpha. If the PRR changes significantly please ping in #prod-readiness
/approve
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: adisky, deads2k
The full list of commands accepted by this bot can be found here.
### Risks and Mitigations

If the user identifies several (or all) containers as keystones, the current logic doesn't define the correct behavior: whether to terminate the pod or keep it running. We mitigate this risk by allowing only 1 keystone container per pod for the moment, to safely gather user feedback.
I'm not sure I understand this risk. I think the KEP is saying to wait for all keystone containers to complete and then terminate the rest? What part is not defined?
I'm guessing they weren't planning to add monitoring logic to the kubelet to determine whether any more keystone containers exist. It sounds like they were just planning to have the kubelet check if an exited container was the keystone one and, if so, terminate the other containers/pod. I agree that the latter is simpler to implement, but it reduces the utility a bit.
```
// run post stop hook if exists
...
if featureflag keystoneContainersEnabled && containerIsKeystone(&container) {
    // kill container
```
Did you mean `kill pod`?
Is there a reason `753` is being used? Shouldn't it be `2872-keystone-containers`? IIRC, 753 is an older KEP for sidecars. Is this simply a copy/paste error?
Replying to #2869 (review) here, so it is less likely to get lost across content updates.

Update: Derek, Dawn, Mrunal, Sergey, and I have spent a bunch of time discussing this over the last few days. The problem is still complex, although I have a much better understanding of the concerns that have held this up, I think. I'll summarize the concerns here briefly, then talk about options moving forward. Some of the conversation is captured here: https://docs.google.com/document/d/1Q3685Ic2WV7jPo9vpmirZL1zLVJU91zd3_p_aFDPcS0/edit#bookmark=id.38mj0izh4h9o

There is some reluctance to endorse something that leaves the door open to further "semantic creep" (which could include things like "I want to run all sidecars in gVisor sandboxes" and "I want different security properties for different kinds of containers") without ALSO addressing the need to have "sidecars" for init containers. In fact, we're trying to avoid the word "sidecar" for now because of the baggage. I admit that my own runlevels proposal DOES leave that door open. Though I don't think we would ever go there, the conversation WOULD come up, almost certainly. We pivoted the conversation back to something more "surgical" that could make alpha, but if I am being honest, it doesn't feel right.

Design options being considered:

The question I struggle with is whether to push through something that has 50/50 odds of being discarded, or just to try to get option 2 to work (and likely miss this cycle). Regardless, I think we should plan to be more actively discussing how to do this "right" until we have a design that meets the criteria. I think we're closer than we have ever been.

@matthyx said:

I don't think the current proposal works, and given the timeframe (~6 hours to KEP freeze at the time of this writing, unless we request an exception) it seems unlikely that we can pin down enough about pre-init containers to satisfy KEP and PRR (I am reaching out to discuss), though I am willing to try if others are. I'm not sure release managers should make an exception for an alpha, anyway. Otherwise we miss this boat and shoot for 1.27, but we start on much better footing.

@sunny-b said:

Freeze is still a few hours from now, and it is POSSIBLE to get an exception, but honestly unlikely for an alpha. If we make it for 1.26, then code freeze is November 9 (not long!) and release ~Dec 6. If we don't make it, that would put the 1.27 release around April, code freeze around March, and KEP freeze around February.
@thockin thanks for all your efforts trying, really appreciate them.
Can do and will do.
Which SIG-node weekly meeting will this be discussed in, the main meeting or the triage meeting? I would like to be a fly on the wall.
Tuesday, so tomorrow. It's already on the agenda.
Looking forward to having this - how did the discussion at the SIG-node weekly go?
We agreed on creating a WG for the sidecar KEP, and @SergeyKanzhelev has set up a Doodle to choose the best time. I think we should also involve @derekwaynecarr and @thockin in these discussions... I might chime in again in the weekly meeting after KubeCon.
Hey everyone, sorry for not responding for a while - I am back from maternity leave and will soon catch up on this one.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs accordingly.
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
Superseded by #3761
@matthyx: Closed this PR.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Signed-off-by: Aditi Sharma [email protected]
Add KEP for keystone containers
Discussion on sig-node slack https://kubernetes.slack.com/archives/C0BP8PW9G/p1620304727344300