
KEP-2872 Add keystone containers KEP #2869

Closed
wants to merge 7 commits into from

Conversation

adisky
Contributor

@adisky adisky commented Aug 17, 2021

Signed-off-by: Aditi Sharma [email protected]

  • One-line PR description:
    Add KEP for keystone containers

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Aug 17, 2021
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/node Categorizes an issue or PR as relevant to SIG Node. labels Aug 17, 2021
@adisky adisky force-pushed the keystone-container-kep branch 2 times, most recently from 33e61a1 to 78b2777 Compare August 17, 2021 10:14
@adisky
Contributor Author

adisky commented Aug 17, 2021

cc @matthyx @rata

@adisky adisky force-pushed the keystone-container-kep branch from 78b2777 to f369f6a Compare August 18, 2021 07:22
@adisky adisky mentioned this pull request Aug 18, 2021
4 tasks
@adisky adisky force-pushed the keystone-container-kep branch from f369f6a to 29a12db Compare August 18, 2021 07:29
@adisky adisky changed the title [WIP] KEP-753 Add keystone containers KEP [WIP] KEP-2872 Add keystone containers KEP Aug 18, 2021
@adisky adisky force-pushed the keystone-container-kep branch from 29a12db to 0cc759b Compare August 18, 2021 07:42
@adisky adisky changed the title [WIP] KEP-2872 Add keystone containers KEP KEP-2872 Add keystone containers KEP Aug 18, 2021
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 18, 2021
@adisky
Contributor Author

adisky commented Aug 23, 2021

cc @derekwaynecarr @dchen1107

@adisky
Contributor Author

adisky commented Sep 2, 2021

/cc @mrunalp

@k8s-ci-robot k8s-ci-robot requested a review from mrunalp September 2, 2021 10:44

There are a few options for annotations:

1. Specifying a list of keystone or sidecar containers in one annotation

@timbyr timbyr Sep 10, 2021


I believe this can produce some problems with the use of mutating webhooks to inject sidecar configurations.

Using alpha.sidecars.kubernetes.io/<container name>: string rather than alpha.node.kubernetes.io/keystone: or alpha.node.kubernetes.io/sidecars: reduces the cost of adopting this feature for mutating-webhook-based sidecar injection, since the mutators can mark their own container without requiring the workload owners to adopt the feature.

I believe this behaviour aligns with the noted motivation as well.
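
For illustration only, the two annotation shapes being compared might look roughly like this on a pod (the annotation names, values, and container names are placeholders taken from this discussion, not a settled API; a real pod would carry only one of the two shapes):

apiVersion: v1
kind: Pod
metadata:
  name: example
  annotations:
    # shape A: one annotation per container, which an injecting webhook can add
    # independently without touching any other annotation
    alpha.sidecars.kubernetes.io/istio-proxy: "sidecar"
    # shape B: a single annotation whose value lists the keystone (or sidecar) containers
    alpha.node.kubernetes.io/keystone: "my-app"
spec:
  containers:
  - name: my-app
    image: my-app:latest
  - name: istio-proxy
    image: istio/proxyv2:latest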

Member

Sorry, I'm not sure I follow what you're saying. If the list of keystone containers is the value of one annotation, what happens exactly?

Member

@timbyr Do you mean doing this to allow webhooks to mark their own injected containers as keystone containers (or explicitly as not keystones)?


For example, Gatekeeper's AssignMetadata is only able to set an annotation; it can't append to an existing annotation. So an annotation per container name is easier to implement.

Using the istio example, the following would be enough:

apiVersion: mutations.gatekeeper.sh/v1alpha1
kind: AssignMetadata
metadata:
  name: istio-proxy-is-not-keystone
spec:
  applyTo:
  - groups:
    - ""
    kinds:
    - Pod
    versions:
    - v1
  location: metadata.annotations."sidecars.alpha.kubernetes.io/istio-proxy"
  parameters:
    assign:
      value: sidecar  # other possible value: keystone

(NB: the name and value of the annotation is an example)


Definitely agree with the above. Since many sidecars are injected, the only reliable way to implement this from a webhook would be to add an annotation per sidecar container. Anything else has ordering issues in the face of multiple webhooks, I think.

Contributor Author

Will avoid using alpha.sidecars.kubernetes.io/<container name>: string and prefer alpha.node.kubernetes.io/keystone or alpha.node.kubernetes.io/sidecars.

@howardjohn sorry, I did not understand the single-annotation-vs.-list problem


Say a user creates a pod with 1 container. There are 2 webhooks injecting 2 sidecars.

  • Using the keystone annotation is not viable, because no webhook can safely inject it: they don't know whether something is a keystone or not. I would NOT expect users to manually mark something as a keystone when using auto-injected sidecars. For example, the kubectl default container annotation is injected by Istio.
  • So that leaves the sidecar annotation. I suppose it's possible to use it, but you would need some logic to append to the list if it exists and create it if not. For example, the gatekeeper example above probably wouldn't work. Note that a real API field probably would, since a merge would append to the list.
    It also leaves room for misconfiguration if both annotations are set (possibly by 2 different webhooks).

Member

@rata rata Dec 10, 2021


Ok, let's go for the other option listed (one annotation per container). Thanks for the input!

Member

@rata rata Dec 22, 2021


Taking into account Tim's input here, the way to go for now is an annotation for the keystone container, with an artificial limit of 1 keystone container that we can later lift. See also this and this

This will only be to gather data; we will move to a pod.spec field in the future, probably when we implement one of the pre-proposals (we need to decide which one first).
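
For illustration only, the interim shape described above might look roughly like this (the annotation name and value are placeholders, reusing the naming from the Gatekeeper example earlier in this thread, not a settled API):

apiVersion: v1
kind: Pod
metadata:
  name: example
  annotations:
    # per-container annotation; for alpha, at most one container may be marked keystone
    sidecars.alpha.kubernetes.io/my-app: keystone
spec:
  restartPolicy: Never
  containers:
  - name: my-app              # the keystone: when it exits successfully, the pod can terminate
    image: my-app:latest
  - name: istio-proxy         # injected helper; no longer keeps the pod alive on its own
    image: istio/proxyv2:latest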

@endocrimes
Member

/cc

@adisky
Contributor Author

adisky commented Oct 12, 2021

@thockin We would like to have your early input on this idea, since you looked into the sidecar problems in depth and also authored one sidecar proposal.

/cc @thockin

@enj
Member

enj commented Oct 24, 2021

Why is this annotation based instead of a proper (alpha feature gated) field on the pod? What value does starting off as an annotation provide?

@matthyx
Contributor

matthyx commented Oct 24, 2021

The idea here is to not introduce a new API change before we move to beta...

@enj
Member

enj commented Oct 24, 2021

The idea here is to not introduce a new API change before we move to beta...

The annotation is a new API, just a harder to discover and reason about one because it lacks proper typing through a field. You can also do proper conversions from an alpha field to a beta field.

@thockin
Member

thockin commented Sep 21, 2022

Hi @matthyx and @adisky - I was OOO last week, turning back to KEPs now. Please do not give up on us! You just happened to be wading into one of the harder problems in one of the more delicate subsystems.

I'll make another read through this now.

Member

@thockin thockin left a comment


John's arguments about injection and the "polarity" of this flag stand out to me. There was a reason we had it in the other direction previously.

When I originally talked about sidecars, it was a slightly different semantic. In particular, what I meant was "if this container dies, the whole pod falls down". In particular, all other peer-containers would be torn down, emptyDir volumes would be wiped, namespaces would be torn down and re-established, etc. It was the "have you tried rebooting it" model. I'm not actually convinced that is a useful semantic, but it is notable that there's no way to express it today.

@derekwaynecarr @dchen1107 Do you have thoughts? Do you acknowledge the problem described here? Do you think it is worth solving at all?


This is particularly helpful in solving the blocking issue of sidecar containers with jobs. This will also help us to make a step forward with the sidecar KEP.

We are calling `application/essential/main/critical` containers `keystone` containers in this KEP to distinguish them from sidecars.
Member

I think I am partly to blame for saying "keystone" - I'm not sure it's the best word, so we should be open to alternative naming, e.g. "primary" or something


This identification of application containers is used by the kubelet to decide on pod termination, based on the status of the application containers and the pod restart policy.

- if all application containers exited successfully and the restart policy of the pod is "Never" or "OnFailure", then terminate the pod.
Member

"terminate the pod, respecting the value of terminationGracePeriodSeconds."

Right? This still gives non-keystone containers a chance to finish their jobs, push logs, whatever.

-->

### Non-Goals
- Handling Container startup/shutdown ordering for application and sidecar containers.
Member

In re-familiarizing myself with this, it really is mostly the inverse of the previous sidecar KEP, isn't it? It's simpler in that it doesn't ascribe startup or shutdown ordering, which does limit its utility. This KEP alone does not really solve the network-proxy sidecar problems (not completely, anyway). The question is - does it provide enough value to justify itself for now?

Let's assume that the fundamental problems that have stalled the bigger solution will not go away any time soon. This KEP, if implemented, could be around for years. If we leave room, maybe it could grow startup/shutdown ordering semantics. If we add "real" sequencing (whether that is a DAG or phases or ...), it would obsolete this, but I think we could still define it compatibly - you get to pick one or the other model, but not both. It's tech debt that we could probably never drop, but it's not short-term debt. I would not take this on to accelerate by 1 or 2 releases, but I might take it on to save O(years).

Update KEP template and add PRR YAML file
@matthyx matthyx force-pushed the keystone-container-kep branch from de3fcbe to 09a43b7 Compare September 23, 2022 11:38
@rata
Member

rata commented Sep 26, 2022

Commenting here just for visibility. Please take a look at @SergeyKanzhelev's proposal to add a per-container restart policy: #2869 (comment)

This is a simpler way to mimic the same behavior proposed in this KEP.

Please note that both sidecar proposals to solve the dependency issue in a general way (phases and explicit dependencies) also suggested the creation of a per-container restart policy. So this clearly fits nicely with any of the proposals we currently have and might implement in the future.
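
For illustration only, a per-container restart policy could be sketched roughly like this (the field placement and value here are hypothetical, meant only to convey the idea from the linked comment, not an existing API):

apiVersion: v1
kind: Pod
metadata:
  name: job-pod-with-helper
spec:
  restartPolicy: Never              # pod-level policy, applies to the main workload
  containers:
  - name: my-app
    image: my-app:latest
  - name: istio-proxy
    image: istio/proxyv2:latest
    restartPolicy: Always           # hypothetical per-container override for the helper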

@karlkfi

karlkfi commented Sep 29, 2022

This KEP alone does not really solve the network-proxy sidecar problems (not completely, anyway). The question is - does it provide enough value to justify itself for now?

Rephrased: What is the concrete use case being solved for and what are the requirements?

I honestly don't know what use case this proposal solves for, so I'll sketch out a straw-man for the ones I do understand. Feel free to identify a different set.

Use case:
Istio sidecar proxy (more generically, a "sidecar service").

Context:
The primary workload depends on an API that the sidecar service provides. If the sidecar service is not ready, the primary workload will error and exit.

Requirements:

  1. One or more sidecar services need to be ready (not just started) before the primary workload is started.
  2. All sidecars need to be terminated after the primary workload has exited (not just terminated).

Today...

  • If the primary container exits because the sidecar service is not ready, the primary workload will simply be restarted (given the right restart policy) and eventually succeed after the sidecar service is ready.
  • If the pod is terminated, the containers are all terminated at roughly the same time in a for loop, which causes flaky behavior. If the primary workload exits first and exits fast, everything might work out fine. But if it's slow, which is usually true for services that need to shut down connections, the sidecar service often ends up exiting first, and then the primary workload often errors because it tries to use the sidecar service and can't find it.
  • If the Pod is in a Job, the primary workload is usually designed to exit 0 when done. But since the sidecar service is still running, the Pod keeps running. So the Job never successfully completes.

So if you solve requirement 2 and the primary workload exiting causes the sidecar service to be terminated, then the primary workload can no longer depend on the restart behavior of being restarted until after the sidecar service is ready, creating the need for a solution to requirement 1.

Are there any other concrete use cases which have requirement 2, but not requirement 1?

If yes, then maybe this proposal to solve requirement 2 IS enough, provided it can be extended to later solve or be compatible with a different solution to requirement 1.
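
For illustration only, the Job case described above could be sketched like this (the image names and the injected sidecar are examples; the manifest is valid as written and exhibits the problem, since the pod, and therefore the Job, never completes while the sidecar keeps running):

apiVersion: batch/v1
kind: Job
metadata:
  name: db-migration
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: migrate
        image: example.com/migrator:latest   # the primary workload; exits 0 when done
      - name: istio-proxy
        image: istio/proxyv2:latest          # injected sidecar; keeps running, so the pod
                                             # never terminates and the Job never completes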

@thockin
Member

thockin commented Oct 1, 2022

both sidecar proposals ... also suggested the creation of a per-container restart policy

per-container restart fleshes out the model, but is not sufficient to describe "this is a helper".

So if you solve requirement 2 and the primary workload exiting causes the sidecar service to be terminated, then the primary workload can no longer depend on the restart behavior of being restarted until after the sidecar service is ready, creating the need for a solution to requirement 1.

Exit != success. Most jobs are restart-on-failure, that being a non-zero exit code.

Are there any other concrete use cases which have requirement 2, but not requirement 1?

Any case where a "helper" (just avoiding the word sidecar to break assumptions of semantics) provides some functionality which the app container can retry-loop around. I have argued many times that if your app exits or crashes as soon as a dep fails, you are doing it wrong. Things crash. You have to assume that anything you depend on will, occasionally, crash. As such, the startup-ordering problem is, IMO, less critical than the shutdown problem.

That said, there are some specific cases where you really do need sequencing, but I think that's more of a "X must happen before Y" problem than "X must be up for Y" one. For example, a network proxy wants to install a hook to intercept traffic before the app ever starts. I DO NOT think we are attacking that problem here.

If yes, then maybe this proposal to solve requirement 2 IS enough, provided it can be extended to later solve or be compatible with a different solution to requirement 1.

I think the proposal I made satisfies that, personally, but I feel like I am not convincing anyone.

@thockin
Member

thockin commented Oct 1, 2022

Real discussion: #2869 (review)

@deads2k
Contributor

deads2k commented Oct 4, 2022

PRR looks good for alpha. If the PRR changes significantly please ping in #prod-readiness

/approve
approving PRR

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: adisky, deads2k
Once this PR has been reviewed and has the lgtm label, please assign dchen1107 for approval by writing /assign @dchen1107 in a comment. For more information see: The Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


### Risks and Mitigations

If the user identifies several (or all) containers as keystones, the current logic doesn't determine the correct behavior, i.e. whether to terminate the pod or keep it running. We mitigate this risk by allowing only 1 keystone container per pod for the moment, so we can safely gather user feedback.
Member

I'm not sure I understand this risk. I think the KEP is saying to wait for all keystone containers to complete and then terminate the rest? What part is not defined?


I'm guessing they weren't planning to include monitoring logic in the kubelet to determine whether any more keystone containers exist. It sounds like they were just planning to have the kubelet check if an exited container was the keystone one and, if so, terminate the other containers/pod. I agree that the latter is simpler to implement but reduces the utility a bit.

// run post stop hook if exists
...
if featureflag keystoneContainersEnabled && containerIsKeystone(&container) {
// kill container

Did you mean kill pod?

@@ -0,0 +1,946 @@
<!--

@sunny-b sunny-b Oct 4, 2022


Is there a reason 753 is being used? Shouldn't it be 2872-keystone-containers? IIRC, 753 is an older KEP for sidecars. Is this simply a copy/paste error?

@thockin
Member

thockin commented Oct 6, 2022

Replying to #2869 (review) here, so it is less likely to get lost across content updates.

Update:

Derek, Dawn, Mrunal, Sergey, and I have spent a bunch of time discussing this over the last few days. The problem is still complex, although I have a much better understanding of the concerns that have held this up, I think. I'll summarize the concerns here briefly, then talk about options moving forward. Some of the conversation is captured here: https://docs.google.com/document/d/1Q3685Ic2WV7jPo9vpmirZL1zLVJU91zd3_p_aFDPcS0/edit#bookmark=id.38mj0izh4h9o

There is some reluctance to endorse something that leaves the door open to further "semantic creep" (which could include things like: "I want to run all sidecars in gvisor sandboxes" and "I want different security properties for different kinds of containers") without ALSO addressing the need to have "sidecars" for init-containers. In fact, we're trying to avoid the word "sidecar" for now because of the baggage.

I admit that my own runlevels proposal DOES leave that door open. Though I don't think we would ever go there, the conversation WOULD come up, almost certainly.

We pivoted the conversation back to something more "surgical" that could make alpha, but if I am being honest, it doesn't feel right.

Design options being considered:

  1. pod.spec.containers[].lifecycle.onCompletion: TerminatePod - this has real drawbacks (you can only have 1 such) and unclear use-cases beyond a kludgey way to get sidecars to die when apps "finish" (a rough sketch of this shape follows the list)
  2. "runlevels" which include pre-initContainers - this feels like a good solution but I am not convinced we know enough to pull it off yet (see below)
  3. pod.spec.containers[].lifecycle.endOfLife: Coerced - the naming is terrible, but the idea was to extend option 1 to support multiple "app" containers. It's surgical, but if we implement pre-init containers, will probably be obsoleted.
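
For concreteness, option 1 above might be sketched roughly like this (neither the field nor the value exists in the API; this is only an illustration of the idea, with example images):

apiVersion: v1
kind: Pod
metadata:
  name: example
spec:
  restartPolicy: Never
  containers:
  - name: my-app
    image: my-app:latest
    lifecycle:
      onCompletion: TerminatePod   # hypothetical: when this container finishes, tear down the rest of the pod
  - name: istio-proxy
    image: istio/proxyv2:latest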

The question I struggle with is whether to push thru something that has 50/50 odds of being discarded, or just to try to get option 2 to work (and likely miss this cycle).

Regardless, I think we should plan to be more actively discussing how to do this "right" until we have a design that meets the criteria. I think we're closer than we have ever been.

@matthyx said:

PRR is approved, we just need approve from sig-node leads (@dchen1107 or @derekwaynecarr)... @thockin do you think we should merge the KEP to have it tracked and continue debating after the freeze?

I don't think the current proposal works, and given the timeframe (~6 hours to KEP freeze at the time of this writing, unless we request exception) it seems unlikely that we can pin down enough about pre-init containers to satisfy KEP and PRR (I am reaching out to discuss), though I am willing to try if others are. I'm not sure release managers should make an exception for an alpha, anyway. Otherwise we miss this boat and shoot for 1.27, but we start on much better footing.

@sunny-b said:

So I'm assuming the 1.26 freeze has already started, but we should be able to merge this in afterwards for the 1.27 release? I'm not very familiar with the K8s release process. @thockin do you know when would the 1.27 release likely be? And if this were to get merged soon after the 1.26 freeze, would the 1.27 release need to come out first before any implementation could occur?

Freeze is still a few hours from now, and it is POSSIBLE to get an exception, but honestly, unlikely for an alpha. If we make it for 26, then code freeze is November 9 (not long!) and release ~Dec 6. If we don't make it, that would put the 1.27 release around April, code freeze around March, and KEP freeze around February.

@matthyx
Contributor

matthyx commented Oct 7, 2022

@thockin thanks for all your efforts trying, really appreciate them.
Let's create a WG for that; I'm gonna add this to sig-node's weekly meeting.
I think it'd be easier to choose someone in a US timezone to lead the WG discussions, otherwise my family will hate me.
@SergeyKanzhelev can I propose you?

@SergeyKanzhelev
Member

@SergeyKanzhelev can I propose you?

can do and will do.

@sunny-b

sunny-b commented Oct 10, 2022

Which SIG-node weekly meeting will this be discussed in, the main meeting or triage meeting? I would like to be a fly on the wall.

@matthyx
Contributor

matthyx commented Oct 10, 2022

Which SIG-node weekly meeting will this be discussed in, the main meeting or triage meeting? I would like to be a fly on the wall.

Tuesday, so tomorrow. It's already on the agenda.

@Jmainguy

Looking forward to having this, how did the discussion at SIG-node weekly go?

@matthyx
Contributor

matthyx commented Oct 24, 2022

Looking forward to having this, how did the discussion at SIG-node weekly go?

We agreed on creating a WG for the sidecar KEP, and @SergeyKanzhelev has set up a Doodle to choose the best time.
https://doodle.com/meeting/participate/id/bkZyZgJa/vote
(looks like not many folks have participated)

I think we should also involve @derekwaynecarr and @thockin in these discussions... I might chime in again at the weekly meeting after KubeCon.

@adisky
Contributor Author

adisky commented Oct 26, 2022

Hey everyone, sorry for not responding for a while. I am back from maternity leave and will soon catch up on this one.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 24, 2023
@matthyx
Contributor

matthyx commented Jan 24, 2023

superseded by #3761
/close

@k8s-ci-robot
Contributor

@matthyx: Closed this PR.

In response to this:

superseded by #3761
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. sig/node Categorizes an issue or PR as relevant to SIG Node. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.