# Log tracking for K8s component log

<!-- toc -->
- [Release Signoff Checklist](#release-signoff-checklist)
- [Summary](#summary)
  - [Three new unique logging metadata](#three-new-unique-logging-metadata)
  - [Note](#note)
- [Motivation](#motivation)
  - [Goals](#goals)
  - [Non-Goals](#non-goals)
- [Proposal](#proposal)
  - [User Stories (Optional)](#user-stories-optional)
    - [Story 1](#story-1)
    - [Story 2](#story-2)
  - [Summary of Cases](#summary-of-cases)
  - [Logging metadata](#logging-metadata)
- [Design Details](#design-details)
  - [Prerequisite](#prerequisite)
  - [Design of ID propagation in apiserver](#design-of-id-propagation-in-apiserver)
  - [Design of ID propagation (controller)](#design-of-id-propagation-controller)
  - [Design of ID propagation in <a href="https://github.com/kubernetes/client-go">client-go</a>](#design-of-id-propagation-in-client-go)
  - [Design of Mutating webhook (out of tree)](#design-of-mutating-webhook-out-of-tree)
  - [Risks and Mitigations](#risks-and-mitigations)
  - [Test Plan](#test-plan)
  - [Graduation Criteria](#graduation-criteria)
    - [Alpha](#alpha)
    - [Beta](#beta)
    - [GA](#ga)
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
  - [Feature Enablement and Rollback](#feature-enablement-and-rollback)
  - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
  - [Monitoring Requirements](#monitoring-requirements)
  - [Dependencies](#dependencies)
  - [Scalability](#scalability)
  - [Troubleshooting](#troubleshooting)
- [Implementation History](#implementation-history)
<!-- /toc -->

## Release Signoff Checklist

Items marked with (R) are required *prior to targeting to a milestone / release*.

- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
- [ ] (R) KEP approvers have approved the KEP status as `implementable`
- [ ] (R) Design details are appropriately documented
- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
- [ ] (R) Graduation criteria is in place
- [ ] (R) Production readiness review completed
- [ ] Production readiness review approved
- [ ] "Implementation History" section is up-to-date for milestone
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
- [ ] Supporting documentation, e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

## Summary

This KEP proposes a method for adding three new unique pieces of logging metadata to Kubernetes component logs.
This metadata makes it easier to identify the specific logs related to a user request (such as `kubectl apply`) and to an object (such as a Pod or Deployment).
It is expected to greatly reduce investigation cost when troubleshooting.

### Three new unique logging metadata

We use three pieces of metadata. Each has different characteristics and supports troubleshooting from a different perspective.

| metadata name | characteristics |
| ------ | ------ |
| traceid | Spans a user request. Unique per user request. |
| spanid | Spans a controller action. Unique per controller action. |
| initialtraceid | Spans the entire object lifecycle. Unique for a set of related objects. |

### Note

This KEP describes **how** a component could add this metadata to its logs. Actually adding the metadata to Kubernetes component logs additionally requires the following steps:

- Open issues for each component and discuss them with the SIGs that own that component.
- After reaching agreement, use this KEP's feature to change the log call sites so that the metadata is added to those logs.

Please note that this KEP alone does not change the log format (it does not add metadata to logs).

## Motivation

Tracking the logs across Kubernetes components that relate to a specific user operation and its objects is very hard work.
Today, logs basically have to be matched up using timestamps and object names as hints.
When multiple users issue many API requests at the same time, it becomes very difficult to track a single operation across the logs of each Kubernetes component.

### Goals

- Implement a method that propagates the new logging metadata among Kubernetes components
- Design and implement it so that it does not interfere with the [Tracing KEP](https://github.com/kubernetes/enhancements/pull/1458)
  - e.g. the implementation of initialtraceid, adding traceid to object annotations in a mutating webhook, etc.

### Non-Goals

- Add the new logging metadata to actual Kubernetes component logs
  - This work will be done by opening issues after this KEP is completed
- Centrally manage the logs of each Kubernetes component with a Request-ID (this can be achieved with existing OSS such as Kibana, so there is no need to implement it in Kubernetes components)

## Proposal

### User Stories (Optional)

#### Story 1

A suspicious user operation (e.g. an unknown pod operation) or cluster action (e.g. an unexpected pod migration to another node) is detected.
Users want to understand the whole picture and the root cause.
As part of the investigation, it may be necessary to scrutinize the relevant logs of each component in order to reconstruct the series of cluster actions.
Without this log tracking feature, scrutinizing the relevant logs takes a long time, because component logs are independent of each other and it is difficult to find related logs and link them together.

This is similar to [Auditing](https://kubernetes.io/docs/tasks/debug-application-cluster/audit/), except for the following points:

- Auditing only collects information about HTTP requests sent to and received by the kube-apiserver, so it cannot track the internal work of each component.
- Audit logs cannot be associated with the logs for a user operation (a kubectl operation), because the auditID is different for each HTTP request.

#### Story 2

A PV fails to attach to a pod.
Prerequisite: it has been confirmed that the PV was created successfully.
In this case, volume creation on the storage side succeeded, and the mount process into the pod's container may have failed.
To identify the cause, it is necessary to look for the problem area while checking the component (kubelet) logs as well as the node's syslog and mount-related settings.

This log tracking feature helps identify the logs related to a specific user operation and the resulting cluster actions, and can reduce the investigation cost in such cases.

### Summary of Cases

- Given a component log (such as an error log), identify, by ID, the API request that caused it.
- Given an API request (such as a suspicious API request), find the resulting component logs.

### Logging metadata

We use three pieces of logging metadata and propagate them among Kubernetes components by using OpenTelemetry.
OpenTelemetry provides SpanContext and [Baggage](https://github.com/open-telemetry/opentelemetry-specification/blob/master/specification/baggage/api.md), which are used for propagation between Kubernetes components.

| metadata name | characteristics |
| ------ | ------ |
| traceid | We use SpanContext.TraceID as traceid.<br>traceid spans a user request.<br>traceid is unique per user request. |
| spanid | We use SpanContext.SpanID as spanid.<br>spanid spans a controller action.<br>spanid is unique per controller action. |
| initialtraceid | We introduce a new ID (InitialTraceID) in addition to SpanContext.<br>We use [Baggage](https://github.com/open-telemetry/opentelemetry-specification/blob/master/specification/baggage/api.md) to propagate initialtraceid.<br>We use the UID of the root object as initialtraceid.<br>initialtraceid spans the entire object lifecycle.<br>initialtraceid is unique for a set of related objects. |

All three IDs start at object creation and end at object deletion.
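
Since initialtraceid is propagated via Baggage rather than SpanContext, reading and writing the three IDs from a `context.Context` could look roughly like the sketch below. This is a minimal sketch, assuming the opentelemetry-go v1 `trace` and `baggage` packages and an illustrative baggage key name `initialtraceid`; the helper names are not part of this KEP.

```go
package tracking

import (
	"context"

	"go.opentelemetry.io/otel/baggage"
	"go.opentelemetry.io/otel/trace"
)

// IDs groups the three pieces of logging metadata described above.
type IDs struct {
	TraceID        string
	SpanID         string
	InitialTraceID string
}

// IDsFromContext reads traceid and spanid from the OpenTelemetry SpanContext
// and initialtraceid from a baggage member named "initialtraceid".
func IDsFromContext(ctx context.Context) IDs {
	sc := trace.SpanContextFromContext(ctx)
	return IDs{
		TraceID:        sc.TraceID().String(),
		SpanID:         sc.SpanID().String(),
		InitialTraceID: baggage.FromContext(ctx).Member("initialtraceid").Value(),
	}
}

// WithInitialTraceID stores initialtraceid in the context as baggage so that
// it is propagated on outgoing requests together with the trace context.
// For brevity this replaces any existing baggage; a real implementation
// would merge with it instead.
func WithInitialTraceID(ctx context.Context, id string) (context.Context, error) {
	m, err := baggage.NewMember("initialtraceid", id)
	if err != nil {
		return ctx, err
	}
	b, err := baggage.New(m)
	if err != nil {
		return ctx, err
	}
	return baggage.ContextWithBaggage(ctx, b), nil
}
```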

## Design Details

### Prerequisite

We need to consider three cases:

- Case 1: Requests from a client (such as kubectl) that create an object
- Case 2: Requests from a client (such as kubectl) other than creation (e.g. update, delete)
- Case 3: Requests from controllers

The design below is based on these three cases.

The following picture shows our design: ![design](./overview.png)

We do not make any modifications to kubectl in this design.

### Design of ID propagation in apiserver

**1. Call othttp's [Extract](https://pkg.go.dev/go.opentelemetry.io/otel/api/propagators#TraceContext.Extract)() to get the [SpanContext](https://pkg.go.dev/go.opentelemetry.io/otel/api/trace#SpanContext)**

- For a request from kubectl, the SpanContext is empty; call [Start](https://github.com/open-telemetry/opentelemetry-go/blob/3a9f5fe15f50a35ad8da5c5396a9ed3bbb82360c/sdk/trace/tracer.go#L38)() to start a new SpanContext (new traceid and spanid).
- For a request from a controller, we get a valid SpanContext (including traceid and spanid); call [Start](https://github.com/open-telemetry/opentelemetry-go/blob/3a9f5fe15f50a35ad8da5c5396a9ed3bbb82360c/sdk/trace/tracer.go#L38)() to update the SpanContext (new spanid).

**2. Chain the SpanContext and initialtraceid onto `r.ctx`**

- We use `r.ctx` to propagate the IDs to the next handler.

**3. Make up a new outgoing [request](#design-of-id-propagation-in-client-go)**
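
Extraction of the incoming trace context in the apiserver is also covered by the separate [API server tracing KEP](https://github.com/kubernetes/enhancements/tree/master/keps/sig-instrumentation/647-apiserver-tracing). The sketch below is only meant to illustrate steps 1 and 2; it assumes the opentelemetry-go v1 API, and the wrapper name is illustrative, not the actual apiserver filter.

```go
package filters

import (
	"net/http"

	"go.opentelemetry.io/otel/propagation"
	"go.opentelemetry.io/otel/trace"
)

// WithTracking is an illustrative handler wrapper: it extracts the W3C trace
// context and baggage from the incoming request, starts a server-side span,
// and stores both in the request context so that later handlers and outgoing
// calls can use them.
func WithTracking(handler http.Handler, tracer trace.Tracer, prop propagation.TextMapPropagator) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, req *http.Request) {
		// Extract SpanContext and Baggage from the incoming headers.
		// For kubectl the extracted SpanContext is empty, so Start creates a
		// brand new trace; for controllers it continues the incoming trace.
		ctx := prop.Extract(req.Context(), propagation.HeaderCarrier(req.Header))
		ctx, span := tracer.Start(ctx, req.Method+" "+req.URL.Path)
		defer span.End()

		handler.ServeHTTP(w, req.WithContext(ctx))
	})
}
```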

### Design of ID propagation (controller)

**1. Extract the SpanContext and initialtraceid from the object's annotations into a Go context**

**2. Propagate the Go context from objects to API calls**

When a controller creates/updates/deletes an object A based on another object B, we propagate the context from B to A, e.g.:

```
ctx = httptrace.SpanContextFromAnnotations(objB.GetAnnotations())

err = r.KubeClient.CoreV1().Create(ctx, objA...)
```

We do the propagation across objects without adding traces to those components.

**3. Make up a new outgoing [request](#design-of-id-propagation-in-client-go)**
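
A minimal sketch of what such a helper could look like is shown below. It assumes illustrative annotation keys (`trace.kubernetes.io/traceparent`, `trace.kubernetes.io/initial-trace-id`), the opentelemetry-go v1 API, and klog structured logging; none of these names are fixed by this KEP, and a real helper would likely also accept a parent context instead of using `context.Background()`.

```go
package httptrace

import (
	"context"

	"go.opentelemetry.io/otel/baggage"
	"go.opentelemetry.io/otel/propagation"
	"go.opentelemetry.io/otel/trace"
	"k8s.io/klog/v2"
)

// Illustrative annotation keys; the real names are not decided by this sketch.
const (
	annotationTraceParent    = "trace.kubernetes.io/traceparent"
	annotationInitialTraceID = "trace.kubernetes.io/initial-trace-id"
)

// SpanContextFromAnnotations rebuilds the SpanContext and initialtraceid
// stored in object B's annotations into a context.Context, so that the API
// call that creates/updates object A carries the same IDs.
func SpanContextFromAnnotations(annotations map[string]string) context.Context {
	prop := propagation.NewCompositeTextMapPropagator(propagation.TraceContext{}, propagation.Baggage{})
	carrier := propagation.MapCarrier{
		"traceparent": annotations[annotationTraceParent],
		"baggage":     "initialtraceid=" + annotations[annotationInitialTraceID],
	}
	return prop.Extract(context.Background(), carrier)
}

// LogWithIDs shows how a controller could then include the IDs in a
// structured log line (changing log call sites is out of scope for this KEP).
func LogWithIDs(ctx context.Context, msg string, keysAndValues ...interface{}) {
	sc := trace.SpanContextFromContext(ctx)
	kv := append([]interface{}{
		"traceid", sc.TraceID().String(),
		"spanid", sc.SpanID().String(),
		"initialtraceid", baggage.FromContext(ctx).Member("initialtraceid").Value(),
	}, keysAndValues...)
	klog.InfoS(msg, kv...)
}
```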

### Design of ID propagation in [client-go](https://github.com/kubernetes/client-go)

client-go helps inject the [TraceContext](https://pkg.go.dev/go.opentelemetry.io/otel/propagators#example-TraceContext) and [Baggage](https://github.com/open-telemetry/opentelemetry-specification/blob/master/specification/baggage/api.md) into the outgoing HTTP request headers. The changes are something like the following:

```diff
@@ -868,6 +871,7 @@ func (r *Request) request(ctx context.Context, fn func(*http.Request, *http.Resp
 	}
 	req = req.WithContext(ctx)
 	req.Header = r.headers
+	propagator := otel.NewCompositeTextMapPropagator(propagators.TraceContext{}, propagators.Baggage{})
+	propagator.Inject(ctx, req.Header)

 	r.backoff.Sleep(r.backoff.CalculateBackoff(r.URL()))
 	if retries > 0 {
```

The apiserver and controllers use this to make up new outgoing requests.
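
The diff above is written against an older opentelemetry-go API. With current (v1.x) opentelemetry-go the same injection would look roughly like the sketch below, where the `propagation.HeaderCarrier` wrapper is needed because `http.Header` does not implement `TextMapCarrier` directly. This is illustrative only, not the final client-go change.

```go
package rest

import (
	"net/http"

	"go.opentelemetry.io/otel/propagation"
)

// injectTraceHeaders writes the W3C `traceparent` and `baggage` headers for
// the trace context and baggage carried by req.Context() into the outgoing
// request, so that the apiserver (and any webhook) can pick them up.
func injectTraceHeaders(req *http.Request) {
	propagator := propagation.NewCompositeTextMapPropagator(
		propagation.TraceContext{}, // traceid / spanid
		propagation.Baggage{},      // initialtraceid
	)
	propagator.Inject(req.Context(), propagation.HeaderCarrier(req.Header))
}
```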

### Design of Mutating webhook (out of tree)

The mutating webhook stores the IDs carried on the incoming request into the object's annotations, so that controllers can later extract them (see the controller section above).

**1. Extract the SpanContext and initialtraceid from the request's headers**

**2. Store the SpanContext and initialtraceid in the object's annotations**

- For traceid, spanid, and traceflags (sampled/not sampled):

  Always set the trace context annotation based on the incoming request.

- For initialtraceid:

  If the initialtraceid is empty, use the UID of the object as the initialtraceid; otherwise, leave the initialtraceid as it is.
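
A minimal sketch of that mutation logic is shown below, ignoring the AdmissionReview plumbing and assuming the same illustrative annotation keys as the controller-side sketch; the function name, annotation keys, and baggage handling are assumptions, not the actual webhook implementation.

```go
package webhook

import (
	"context"
	"net/http"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"go.opentelemetry.io/otel/baggage"
	"go.opentelemetry.io/otel/propagation"
)

// Illustrative annotation keys (same assumption as the controller-side sketch).
const (
	annotationTraceParent    = "trace.kubernetes.io/traceparent"
	annotationInitialTraceID = "trace.kubernetes.io/initial-trace-id"
)

// mutateAnnotations applies the two rules described above.
func mutateAnnotations(obj metav1.Object, incoming http.Header) {
	annotations := obj.GetAnnotations()
	if annotations == nil {
		annotations = map[string]string{}
	}

	// Rule 1: always set the trace context annotation from the incoming request.
	annotations[annotationTraceParent] = incoming.Get("Traceparent")

	// Rule 2: reuse the initialtraceid propagated in the baggage header if
	// there is one; otherwise this object is the root of its lifecycle, so
	// fall back to its own UID.
	initialTraceID := baggage.FromContext(
		propagation.Baggage{}.Extract(context.Background(), propagation.HeaderCarrier(incoming)),
	).Member("initialtraceid").Value()
	if initialTraceID == "" {
		initialTraceID = string(obj.GetUID())
	}
	annotations[annotationInitialTraceID] = initialTraceID

	obj.SetAnnotations(annotations)
}
```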

### Risks and Mitigations

TBD

### Test Plan

All added code will be covered by unit tests.

### Graduation Criteria

#### Alpha

- The feature covers three important workload objects: Deployment, StatefulSet, DaemonSet
- Related unit tests described in this KEP are completed

#### Beta

- The feature covers other objects, not limited to ownerRef relationships
- All necessary tests are completed

#### GA

- Feedback about this feature is collected and addressed
- Enabled in Beta for at least two releases without complaints

## Production Readiness Review Questionnaire

### Feature Enablement and Rollback

_This section must be completed when targeting alpha to a release._

* **How can this feature be enabled / disabled in a live cluster?**
  - [x] Feature gate (also fill in values in `kep.yaml`)
    - Feature gate name: ObjectTraceIDs
    - Components depending on the feature gate: kube-apiserver, kube-controller-manager
  - [ ] Other
    - Describe the mechanism:
    - Will enabling / disabling the feature require downtime of the control
      plane?
    - Will enabling / disabling the feature require downtime or reprovisioning
      of a node? (Do not assume `Dynamic Kubelet Config` feature is enabled).

* **Does enabling the feature change any default behavior?**

  In the apiserver, the new request handlers added by this feature generate
  or update the trace information (IDs); the IDs are then added to the
  object's annotations by the webhook provided by this feature.

  In the controller-manager, when sending a request to the apiserver, the
  controller gets the IDs from the referenced object's annotations and
  injects them into the outgoing request headers in the W3C format.

* **Can the feature be disabled once it has been enabled (i.e. can we roll back
  the enablement)?**
  Yes.

* **What happens if we reenable the feature if it was previously rolled back?**
  Objects created during the rollback will have no initialtraceid until they
  are recreated.

* **Are there any tests for feature enablement/disablement?**
  Unit tests ensure that feature enablement/disablement works correctly.

### Rollout, Upgrade and Rollback Planning

_This section must be completed when targeting beta graduation to a release._

* **How can a rollout fail? Can it impact already running workloads?**
  If a rollout fails, the proposed IDs are simply not added or updated. This
  has no impact on running workloads.

* **What specific metrics should inform a rollback?**
  N/A

* **Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?**
  N/A

* **Is the rollout accompanied by any deprecations and/or removals of features, APIs,
  fields of API types, flags, etc.?**
  N/A

### Monitoring Requirements

_This section must be completed when targeting beta graduation to a release._

* **How can an operator determine if the feature is in use by workloads?**
  Check whether object annotations contain the proposed IDs.

* **What are the SLIs (Service Level Indicators) an operator can use to determine
  the health of the service?**
  N/A

* **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**
  N/A

* **Are there any missing metrics that would be useful to have to improve observability
  of this feature?**
  N/A

### Dependencies

_This section must be completed when targeting beta graduation to a release._

* **Does this feature depend on any specific services running in the cluster?**
  N/A

### Scalability

_For alpha, this section is encouraged: reviewers should consider these questions
and attempt to answer them._

_For beta, this section is required: reviewers must answer these questions._

_For GA, this section is required: approvers should be able to confirm the
previous answers based on experience in the field._

* **Will enabling / using this feature result in any new API calls?**
  N/A

* **Will enabling / using this feature result in introducing new API types?**
  N/A

* **Will enabling / using this feature result in any new calls to the cloud
  provider?**
  N/A

* **Will enabling / using this feature result in increasing size or count of
  the existing API objects?**
  N/A

* **Will enabling / using this feature result in increasing time taken by any
  operations covered by [existing SLIs/SLOs]?**
  N/A

* **Will enabling / using this feature result in non-negligible increase of
  resource usage (CPU, RAM, disk, IO, ...) in any components?**
  TBD

### Troubleshooting

The Troubleshooting section currently serves the `Playbook` role. We may consider
splitting it into a dedicated `Playbook` document (potentially with some monitoring
details). For now, we leave it here.

_This section must be completed when targeting beta graduation to a release._

* **How does this feature react if the API server and/or etcd is unavailable?**
  The feature will be unavailable.

* **What are other known failure modes?**
  TBD

* **What steps should be taken if SLOs are not being met to determine the problem?**
  N/A

## Implementation History

* 2020-09-01: KEP proposed
* 2020-09-28: PRR questionnaire updated
* [Mutating admission webhook which injects trace context for demo](https://github.com/Hellcatlk/mutating-trace-admission-controller/tree/trace-ot)
* [Instrumentation of Kubernetes components for demo](https://github.com/Hellcatlk/kubernetes/tree/trace-ot)