# Propose log tracking KEP #1961
Conversation
Hi @hase1128. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test; once the patch is verified, the new status will be reflected accordingly. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
This KEP isn't finished yet (most of it is copied from the discussion in Slack), but I'm letting you know now that I've created it.
/assign @dashpole

@hase1128 I believe this KEP is based on an older template that is missing a number of questions for production-readiness that need to be answered. Can you please rebase this based on the new template?
/ok-to-test

/cc @44past4

/retitle [WIP] Propose log tracking KEP

Please remove WIP when the KEP is ready for review, thanks!
Some format issues here:

- the dir should be `keps/sig-instrumentation/1668-log-tracking/`
- this file (`20200901-log-tracking.md`) should be renamed `README.md`
- a `kep.yaml` file, which will contain milestones, status, and other info about the KEP, should also be included in this dir

The templates that you'd use for your KEP can be found here:
https://github.com/kubernetes/enhancements/tree/master/keps/NNNN-kep-template
Also, as a reminder to be included in a release:
- The KEP must be merged in an implementable state
- The KEP must have test plans
- The KEP must have graduation criteria.
As an example please see: https://github.com/kubernetes/enhancements/tree/master/keps/sig-instrumentation/1602-structured-logging
PRR approval is now required for the 1.21 milestone. Without it, this KEP cannot merge. This is new, and I don't think anyone has the correct OWNERS/directory set up; I will follow up with the release team. We will also need at least one non-Google approver for this PR; currently all listed reviewers/approvers appear to be employed at Google. Can you please work to identify one?
Thanks for the comments and for kindly following up.
I would be very happy if you could be one of the approvers.
I suppose it has to be either myself or @brancz :) I sent an email about the new mandatory PRR review to k-dev; it seems to have flown under people's radars. We should be discussing it at the leads meetings next week.
Thank you. At the final approval of the KEP design, let me ask one of you for approval.
I will continue to pay attention to the PRR situation.
Signed-off-by: Li Zhijian <[email protected]>
update log example
@hase1128: The following test failed:

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
| Metadata | Description |
| ------ | ------ |
| traceid | spans an user request. unique for user's request |
| spanid | spans a controller action. unique for controller actions |
| objsetid | spans the entire object lifecycle. unique for a related object set |
What do you mean by an *object set* here? The first sentence refers to the object lifecycle, but we already have a unique identifier for objects, `uid`. What's the difference here?
One related object set may include a Deployment, a ReplicaSet, and some Pods; these objects share the same objsetid. Below is an example of a group of related objects:

```yaml
deployment:
  uid: D123456
  annotations:
    - objsetid: D123456
    - traceid: t1234567890
    - spanid: s123456
---
replicaset:
  uid: R123456
  annotations:
    - objsetid: D123456   # <<< same as the deployment's objsetid
    - traceid: t0123456789
    - spanid: s012345
---
pod:
  uid: P123456
  annotations:
    - objsetid: D123456   # <<< same as the deployment's objsetid
    - traceid: t9012345678
    - spanid: s501234
```
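The inheritance rule in the example above (a child object carries its owner's objsetid, while a root object uses its own UID) could be sketched roughly as follows. This is a hypothetical illustration, not code from the KEP; the helper name and annotation key are assumptions:

```go
package main

import "fmt"

// objsetidKey is a hypothetical annotation key; the KEP does not fix a name.
const objsetidKey = "objsetid"

// inheritObjsetid copies the owner's objsetid annotation onto the child,
// falling back to the child's own UID when there is no owner objsetid
// (i.e. the child is the root of a new object set).
func inheritObjsetid(childUID string, childAnnotations, ownerAnnotations map[string]string) map[string]string {
	if childAnnotations == nil {
		childAnnotations = map[string]string{}
	}
	if id, ok := ownerAnnotations[objsetidKey]; ok {
		childAnnotations[objsetidKey] = id // same objsetid as the owner
	} else {
		childAnnotations[objsetidKey] = childUID // root object: objsetid == own UID
	}
	return childAnnotations
}

func main() {
	deploy := inheritObjsetid("D123456", nil, nil)
	rs := inheritObjsetid("R123456", nil, deploy)
	pod := inheritObjsetid("P123456", nil, rs)
	fmt.Println(deploy[objsetidKey], rs[objsetidKey], pod[objsetidKey]) // all D123456
}
```

Under this rule, the Deployment, ReplicaSet, and Pod in the example all end up with `objsetid: D123456`, matching the YAML above.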
Should it be a `rootid`? I have never heard the term *object set*, so correct me if I'm wrong. I think we should look into other relations between objects that we also want to trace, for example PV/PVC. In that case the term *object set* is hard to apply.
| spanid | spans a controller action. unique for controller actions |
| objsetid | spans the entire object lifecycle. unique for a related object set |
### Opt-in for Components
I don't think we should develop features that don't have a dedicated minimal target adoption. We should define the set of components we plan to change, so that the changes are useful for users.
Do you mean that we should list the components which will be changed in this KEP doc?
Our purpose is to identify the specific logs related to a user request and object, so I think the commands/components are `kubectl apply/create/delete`, deployment/replicaset/pod, and so on.
### Opt-in for Components
Updating/changing the logs is outside the scope of this KEP. To actually add metadata to K8s component logs, the following procedures are needed in addition.
- Open issues for each component, and discuss them with the SIGs that own that component.
We cannot push work onto other SIGs and expect that they will do it on the basis that it's important for sig-instrumentation. We should scope the work to a set of components that will allow it to be beneficial and finishable by sig-instrumentation.
For our logging implementation, I think we can just open issues for each component and discuss them with the SIGs that own that component. Do you think this plan counts as "pushing work onto other SIGs"?
Should we open a new logging KEP and discuss each component there?
It's generally easier to target SIGs individually, since KEPs can easily balloon (causing unnecessary bikeshedding). Insofar as practically applying @serathius' suggestion, it may make sense to separate 'core functionality', i.e. library code for tracing, from the component integration story.
Tracking logs among each Kubernetes component related to a specific user operation and objects is very tough work.
It is necessary to match logs by basically using timestamps and object's name as hints.
If multiple users throw many API requests at the same time, it is very difficult to track logs across each Kubernetes component log.
I understand that this is a currently unsolved problem, but could you provide some info on why this problem is worth solving? Why is tracking user operations important? Examples: debugging, auditing, reproducing problems?
### Goals

- Implement method which propagates new logging metadata among each K8s components
- Store and update logging metadata into object annotation through mutating admission controller (aka Webhook)
You put some implementation details (like *object annotation*, *mutating admission controller*) in the goals. I understand that we would like to use them to achieve the goal of adding metadata, but should their usage be a goal per se?
I think the goals should stay at a level that describes what result we expect from the user/developer point of view.
Thanks, we will update.
### Non-Goals

- Add new logging metadata into actual K8s component logs
If the goal of this KEP is introducing trace information into logs, then we skip this process. If the main benefit of the KEP is introducing this information, then the KEP should include this work, possibly in reduced scope, but enough to provide a benefit.
For example, the structured logging KEP's goal was not to rewrite all logs, but to provide an interface and use it in enough places to provide benefit (25 log line changes were enough to improve 99.9% of log output). Maybe the same approach would work here?
As you said, we are now considering separating tracing and logging. If we separate them, I think "Add new logging metadata into actual K8s component logs" becomes a non-goal; we want to realize that work in the future.
Thank you for your suggestion about the method you used in structured logging. Maybe it will help our work.
- Add new logging metadata into actual K8s component logs
  - This task will be done by opening issues after completing this KEP
- To centrally manage the logs of each Kubernetes component with objsetid (this can be realized with existing OSS such as Kibana, so there is no need to implement it in Kubernetes components).
I don't understand this point. What do you mean by *manage the logs of each component*?
This means that we will not implement the features (log analysis/query/visualization) which are realized with existing OSS such as ELK. We should write this in more detail.
- This proposal does not add additional telemetry to any components, just context-based metadata to existing logs. This KEP doesn't require running any additional OpenTelemetry components (such as the OpenTelemetry collector, which the [API Server Tracing](https://github.com/kubernetes/enhancements/issues/647) KEP uses)

## Background

It's known that all requests to K8s reach the API server first. In order to propagate this metadata to the mutating admission controller, we require the API server to support propagating it as well. Fortunately there is already a KEP, [API Server Tracing](https://github.com/kubernetes/enhancements/issues/647), to do this.
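For context, the request trace metadata that API Server Tracing propagates between components travels in the W3C Trace Context `traceparent` HTTP header, which packs a version, trace ID, parent span ID, and flags into one hyphen-separated string. A minimal illustrative parser, not code from either KEP:

```go
package main

import (
	"fmt"
	"strings"
)

// parseTraceparent splits a W3C Trace Context "traceparent" header
// ("version-traceid-spanid-flags") into its trace ID and parent span ID.
// The trace ID is 32 hex characters; the span ID is 16.
func parseTraceparent(h string) (traceID, spanID string, err error) {
	parts := strings.Split(h, "-")
	if len(parts) != 4 || len(parts[1]) != 32 || len(parts[2]) != 16 {
		return "", "", fmt.Errorf("malformed traceparent: %q", h)
	}
	return parts[1], parts[2], nil
}

func main() {
	// Example header value taken from the W3C Trace Context specification.
	tid, sid, err := parseTraceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
	fmt.Println(tid, sid, err)
}
```

This header-based mechanism carries per-request context; it is not aware of K8s objects, which is exactly the gap the objsetid annotation discussion above is about.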
Is this up to date with the latest changes to the API Server Tracing KEP?
That KEP covers passing request trace information, but from your description you still assume object tracing. Are we sure that this is compatible? I expect OpenTelemetry to use HTTP request headers to pass request trace information; that approach is not K8s-object aware.
@dashpole to confirm
API Server tracing will propagate trace context from incoming requests to outgoing requests, including to etcd, re-entrant apiserver calls, and to webhooks. This section isn't saying that the API server will propagate context to objects, just that the API server will propagate context to a webhook.
Sure, I was just confused, as the Summary section states *This KEP proposes a method for adding new three unique logging metadata into annotation of objects and propagating them across components*, which is not compatible with what's described here.
In this case, the volume generation on the storage side is OK, and there is a possibility that the mount process to the container in the pod is NG.
In order to identify the cause, it is necessary to look for the problem area while checking the component (kubelet) log as well as the system-side syslog and mount-related settings.

This log tracking feature is useful to identify the logs related to specific user operation and cluster processing, and can reduce investigation cost in such cases.
This story describes the need to identify all logs related to a PV in multiple Kubernetes components. Is object tracing needed in such a case? I think having an annotation/label referencing the PV name would be enough to easily find kubelet logs referring to the PV, wouldn't it? Adding trace information would allow us to separate requests from each other, but I don't think that's needed in this case.
This log tracking feature is useful to identify the logs related to specific user operation and cluster processing, and can reduce investigation cost in such cases.

### Summary of Cases

- Given a component log (such as an error log), find the API request that caused this (error) log.
I don't think you can get back the API request this way; you will only get a request id. Possibly it would be useful to be able to correlate component logs to audit logs, which store the original request content.
I'm sorry, the current sentence seems to have been misleading.
It meant that we can easily trace back to the log messages closest to the API request, and then find the API request via audit logs (as you said) or some other way.
I see two main threads in this KEP:
- object tracing, picked up after the original Tracing KEP reduced its scope
- adding context information to logs

Both are deep topics that can lead to a lot of discussion, making it hard to move forward with the design.
I think focusing this KEP on only one of those problems would help it move forward. The object tracing problem is mainly related to api-machinery: storing, passing, and retrieving trace information (complicated design, small implementation). On the other hand, context logging is mostly the design of a golang interface for the logging library and changing logging library handling (from global klog methods to ones based on context): simple design, but complicated implementation/code migration.
As this KEP has put the logging implementation out of scope ("Add new logging metadata into actual K8s component logs" is in the non-goals), I would propose to focus this KEP on object tracing.
Once @serathius's feedback is addressed, I will give this a review.

We have rewritten the KEP and opened a new PR#2312 to focus only on trace propagation, as Marek suggested.
This KEP proposes a method for adding new three unique logging meta-data into K8s component logs.
It makes us more easy to identify specific logs related to an user request (such as `kubectl apply`) and object (such as Pod, Deployment).
It is expected to reduce investigation cost greatly when trouble shoothing.
s/trouble shoothing/troubleshooting/
Thanks for the review, we have sent the new PR as @serathius suggested. PTAL, any comments would be appreciated.

Does that mean this should be closed in favour of #2312?

Okay, I will close this PR.
I created a new KEP instead of the Request-ID KEP.
The design of the new KEP was discussed in the Slack thread.