
Propose log tracking KEP #1961

Closed
wants to merge 41 commits

Conversation


@hase1128 hase1128 commented Sep 1, 2020

I created a new KEP instead of the Request-ID KEP.
The design of the new KEP was discussed in a Slack thread.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Sep 1, 2020
@k8s-ci-robot
Contributor

Hi @hase1128. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Sep 1, 2020
@k8s-ci-robot k8s-ci-robot added sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. labels Sep 1, 2020
@hase1128
Author

hase1128 commented Sep 1, 2020

@dashpole @serathius

This KEP isn't finished yet (most of it is copied from the Slack discussion), but I wanted to let you know that I've created it.
I will be on vacation from 9/2 to 9/7, so I will add and revise the missing parts of this KEP after 9/8.

@hase1128
Author

hase1128 commented Sep 1, 2020

/assign @dashpole

@ehashman
Member

ehashman commented Sep 3, 2020

@hase1128 I believe this KEP is based on an older template that is missing a number of questions for production-readiness that need to be answered. Can you please rebase this based on the new template?

@ehashman
Member

ehashman commented Sep 3, 2020

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Sep 3, 2020
@serathius
Contributor

/cc @44past4

@k8s-ci-robot k8s-ci-robot requested a review from 44past4 September 7, 2020 09:09
@hase1128
Author

hase1128 commented Sep 8, 2020

@hase1128 I believe this KEP is based on an older template that is missing a number of questions for production-readiness that need to be answered. Can you please rebase this based on the new template?

@ehashman
I see. I will rebase later.

@hase1128 hase1128 mentioned this pull request Sep 8, 2020
@ehashman
Member

/retitle [WIP] Propose log tracking KEP

Please remove WIP when KEP is ready for review, thanks!

@k8s-ci-robot k8s-ci-robot changed the title Propose log tracking KEP [WIP] Propose log tracking KEP Sep 10, 2020
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 10, 2020
Member

@kikisdeliveryservice kikisdeliveryservice left a comment


Some format issues here:

  • the dir should be keps/sig-instrumentation/1668-log-tracking/

  • this file (20200901-log-tracking.md) should be renamed README.md

  • a kep.yaml file which will contain milestones, status and other info about the KEP should also be included in this dir

The templates that you'd use for your KEP can be found here:
https://github.com/kubernetes/enhancements/tree/master/keps/NNNN-kep-template

Also, as a reminder to be included in a release:

  • The KEP must be merged in an implementable state
  • The KEP must have test plans
  • The KEP must have graduation criteria.

As an example please see: https://github.com/kubernetes/enhancements/tree/master/keps/sig-instrumentation/1602-structured-logging

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Sep 18, 2020
@ehashman
Member

ehashman commented Jan 4, 2021

@dashpole @serathius and related people

We have updated our KEP. Could you please review it?
(This KEP currently has a PRR test error; I will check the reason. If you know what to do, please let me know...)

PRR approval is now required for the 1.21 milestone. Without it, this KEP cannot merge. This is new and I don't think anyone has the correct OWNERS/directory set up; I will follow up with the release team.

We will also need at least one non-Google approver for this PR; currently all listed reviewers/approvers appear to be employed at Google. Can you please work to identify one?

@hase1128
Author

hase1128 commented Jan 5, 2021

@ehashman

Thanks for the comments and the kind follow-up.

We will also need at least one non-Google approver for this PR; currently all listed reviewers/approvers appear to be employed at Google. Can you please work to identify one?

I would be very happy if you could be one of the approvers.

@ehashman
Member

ehashman commented Jan 6, 2021

I suppose it has to be either myself or @brancz :)

I sent an email about the new mandatory PRR review to k-dev; it seems to have flown under people's radars. We should be discussing this at the leads meetings next week.

@hase1128
Author

hase1128 commented Jan 7, 2021

I suppose it has to be either myself or @brancz :)

Thank you. When it is time for final approval of the KEP design, let me ask one of you for approval.

I sent an email about the new mandatory PRR review to k-dev; it seems to have flown under people's radars. We should be discussing this at the leads meetings next week.

I will continue to pay attention to the PRR situation.

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 8, 2021
@k8s-ci-robot
Contributor

@hase1128: The following test failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| ------ | ------ | ------ | ------ |
| pull-enhancements-verify | 34d8f47 | link | /test pull-enhancements-verify |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@fenggw-fnst

We have updated the "3. How to log the metadata from a context.Context" section and added an example. PTAL, thanks.
@pohly @dashpole

| Metadata | Description |
| ------ | ------ |
| traceid | spans a user request. unique for a user's request |
| spanid | spans a controller action. unique for controller actions |
| objsetid | spans the entire object lifecycle. unique for a related object set |
Contributor


What do you mean by an "object set" here? The first sentence refers to the object lifecycle, but we already have a unique identifier for objects: uid. What's the difference here?


One related object set may include a Deployment, a ReplicaSet, and some Pods; these objects share the same objsetid. Below is an example of a group of related objects:

deployment:
  uid: D123456
  annotations:
    objsetid: D123456
    traceid: t1234567890
    spanid: s123456
---
replicaset:
  uid: R123456
  annotations:
    objsetid: D123456  # same as the deployment's objsetid
    traceid: t0123456789
    spanid: s012345
---
pod:
  uid: P123456
  annotations:
    objsetid: D123456  # same as the deployment's objsetid
    traceid: t9012345678
    spanid: s501234

Contributor


Should it be a rootid? I have never heard the term "object set", so correct me if I'm wrong. I think we should look into other relations between objects that we also want to trace, for example PV/PVC. In that case the term "object set" is hard to apply.

| spanid | spans a controller action. unique for controller actions |
| objsetid | spans the entire object lifecycle. unique for a related object set |

### Opt-in for Components
Contributor


I don't think we should develop features that don't have a dedicated minimal target adoption. We should define the set of components we plan to change so that the changes are useful for users.


Do you mean that we should list the components that will be changed in this KEP doc?
Our purpose is to identify the specific logs related to a user request and object, so I think the relevant commands/components are kubectl apply/create/delete, deployment/replicaset/pod, and so on.

### Opt-in for Components

Updating/changing the logs is outside the scope of this KEP. To actually add metadata to K8s component logs, the following procedures are needed in addition.
- Open issues for each component, and discuss them with the SIGs that own that component.
Contributor


We cannot push work onto other SIGs and expect that they will do it on the basis that it's important for sig-instrumentation. We should scope the work to a set of components that allows it to be beneficial and finishable by sig-instrumentation itself.


For our logging implementation, I think we can just open issues for each component and discuss them with the SIGs that own that component. Do you consider this plan "pushing work onto other SIGs"?
Should we open a new logging KEP and discuss it per component?

Member


It's generally easier to target SIGs individually, since KEPs can easily balloon (causing unnecessary bikeshedding). Insofar as practically applying @serathius' suggestion, it may make sense to separate 'core functionality', i.e. library code for tracing, from the component integration story.


Tracking logs across Kubernetes components related to a specific user operation and its objects is very tough work.
It is necessary to match logs, basically using timestamps and object names as hints.
If multiple users send many API requests at the same time, it is very difficult to track logs across each Kubernetes component's log.
Contributor


I understand that this is a currently unsolved problem, but could you provide some info on why this problem is worth solving? Why is tracking a user operation important? Examples: debugging, auditing, reproducing problems?

### Goals

- Implement a method which propagates new logging metadata among K8s components
- Store and update logging metadata in object annotations through a mutating admission controller (aka webhook)
Contributor


You put some implementation details (like object annotations and the mutating admission controller) in the goals. I understand that we would like to use them to achieve the goal of adding metadata, but should their usage be a goal per se?

I think a goal should stay at a level that describes what result we expect from the user/developer point of view.


Thanks, we will update.


### Non-Goals

- Add new logging metadata into actual K8s component logs
Contributor


If the goal of this KEP is introducing trace information into logs, then we shouldn't skip this process. If the main benefit of the KEP is introducing this information, then the KEP should include this work, possibly in reduced scope, but enough to provide a benefit.

For example, the structured logging KEP's goal was not to rewrite all logs, but to provide an interface and use it in enough places to provide benefit (25 log line changes were enough to improve 99.9% of log output). Maybe the same approach would work here?


As you said, we are now considering separating this into tracing and logging. If we separate them, I think "Add new logging metadata into actual K8s component logs" is a non-goal. In the future, we want to carry out this work.

Thank you for your suggestion about the approach you took in structured logging. It may help our work.


- Add new logging metadata into actual K8s component logs
- This task will be done by opening issues after completing this KEP
- To centrally manage the logs of each Kubernetes component with objsetid (this can be realized with existing OSS such as Kibana, so there is no need to implement it in Kubernetes components).
Contributor


I don't understand this point. What do you mean by "manage the logs of each component"?


This means that we will not implement features (log analysis/query/visualization) that can be realized with existing OSS such as the ELK stack. We should write this in more detail.

- This proposal does not add additional telemetry to any components, just context-based metadata to existing logs. This KEP doesn't require running any additional OpenTelemetry components (such as the OpenTelemetry collector, which the [API Server Tracing](https://github.com/kubernetes/enhancements/issues/647) KEP uses)

## Background
It's known that all requests to K8s reach the API Server first. In order to propagate this metadata to the mutating admission controller, we require the API Server to support propagating it as well. Fortunately, there is already a KEP, [API Server Tracing](https://github.com/kubernetes/enhancements/issues/647), that does this.
Contributor


Is this up to date with the latest changes to the API Server Tracing KEP?

That KEP covers passing request trace information, but from your description you still assume object tracing? Are we sure that this is compatible? I expect OpenTelemetry to utilize HTTP request headers to pass request trace information, and this approach is not K8s-object aware.

@dashpole to confirm

Contributor


API Server tracing will propagate trace context from incoming requests to outgoing requests, including to etcd, re-entrant apiserver calls, and to webhooks. This section isn't saying that the API server will propagate context to objects, just that the API server will propagate context to a webhook.

Contributor


Sure, I was just confused as the Summary section states "This KEP proposes a method for adding new three unique logging metadata into annotation of objects and propagating them across components", which is not compatible with what's described here.

In this case, volume creation on the storage side succeeded, but the mount process into the pod's container may have failed.
In order to identify the cause, it is necessary to look for the problem area while checking the component (kubelet) logs as well as the system-side syslog and mount-related settings.

This log tracking feature is useful to identify the logs related to a specific user operation and cluster processing, and can reduce the investigation cost in such cases.
Contributor


This story describes the need to identify all logs related to a PV across multiple Kubernetes components. Is object tracing needed in such a case? I think having an annotation/label referencing the PV name would be enough to easily find kubelet logs referring to the PV, wouldn't it? Adding trace information would allow us to separate requests from each other, but I don't think it's needed in this case.

This log tracking feature is useful to identify the logs related to a specific user operation and cluster processing, and can reduce the investigation cost in such cases.
### Summary of Cases

- Given a component log (such as an error log), find the API request that caused this (error) log.
Contributor


I don't think you can get back the API request this way; you will only get a request id. Possibly it would be useful to be able to correlate component logs with audit logs, which store the original request content.


I'm sorry, the current sentence was misleading.
It meant that we can easily trace back to the log messages closest to the API request, and then find the API request via audit logs, as you said, or in some other way.

Contributor

@serathius serathius left a comment


I see two main threads in this KEP:

  • object tracing, picked up after the original Tracing KEP reduced its scope
  • adding context information to logs

Both are deep topics that can lead to a lot of discussion, making it hard to move forward with the design.

I think focusing this KEP on only one of those problems would help it move forward. The object tracing problem is mainly related to api-machinery: storing, passing, and retrieving trace information (complicated design, small implementation). On the other hand, context logging is mostly the design of a golang interface for the logging library and a change in how log calls are handled (from global klog methods to context-based ones); simple design, but complicated implementation/code migration.

As this KEP has put the logging implementation out of scope ("Add new logging metadata into actual K8s component logs" is in the non-goals), I would propose to focus this KEP on object tracing.

@ehashman
Member

Once @serathius's feedback is addressed I will give this a review.

@fenggw-fnst

We have rewritten the KEP and opened a new PR #2312 to focus only on trace propagation, as Marek suggested.


This KEP proposes a method for adding new three unique logging meta-data into K8s component logs.
It makes us more easy to identify specific logs related to an user request (such as `kubectl apply`) and object (such as Pod, Deployment).
It is expected to reduce investigation cost greatly when trouble shoothing.
Member


s/trouble shoothing/troubleshooting/

### Opt-in for Components

Updating/changing the logs is outside the scope of this KEP. To actually add metadata to K8s component logs, the following procedures are needed in addition.
- Open issues for each component, and discuss them with the SIGs that own that component.
Member


It's generally easier to target SIGs individually, since KEPs can easily balloon (causing unnecessary bikeshedding). Insofar as practically applying @serathius' suggestion, it may make sense to separate 'core functionality', i.e. library code for tracing, from the component integration story.

@fenggw-fnst

It's generally easier to target SIGs individually, since KEPs can easily balloon (causing unnecessary bikeshedding). Insofar as practically applying @serathius' suggestion, it may make sense to separate 'core functionality', i.e. library code for tracing, from the component integration story.

Thanks for the review; we have sent the new PR as @serathius suggested. PTAL, any comments would be appreciated.

@ehashman
Member

We have rewritten the KEP and opened a new PR #2312 to focus only on trace propagation, as Marek suggested.

Does that mean this should be closed in favour of #2312?

@fenggw-fnst

We have rewritten the KEP and opened a new PR #2312 to focus only on trace propagation, as Marek suggested.

Does that mean this should be closed in favour of #2312?

Yes, I think this one can be closed; new discussion should happen at #2312.

@ehashman
Member

Okay, I will close this PR.

@ehashman ehashman closed this Jan 28, 2021
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory ok-to-test Indicates a non-member PR verified by an org member that is safe to test. sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.