
Add KEP for Request ID #1348

Closed
wants to merge 9 commits

Conversation

@hase1128 commented Nov 1, 2019

This PR adds a KEP.
I would like to discuss this KEP with the SIG Auth members.

@k8s-ci-robot (Contributor)

Welcome @hase1128!

It looks like this is your first PR to kubernetes/enhancements 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/enhancements has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Nov 1, 2019
@k8s-ci-robot (Contributor)

Hi @hase1128. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Nov 1, 2019
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: hase1128
To complete the pull request process, please assign enj
You can assign the PR to them by writing /assign @enj in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/auth Categorizes an issue or PR as relevant to SIG Auth. labels Nov 1, 2019
@k8s-ci-robot k8s-ci-robot requested review from enj and tallclair November 1, 2019 11:23
@hase1128 (Author) commented Nov 1, 2019

/assign @enj

@mikedanese (Member)

cc @dashpole

@dashpole (Contributor) commented Nov 4, 2019

Hey @hase1128, it looks like we have some similar goals. #650 is very similar in that we want to propagate an ID (the trace ID in that case) to all of the controllers so they can emit traces associated with the initial request. While #650 is about tracing, I agree that logging would benefit from being able to associate with a request as well.

Out of curiosity, what was your plan to send the "Request ID" to the kubelet?

@hase1128 (Author) commented Nov 6, 2019

@dashpole
Thank you for the information about the similar KEP.
I am reviewing #650 now, and I am thinking that I would like to collaborate with the existing KEP.
I may work in #650, or I may continue with this KEP (if this KEP adds something beyond #650).
It depends on the differences between the two KEPs, I think.

Out of curiosity, what was your plan to send the "Request ID" to the kubelet?

My current plan is to match the k8s logs related to a single API request by referring to ID information (e.g. the audit-id in the audit log, the UUID of the resource, the user-id, etc.) that is written automatically into each k8s log while an API request is processed.
However, I have not yet confirmed what kind of ID information is actually recorded in each k8s log in the current k8s implementation.
If the kubelet log does not contain any useful ID information today, I will need to consider another plan.

@hase1128 (Author) commented Nov 7, 2019

@dashpole
I checked #650 (though I didn't read every comment, because the thread is quite long...).
#650 seems to have been under discussion in provisional status for about a year.

I think it is better to start with a small feature and a simple implementation,
and then develop additional functionality step by step.

For example, we may be able to match the k8s logs related to a single API request from a user using only existing k8s functionality:

  • The audit log may contain the user's API request and its response information
    (e.g. the API request includes username, uid, source IP, request URL, etc.; the API response includes the uuid, etc.)
  • Some k8s logs may contain that uuid.

So we may be able to match logs by using the uuid (though we can only match logs that include the uuid).
We also get simple latency information, because each k8s log line has a timestamp.
This means we can compute the time delta between logs after matching the related k8s logs by uuid (see the sketch below).
Could this satisfy a minimum requirement for part of the #650 goals?
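To make the idea concrete, here is a minimal, purely illustrative sketch of that matching step: it groups already-parsed log records by uuid and computes the time delta within each group. The `record` struct, its field names, and the sample entries are assumptions made for this example, not an existing Kubernetes log format.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// record is a hypothetical, already-parsed log entry; real component logs
// would first need to be parsed, and only some lines carry a uuid at all.
type record struct {
	Timestamp time.Time
	UID       string
	Message   string
}

func main() {
	base := time.Date(2019, 11, 7, 10, 0, 0, 0, time.UTC)
	logs := []record{
		{base, "uid-123", "apiserver: create Pod accepted"},
		{base.Add(150 * time.Millisecond), "uid-123", "scheduler: Pod bound to node"},
		{base.Add(900 * time.Millisecond), "uid-123", "kubelet: Pod started"},
		{base.Add(20 * time.Millisecond), "uid-456", "apiserver: create ConfigMap accepted"},
	}

	// Group records by uuid; records without a uuid cannot be matched this way.
	byUID := map[string][]record{}
	for _, r := range logs {
		byUID[r.UID] = append(byUID[r.UID], r)
	}

	// For each uuid, sort by timestamp and report how many lines matched and
	// the time delta between the first and last matched line.
	for uid, rs := range byUID {
		sort.Slice(rs, func(i, j int) bool { return rs[i].Timestamp.Before(rs[j].Timestamp) })
		delta := rs[len(rs)-1].Timestamp.Sub(rs[0].Timestamp)
		fmt.Printf("%s: %d matched log lines, span %v\n", uid, len(rs), delta)
	}
}
```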

I would like to collaborate with you to move toward our goals.
However, I don't have a clear picture of your current status, so I would like to talk by phone (Zoom or Skype) if you are interested in my idea.
I would like to understand the current status and discuss the future direction.
(BTW, I am going to attend KubeCon San Diego, so we could also talk face to face.)

@dashpole (Contributor) commented Nov 7, 2019

It sounds like your interest is primarily in audit logs, which don't require any sort of context propagation. I didn't work on #650 for most of this year, and picked it up again in the last few months. I don't think matching by uuid is what I'm looking for, but it sounds like it meets your needs. I will be at KubeCon as well, and will be presenting on #650 in the SIG Instrumentation deep dive.

@brancz (Member) commented Nov 8, 2019

I agree with the general goal here, but, like @dashpole, I think this is best solved as part of the existing KEP that is already being actively worked on.

@hase1128 (Author) commented Nov 11, 2019

@brancz
Understood.
I would like to incorporate this KEP or the existing KEP into RHOCP in the future.
Given the current situation, it is better for me to cooperate with the existing KEP and consider how I can contribute.

@hase1128 (Author) commented Nov 12, 2019

@dashpole

I will be at kubecon as well, and will be presenting on #650 in the sig-instrumentation deep dive.

I'm going to attend your deep dive session. I would like to cooperate with you on #650.
So I will attend your session first and understand the details of #650.
After that, I will consider how I can contribute. If you like, I'd also like to talk briefly with you after the session.

@logicalhan (Member)

/sig instrumentation
/cc @brancz

@k8s-ci-robot k8s-ci-robot requested a review from brancz November 21, 2019 00:21
@k8s-ci-robot k8s-ci-robot added the sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. label Nov 21, 2019
@hase1128 (Author)

@brancz
I would like to discuss the following at the next SIG Instrumentation meeting:
- How to proceed with KEP #1348
- The background of this discussion

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Dec 12, 2019
@sftim (Contributor) commented Feb 22, 2020

/retest

@k8s-ci-robot (Contributor)

@hase1128: The following test failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| pull-enhancements-verify | 1383a34 | link | /test pull-enhancements-verify |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@serathius (Contributor)

/assign


#### Case 1

In case an insecure or unauthorized operation happens, it is necessary to
Contributor

This seems more like it is in the scope of Audit Logging: https://kubernetes.io/docs/tasks/debug-application-cluster/audit/
Please be precise about the use cases that audit logs cannot solve.

Author

I added an additional description.

### Goals

- Adding a Request-ID into each K8s component log.
- The Request-ID is unique to an operation.
@serathius (Contributor) commented Mar 13, 2020

What is an operation here? A kubectl request, an Operator request, etc.?

Problem: most requests in a cluster are not made by a user, but by some controller working on the user's behalf.

Example: a user sends one request to create a Deployment. That leads to tens of requests to reconcile the Deployment's state: the Deployment creates a ReplicaSet, the ReplicaSet creates Pods, the Pods are scheduled, etc. Everything is a request here.

Author

Operation means a kubectl command. We would like to associate the logs produced by the klog calls that are made by functions and APIs while a kubectl command is processed. I added more explanation.

This KEP proposes adding new unique logging metadata to all Kubernetes logs. It makes it
easier to identify specific logs related to a single user operation (such as
`kubectl apply -f <my-pod.yaml>`). This feature is similar to
[Request-ID](https://docs.openstack.org/api-guide/compute/faults.html) for
@serathius (Contributor) commented Mar 13, 2020

OpenStack defines two types of Request-ID: a local one and a global one.

Based on the rest of the text, I guess you mean the Global Request ID. Please make that more explicit in the KEP.

I would propose renaming all references to "Request-ID" to "Trace-ID" to reduce naming confusion.

Author

Yes, it means the global request-id. I changed it to "Global request ID" in the sentence above.

Author

I would propose to rename all references to "Request-ID" to "Trace-ID" to reduce naming confusion.

OK, I will consider this and may update the KEP later.

- We just use the tracing code provided by the `Distributed Tracing` feature.
- We use the Trace-ID as the Request-ID (this does not interfere with the tracing code).

The target functions to which we add Request-ID code are listed in the following tables. The target functions include the klog calls made via important `kubectl` operations. These operations are listed in the [Migration / Graduation Criteria](#migration--graduation-criteria) section. As a result, we can propagate the Request-ID for important `kubectl` operations.
Contributor

What were the criteria for picking those and not others?

Contributor

What marks an important kubectl operation?

Author

I added the criteria and an explanation to the KEP.

Request-ID with structured format:

```go
klog.InfoS("Request-ID", trace-id, "Pod status updated", "pod", pod, "status", "ready")
```
Contributor

"Request-ID should be a normal field, meaning it should be put after message "Pod status updated"

Author

OK, I put the request-id after the message.
BTW, what is a normal field? (Sorry, I'm not familiar with the field structure of klog in k8s...)
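For illustration, here is a minimal sketch of what such a call could look like with the request ID as an ordinary key/value field placed after the message, as suggested above. The `requestID` variable and its value are hypothetical; only the message-then-key/value convention of `klog.InfoS` comes from the review comment.

```go
package main

import (
	"flag"

	"k8s.io/klog/v2"
)

func main() {
	klog.InitFlags(flag.CommandLine)
	flag.Parse()
	defer klog.Flush()

	// requestID is an illustrative placeholder; in the proposal it would be
	// derived from the originating kubectl operation.
	requestID := "3f2a9c1e"

	// The message comes first, followed by key/value pairs; the request ID is
	// simply one more ordinary ("normal") field at the end.
	klog.InfoS("Pod status updated",
		"pod", "default/nginx", // the KEP example passes a Pod object; a string keeps this sketch self-contained
		"status", "ready",
		"requestID", requestID,
	)
}
```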


It is difficult for the support team at a k8s service provider to resolve an end user's problem quickly in the situation above. Therefore, we'd like to add a new identifier that is unique to each user operation. This feature is useful for the following use cases:

#### Case 1
Contributor

It would help readers to give these cases summaries. This one would probably be something like:
"Given an API request, find the resulting component logs."

The one below would be something like:
"Given a component log, find the API request."

Author

I added the case summaries.

- Adding a Request-ID to each K8s component log.
- The Request-ID is unique to a kubectl operation.
- (One kubectl operation by a user causes multiple API requests and klog calls; the Request-ID has the same value across these klog calls.)
- Control over enabling/disabling the Request-ID feature (disabled by default to avoid impacting existing users).
Contributor

Usually the process of moving from:
disabled by default -> enabled by default -> always enabled
maps to
alpha -> beta -> GA

Unless you are proposing something outside of this, it probably doesn't need to be mentioned in goals.

Author

I think you are referring to a feature gate.
I considered whether a feature gate could satisfy our objective, but I'd like to allow the user to adjust the trade-off with log size (similar to the concept of a log level), so I'd like to implement a control for the request-id feature.

- To centrally manage the logs of each Kubernetes component with the Request-ID (this can
be realized with existing OSS such as Kibana, so there is no need to implement it in Kubernetes
components).
- We don't associate a Request-ID with all operations (our target is important operations such as `kubectl create/delete/etc.`).
Contributor

Maybe be more precise here: you want to add it to write operations, as those are the ones that cause changes in the system.

Author

You are right. I added more explanation.


## Summary

This KEP proposes adding new unique logging metadata to all Kubernetes logs. It makes us
Contributor

It looks like you have a well-defined list below of the places where you would like to add this log. Maybe say here that you are adding metadata to "key" Kubernetes logs, rather than "all" logs. "All" logs would be an enormous undertaking and may not be realistic.

Author

Yes, I modified the word "all".

| ------ | ------ |
| --request-id=0 | Request-ID feature is disabled (Default) |
| --request-id=1 | Request-ID feature is enabled, and Request-ID is added to klogs related to the `Alpha` target operations |
| --request-id=2 | Request-ID feature is enabled, and Request-ID is added to klogs related to the `Alpha and Beta` target operations |
Contributor

Interesting. This is the equivalent of verbosity for adding this info to logs. How do you plan to propagate the request id value here to controllers? Will it also have to be stored in an annotation and propagated the same way the SpanContext is for tracing?

Author

I am still considering the method, and I don't have the best idea yet.
Currently, I don't think we would use the same mechanism as the SpanContext. One idea is to add a parameter to component-base, the same approach as structured logging (--logging-format). However, I think this requires a restart of each K8s component whenever the flag value changes. I am currently investigating a method that applies to all components and does not need a restart.

Author

@serathius
In your structured logging KEP, you are going to implement the following structure:

"LoggingConfig structure should be implemented as part of k8s.io/component-base options and include a common set of flags for logger initialization."

Can we add a --request-id parameter to this structure? Since the --request-id parameter is also related to logging, I think it would be simpler to include it in this structure.
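For illustration only, here is a rough sketch of what adding such an option to a component-base-style logging options struct might look like. The `LoggingOptions` type, its fields, and the `--request-id` semantics shown here are hypothetical stand-ins for discussion, not the real k8s.io/component-base API or an agreed design.

```go
package main

import (
	"flag"
	"fmt"
)

// LoggingOptions is a hypothetical stand-in for a component-base style
// logging configuration struct; it is not the real k8s.io/component-base API.
type LoggingOptions struct {
	Format    string // e.g. "text" or "json", as in the structured logging KEP
	RequestID int    // 0 = disabled, 1 = alpha targets, 2 = alpha+beta targets (per the table above)
}

// AddFlags registers the logging flags on the given flag set.
func (o *LoggingOptions) AddFlags(fs *flag.FlagSet) {
	fs.StringVar(&o.Format, "logging-format", o.Format, "Log output format.")
	fs.IntVar(&o.RequestID, "request-id", o.RequestID, "Request-ID level (hypothetical flag).")
}

func main() {
	opts := &LoggingOptions{Format: "text"}
	opts.AddFlags(flag.CommandLine)
	flag.Parse()
	fmt.Printf("format=%s request-id=%d\n", opts.Format, opts.RequestID)
}
```

One drawback, as noted above, is that a flag set this way is only re-read when the component restarts.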

Author

@serathius
I have additional questions.

A set of klog flags is included in the klog package. Is the LoggingConfig structure related to these klog flags?
If there is a component that does not use logger initialization via component-base, do we need to implement this separately for such a component?

Contributor

Ok, I think I understand the proposal now. I'll defer to @serathius, but I don't think we want to introduce a new verbosity level for each piece of metadata we are considering adding to logs.

If we are very concerned about this adding to log size, we might instead prioritize adding this field to lower (>4) verbosity logs. That way, we don't log this field very often by default, but turning up the verbosity adds many logs that contain the field.
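A minimal sketch of that alternative: gating the extra field behind klog's existing verbosity mechanism instead of a new flag. The chosen threshold (-v=5), the `requestID` value, and the field name are illustrative assumptions.

```go
package main

import (
	"flag"

	"k8s.io/klog/v2"
)

func main() {
	klog.InitFlags(flag.CommandLine)
	flag.Parse()
	defer klog.Flush()

	requestID := "3f2a9c1e" // illustrative value only

	// Always emitted: the normal message without the extra field.
	klog.InfoS("Pod status updated", "pod", "default/nginx", "status", "ready")

	// Only emitted when the component runs with -v=5 or higher: the same event
	// enriched with the request ID, so the field appears only at high verbosity.
	klog.V(5).InfoS("Pod status updated", "pod", "default/nginx", "status", "ready", "requestID", requestID)
}
```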

Author

Thanks for the comment.
I will wait for a reply from @serathius. I understand your opinion, because the --request-id parameter's effect is similar to the existing verbosity level (so it may make the parameters difficult and confusing for users).

If we use the existing verbosity flag (--v), I want to consider at which level we should add the request-id.

- Adding a Request-ID to each K8s component log.
- The Request-ID is unique to a kubectl operation.
- (One kubectl operation by a user causes multiple API requests and klog calls; the Request-ID has the same value across these klog calls.)
- Control over enabling/disabling the Request-ID feature (disabled by default to avoid impacting existing users).
Contributor

Can you document what the "impact" you are referring to here is?

Author

I added an explanation of the "impact".

except for the following points.

- Audit only collects information about HTTP requests received and responses sent by the kube-apiserver, so it can't track the internal work of each component.
- Audit logs can't be associated with the logs related to a user operation (kubectl operation), because the auditID is different for each HTTP request.
Contributor

Please also note that context-based logging doesn't have nearly the same degree of security that audit logs have. The context propagation described in the tracing proposal doesn't guarantee that all actions taken by controllers will be tied back to the same trace id. In particular, if the trace context is overwritten, new spans/logs from components will be associated with the new trace context. We can't provide guarantees around the completeness of logs for a given trace id. Audit logging is also purposefully designed to be secure. Component logs do not provide the same protections.

Author

In particular, if the trace context is overwritten, new spans/logs from components will be associated with the new trace context.

Does this mean the following? (I am referring to your KEP.)

High-level processes, such as starting a pod or restarting a failed container, could be interrupted before completion by an update to the desired state. While this leaves a "partial" trace for the first process, it is the most accurate representation of the work and timing of reconciling desired and actual state.

Even if the logs from before the trace context is overwritten share one trace ID and the associated logs from after the overwrite have another trace ID, I think both sets of logs are valuable from a troubleshooting standpoint.

This KEP proposes adding new unique logging metadata to key Kubernetes logs (details are described in the [Design of Propagate Request-ID section](#design-of-propagate-request-id)). It makes it
easier to identify specific logs related to a single user operation (such as
`kubectl apply -f <my-pod.yaml>`). This feature is similar to
[Global request ID](https://docs.openstack.org/api-guide/compute/faults.html) for
Contributor

Can you add a Glossary section to the KEP and explain names like "Global request ID" in the context of this KEP?
It would really help if you used the same naming convention as the tracing KEP: https://github.com/kubernetes/enhancements/blob/master/keps/sig-instrumentation/0034-distributed-tracing-kep.md#definitions
It's hard to map the different terms used across the two KEPs.

This KEP proposes adding new unique logging metadata to key Kubernetes logs (details are described in the [Design of Propagate Request-ID section](#design-of-propagate-request-id)). It makes it
easier to identify specific logs related to a single user operation (such as
`kubectl apply -f <my-pod.yaml>`). This feature is similar to
[Global request ID](https://docs.openstack.org/api-guide/compute/faults.html) for
Contributor

Recent changes to the tracing KEP (https://github.com/kubernetes/enhancements/pull/1458/files) removed the idea of propagating traces through K8s controllers. This means you should rethink the idea of a Global request ID, as traces will now be very short (they will only cover client -> apiserver -> etcd).


Support team in k8s Service Provider

### Target User's objective
Contributor

With object tracing removed from the tracing KEP, this is no longer possible.

If the container is terminated by the OOM killer, there are cases where we can break the
issue down into parts (Pod or Kubernetes) from the OOM-killer-related messages in the host logs
and the API processing just before the OOM kill. If the cause is some unknown
pod creation, it is helpful to detect the root API request and who called this
Contributor

What is the root API request?

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 23, 2020
@hase1128 (Author) commented Sep 8, 2020

I am closing this KEP, because it has been replaced by the following KEP:
#1961

@hase1128 hase1128 closed this Sep 8, 2020
Labels
- cncf-cla: yes: Indicates the PR's author has signed the CNCF CLA.
- kind/kep: Categorizes KEP tracking issues and PRs modifying the KEP directory.
- lifecycle/stale: Denotes an issue or PR has remained open with no activity and has become stale.
- needs-ok-to-test: Indicates a PR that requires an org member to verify it is safe to test.
- sig/auth: Categorizes an issue or PR as relevant to SIG Auth.
- sig/instrumentation: Categorizes an issue or PR as relevant to SIG Instrumentation.
- size/L: Denotes a PR that changes 100-499 lines, ignoring generated files.