Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Trace context propagation #2312

Closed
wants to merge 13 commits into from

Conversation

zhijianli88
Copy link

@zhijianli88 zhijianli88 commented Jan 22, 2021

This KEP is a rewrite of Propose log tracking KEP based on this suggestion.

In brief, the previous KEP contains 2 topics: propagating trace context and adding context information to logs, which can lead to lot of discussion and making it hard to move forward, so rewrite the KEP to only focuses on a generic trace context propagating proposal as suggested.

Old PR #1961
Issue: #1668

zhijianli88 and others added 6 commits January 21, 2021 14:27
Signed-off-by: Li Zhijian <[email protected]>
Signed-off-by: Li Zhijian <[email protected]>
Signed-off-by: Guangwen Feng <[email protected]>
Signed-off-by: Guangwen Feng <[email protected]>
Signed-off-by: Li Zhijian <[email protected]>
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jan 22, 2021
@k8s-ci-robot
Copy link
Contributor

Welcome @zhijianli88!

It looks like this is your first PR to kubernetes/enhancements 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/enhancements has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot
Copy link
Contributor

Hi @zhijianli88. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory labels Jan 22, 2021
@k8s-ci-robot k8s-ci-robot added sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jan 22, 2021
@kikisdeliveryservice
Copy link
Member

Related to: #1668

@ehashman
Copy link
Member

/retitle Trace context propagation

@k8s-ci-robot k8s-ci-robot changed the title Trace context popagating Trace context propagation Jan 26, 2021
Copy link
Member

@ehashman ehashman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll try to take a look at this later this week.

keps/sig-instrumentation/1668-trace-popagating/kep.yaml Outdated Show resolved Hide resolved
@ehashman
Copy link
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jan 27, 2021
keps/sig-instrumentation/1668-trace-popagating/README.md Outdated Show resolved Hide resolved
keps/sig-instrumentation/1668-trace-popagating/README.md Outdated Show resolved Hide resolved
keps/sig-instrumentation/1668-trace-popagating/README.md Outdated Show resolved Hide resolved

### Out-of-tree changes

#### Mutating webhook
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also needs to address the problem of status updates. Status updates are writes, but don't indicate a change in desired state. For example, if we have the following sequence of events

  1. Create Deployment
  2. Update Deployment.Replicas from 1 -> 2
  3. Status update after creation

Since the status update (3) is done with the trace context from (1), it will overwrite the trace context from (2), which I don't think we want.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Status updates are writes, but don't indicate a change in desired state

Sorry, i cannot understand the problem you mentioned. I'm not very much clear about "status update", may i know which components make the 'status updates' request. I used to think(i hope) the only below 2 events happen.

1. Kubectl create deployment
deployment traceID: AAAA

replicaset traceID: AAAA

pod traceID: AAAA

2. Kunectl edit replicaset:1->2
deployment traceID: BBBB

replicaset traceID: BBBB

pod traceID: AAAA
(remain the same because the old pod is not recreated)

pod2 traceID: BBBB

I don't know when and how (3) happens. could you explain more details.

Copy link
Contributor

@dashpole dashpole Feb 4, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll try and be more specific--the kube-controller manager does the following (very simplified):

  1. Observe that a deployment is created
  2. Create a replicaset based on the deployment
  3. Observe the status of the replicaset, when it changes... update the deployment's status. For example, it might update the availableReplicas field after a replicaset replica becomes available.

There are a few approaches I can think of that we could take to solve this problem:

  1. Don't propagate context (using WithObject) to status update requests. This means status updates are not linked back to the initial kubectl call.
  2. Make the webhook NOT write the annotation if the update is a status update. For example, we could exclude all calls to subresources using the resources rule in the admission registration config.
  3. Propagate baggage in addition to the trace context in annotations. Add a key in baggage, e.g. k8s.io/nonmutating: true to indicate that an API call, like a status update, isn't changing the desired state of an object. Since baggage is propagated along with the trace context to the webhook, the webhook can skip writing the annotation when baggage contains our key above.
    We could provide a helper function to do this, similar to WithObject():
ctx = WithObject(ctx, obj)
ctx = WithMutating(ctx, false)
client.UpdateFooStatus(ctx, objstatus)

We could go even further, and move the WithMutating function inside UpdateFooStatus...

Its probably best to start with option 1 or 2, as they are simpler.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understood. Thank you so much for detailed explaination!
I prefer option 2, as it will bring less in-tree changes.

keps/sig-instrumentation/1668-trace-popagating/README.md Outdated Show resolved Hide resolved
keps/sig-instrumentation/1668-trace-popagating/README.md Outdated Show resolved Hide resolved
keps/sig-instrumentation/1668-trace-popagating/README.md Outdated Show resolved Hide resolved
@ehashman
Copy link
Member

/sig api-machinery

@k8s-ci-robot k8s-ci-robot added the sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. label Jan 28, 2021
@fenggw-fnst
Copy link

Here is 2 examples to show how trace contexts changes if we use above method.

Trace context propagation example for creation:

  1. Assume user A creates a Deployment, and user B concurrently patches it before controller reconcile:
apiVersion: apps/v1
kind: Deployment
metadata:
  generation: 2
  managedFields:
    - fieldsV1:
        f:metadata:
        f:spec:
      manager: kubectl-create
      time: "2021-07-07T12:00:00Z"
      traceContext: {"A"}
      traceGeneration: 1
    - fieldsV1:
        f:spec:
      manager: kubectl-patch
      time: "2021-07-07T12:00:00Z"
      traceContext: {"B"}
      traceGeneration: 2      
spec:
status:
  observedGeneration: 0
  1. Controller creates ReplicaSet according to the Deployment, propagates the traceContexts which the traceGeneration > observedGeneration, so the reconcile represents the result of both kubectl-create and kubectl-patch:
apiVersion: apps/v1
kind: ReplicaSet
metadata:
  generation: 1
  managedFields:
    - fieldsV1:
        f:metadata:
        f:spec:
      manager: kube-controller-manager
      time: "2021-07-07T12:00:01Z"
      traceContext: {"A", "B"}
      traceGeneration: 1
spec:
status:
  observedGeneration: 0
  1. When controller updates Deployment's status, similarly, propagates the traceContexts which the traceGeneration > observedGeneration, until the observedGeneration is updated (observedGeneration == generation == the greatest traceGeneration:
apiVersion: apps/v1
kind: Deployment
metadata:
  generation: 2
  managedFields:
    - fieldsV1:
        f:metadata:
        f:spec:
      manager: kubectl-create
      time: "2021-07-07T12:00:00Z"
      traceContext: {"A"}
      traceGeneration: 1
    - fieldsV1:
        f:spec:
      manager: kubectl-patch
      time: "2021-07-07T12:00:00Z"
      traceContext: {"B"}
      traceGeneration: 2
    - fieldsV1:
        f:status:
          f:observedGeneration: {}
      manager: kube-controller-manager
      time: "2021-07-07T12:00:01Z"
      traceContext: {"A", "B"}
      traceGeneration: 2
spec:
status:
  observedGeneration: 2
  1. Controller creates Pods according to the ReplicaSet, propagates the traceContexts which the traceGeneration > observedGeneration, then kubelet updates the status with the latest propagated traceContext:
apiVersion: v1
kind: Pod
metadata:
  managedFields:
    - fieldsV1:
        f:metadata:
        f:spec:
      manager: kube-controller-manager
      time: "2021-07-07T12:00:02Z"
      traceContext: {"A", "B"}
    - fieldsV1:
        f:status:
      manager: kubelet
      time: "2021-07-07T12:00:03Z"
      traceContext: {"A", "B"}
spec:
status:
  1. When controller updates ReplicaSet's status, it is similar to Deployment, propagates the traceContext which the traceGeneration > observedGeneration, until the observedGeneration is updated (observedGeneration == generation == the greatest traceGeneration):
apiVersion: apps/v1
kind: ReplicaSet
metadata:
  generation: 1
  managedFields:
    - fieldsV1:
        f:metadata:
        f:spec:
      manager: kube-controller-manager
      time: "2021-07-07T12:00:01Z"
      traceContext: {"A", "B"}
      traceGeneration: 1
    - fieldsV1:
        f:status:
          f:observedGeneration: {}
      manager: kube-controller-manager
      time: "2021-07-07T12:00:02Z"
      traceContext: {"A", "B"}
      traceGeneration: 1
spec:
status:
  observedGeneration: 1
  1. After desire has been reconciled (observedGeneration == generation == the greatest traceGeneration), controller updates status of Deployment and ReplicaSet with the latest traceContext:
apiVersion: apps/v1
kind: Deployment
metadata:
  generation: 2
  managedFields:
    - fieldsV1:
        f:metadata:
        f:spec:
      manager: kubectl-create
      time: "2021-07-07T12:00:00Z"
      traceContext: {"A"}
      traceGeneration: 1
    - fieldsV1:
        f:spec:
      manager: kubectl-patch
      time: "2021-07-07T12:00:00Z"
      traceContext: {"B"}
      traceGeneration: 2
    - fieldsV1:
        f:status:
          f:observedGeneration: {}
      manager: kube-controller-manager
      time: "2021-07-07T12:00:01Z"
      traceContext: {"A", "B"}
      traceGeneration: 2
    - fieldsV1:
        f:status:
      manager: kube-controller-manager
      time: "2021-07-07T12:00:03Z"
      traceContext: {"A", "B"}
      traceGeneration: 2
spec:
status:
  observedGeneration: 2
apiVersion: apps/v1
kind: ReplicaSet
metadata:
  generation: 1
  managedFields:
    - fieldsV1:
        f:metadata:
        f:spec:
      manager: kube-controller-manager
      time: "2021-07-07T12:00:01Z"
      traceContext: {"A", "B"}
      traceGeneration: 1
    - fieldsV1:
        f:status:
          f:observedGeneration: {}
      manager: kube-controller-manager
      time: "2021-07-07T12:00:02Z"
      traceContext: {"A", "B"}
      traceGeneration: 1
    - fieldsV1:
        f:status:
      manager: kube-controller-manager
      time: "2021-07-07T12:00:03Z"
      traceContext: {"A", "B"}
      traceGeneration: 1
spec:
status:
  observedGeneration: 1

Trace context propagation example for update:

  1. Assume user A scales Deployment's replica from 4 to 8, and user B concurrently edits it to a invalid image before controller reconcile:
apiVersion: apps/v1
kind: Deployment
metadata:
  generation: 4
  managedFields:
    - fieldsV1:
        f:metadata:
        f:spec:
      manager: kubectl-create
      time: "2021-07-07T12:00:00Z"
      traceContext: {"A"}
      traceGeneration: 1
    - fieldsV1:
        f:spec:
      manager: kubectl-patch
      time: "2021-07-07T12:00:00Z"
      traceContext: {"B"}
      traceGeneration: 2
    - fieldsV1:
        f:status:
          f:observedGeneration: {}
      manager: kube-controller-manager
      time: "2021-07-07T12:00:01Z"
      traceContext: {"A", "B"}
      traceGeneration: 2
    - fieldsV1:
        f:status:
      manager: kube-controller-manager
      time: "2021-07-07T12:00:03Z"
      traceContext: {"A", "B"}
      traceGeneration: 2
    - fieldsV1:
        f:spec:
          f:replicas: {}
      manager: kubectl-scale
      time: "2021-07-07T12:01:00Z"
      traceContext: {"C"}
      traceGeneration: 3
    - fieldsV1:
        f:spec:
          f:template:
            f:spec:
              f:containers:
                k:{}:
                  f:image: {}
      manager: kubectl-edit
      time: "2021-07-07T12:01:00Z"
      traceContext: {"D"}
      traceGeneration: 4
spec:
status:
  observedGeneration: 2
  1. Controller creates a new ReplicaSet with Deployment's traceContexts of traceGeneration > observedGeneration:
apiVersion: apps/v1
kind: ReplicaSet
metadata:
  generation: 1
  managedFields:
    - fieldsV1:
        f:metadata:
        f:spec:
      manager: kube-controller-manager
      time: "2021-07-07T12:01:01Z"
      traceContext: {"C", "D"}
      traceGeneration: 1
spec:
status:
  observedGeneration: 0
  1. Since it has been set that replicas == 8, controller will also scale up the old ReplicaSet to 6 (assume default maxSurge and maxUnavailable) with Deployment's traceContexts of traceGeneration > observedGeneration:
apiVersion: apps/v1
kind: ReplicaSet
metadata:
  generation: 2
  managedFields:
    - fieldsV1:
        f:metadata:
        f:spec:
      manager: kube-controller-manager
      time: "2021-07-07T12:00:01Z"
      traceContext: {"A", "B"}
      traceGeneration: 1
    - fieldsV1:
        f:status:
          f:observedGeneration: {}
      manager: kube-controller-manager
      time: "2021-07-07T12:00:02Z"
      traceContext: {"A", "B"}
      traceGeneration: 1
    - fieldsV1:
        f:status:
      manager: kube-controller-manager
      time: "2021-07-07T12:00:03Z"
      traceContext: {"A", "B"}
      traceGeneration: 1
    - fieldsV1:
        f:spec:
          f:replicas: {}
      manager: kube-controller-manager
      time: "2021-07-07T12:01:01Z"
      traceContext: {"C", "D"}
      traceGeneration: 2
spec:
status:
  observedGeneration: 1
  1. After controller has reconciled the intents for Deployment, it will also update the its status with the traceContexts of traceGeneration > observedGeneration:
apiVersion: apps/v1
kind: Deployment
metadata:
  generation: 4
  managedFields:
    - fieldsV1:
        f:metadata:
        f:spec:
      manager: kubectl-create
      time: "2021-07-07T12:00:00Z"
      traceContext: {"A"}
      traceGeneration: 1
    - fieldsV1:
        f:spec:
      manager: kubectl-patch
      time: "2021-07-07T12:00:00Z"
      traceContext: {"B"}
      traceGeneration: 2
    - fieldsV1:
        f:status:
      manager: kube-controller-manager
      time: "2021-07-07T12:00:03Z"
      traceContext: {"A", "B"}
      traceGeneration: 2
    - fieldsV1:
        f:spec:
          f:replicas: {}
      manager: kubectl-scale
      time: "2021-07-07T12:01:00Z"
      traceContext: {"C"}
      traceGeneration: 3
    - fieldsV1:
        f:spec:
          f:template:
            f:spec:
              f:containers:
                k:{}:
                  f:image: {}
      manager: kubectl-edit
      time: "2021-07-07T12:01:00Z"
      traceContext: {"D"}
      traceGeneration: 4
    - fieldsV1:
        f:status:
          f:observedGeneration: {}
      manager: kube-controller-manager
      time: "2021-07-07T12:01:01Z"
      traceContext: {"C", "D"}
      traceGeneration: 4
spec:
status:
  observedGeneration: 4
  1. For the new ReplicaSet, it will try to create 4 Pods at first (assume default maxSurge and maxUnavailable), because the image is invalid, actually Pods creation will be stuck. Controller will propagate the new ReplicaSet's traceContexts of traceGeneration > observedGeneration to these Pods, then kubelet updates the status with the latest propagated traceContext:
apiVersion: v1
kind: Pod
metadata:
  managedFields:
    - fieldsV1:
        f:metadata:
        f:spec:
      manager: kube-controller-manager
      time: "2021-07-07T12:01:02Z"
      traceContext: {"C", "D"}
    - fieldsV1:
        f:status:
      manager: kubelet
      time: "2021-07-07T12:01:03Z"
      traceContext: {"C", "D"}
spec:
status:
  1. Then controller will also update the ReplicaSet's status with the traceContexts of traceGeneration > observedGeneration:
apiVersion: apps/v1
kind: ReplicaSet
metadata:
  generation: 1
  managedFields:
    - fieldsV1:
        f:metadata:
        f:spec:
      manager: kube-controller-manager
      time: "2021-07-07T12:01:01Z"
      traceContext: {"C", "D"}
      traceGeneration: 1
    - fieldsV1:
        f:status:
          f:observedGeneration: {}
      manager: kube-controller-manager
      time: "2021-07-07T12:01:02Z"
      traceContext: {"C", "D"}
      traceGeneration: 1
spec:
status:
  observedGeneration: 1
  1. For the old ReplicaSet, it propagates the traceContexts of traceGeneration > observedGeneration to the Pods scaled up, then kubelet updates the status with the latest propagated traceContext:
apiVersion: v1
kind: Pod
metadata:
  managedFields:
    - fieldsV1:
        f:metadata:
        f:spec:
      manager: kube-controller-manager
      time: "2021-07-07T12:01:02Z"
      traceContext: {"C", "D"}
    - fieldsV1:
        f:status:
      manager: kubelet
      time: "2021-07-07T12:01:03Z"
      traceContext: {"C", "D"}
spec:
status:
  1. Controller will also update the old ReplicaSet status with the Deployment's traceContexts of traceGeneration > observedGeneration:
apiVersion: apps/v1
kind: ReplicaSet
metadata:
  generation: 2
  managedFields:
    - fieldsV1:
        f:metadata:
        f:spec:
      manager: kube-controller-manager
      time: "2021-07-07T12:00:01Z"
      traceContext: {"A", "B"}
      traceGeneration: 1
    - fieldsV1:
        f:status:
      manager: kube-controller-manager
      time: "2021-07-07T12:00:03Z"
      traceContext: {"A", "B"}
      traceGeneration: 1
    - fieldsV1:
        f:spec:
          f:replicas: {}
      manager: kube-controller-manager
      time: "2021-07-07T12:01:01Z"
      traceContext: {"C", "D"}
      traceGeneration: 2
    - fieldsV1:
        f:status:
          f:observedGeneration: {}
      manager: kube-controller-manager
      time: "2021-07-07T12:01:02Z"
      traceContext: {"C", "D"}
      traceGeneration: 2
spec:
status:
  observedGeneration: 2
  1. After observedGeneration == generation == the greatest traceGeneration, any further status update for Deployment and ReplicaSet will propagate traceContext with the latest time stamp.
apiVersion: apps/v1
kind: Deployment
metadata:
  generation: 4
  managedFields:
    - fieldsV1:
        f:metadata:
        f:spec:
      manager: kubectl-create
      time: "2021-07-07T12:00:00Z"
      traceContext: {"A"}
      traceGeneration: 1
    - fieldsV1:
        f:spec:
      manager: kubectl-patch
      time: "2021-07-07T12:00:00Z"
      traceContext: {"B"}
      traceGeneration: 2
    - fieldsV1:
        f:spec:
          f:replicas: {}
      manager: kubectl-scale
      time: "2021-07-07T12:01:00Z"
      traceContext: {"C"}
      traceGeneration: 3
    - fieldsV1:
        f:spec:
          f:template:
            f:spec:
              f:containers:
                k:{}:
                  f:image: {}
      manager: kubectl-edit
      time: "2021-07-07T12:01:00Z"
      traceContext: {"D"}
      traceGeneration: 4
    - fieldsV1:
        f:status:
          f:observedGeneration: {}
      manager: kube-controller-manager
      time: "2021-07-07T12:01:01Z"
      traceContext: {"C", "D"}
      traceGeneration: 4
    - fieldsV1:
        f:status:
      manager: kube-controller-manager
      time: "2021-07-07T12:01:03Z"
      traceContext: {"C", "D"}
      traceGeneration: 4
spec:
status:
  observedGeneration: 4

The new ReplicaSet:

apiVersion: apps/v1
kind: ReplicaSet
metadata:
  generation: 1
  managedFields:
    - fieldsV1:
        f:metadata:
        f:spec:
      manager: kube-controller-manager
      time: "2021-07-07T12:01:01Z"
      traceContext: {"C", "D"}
      traceGeneration: 1
    - fieldsV1:
        f:status:
          f:observedGeneration: {}
      manager: kube-controller-manager
      time: "2021-07-07T12:01:02Z"
      traceContext: {"C", "D"}
      traceGeneration: 1
    - fieldsV1:
        f:status:
      manager: kube-controller-manager
      time: "2021-07-07T12:01:03Z"
      traceContext: {"C", "D"}
      traceGeneration: 1
spec:
status:
  observedGeneration: 1

The old ReplicaSet:

apiVersion: apps/v1
kind: ReplicaSet
metadata:
  generation: 2
  managedFields:
    - fieldsV1:
        f:metadata:
        f:spec:
      manager: kube-controller-manager
      time: "2021-07-07T12:00:01Z"
      traceContext: {"A", "B"}
      traceGeneration: 1
    - fieldsV1:
        f:spec:
          f:replicas: {}
      manager: kube-controller-manager
      time: "2021-07-07T12:01:01Z"
      traceContext: {"C", "D"}
      traceGeneration: 2
    - fieldsV1:
        f:status:
          f:observedGeneration: {}
      manager: kube-controller-manager
      time: "2021-07-07T12:01:02Z"
      traceContext: {"C", "D"}
      traceGeneration: 2
    - fieldsV1:
        f:status:
      manager: kube-controller-manager
      time: "2021-07-07T12:01:03Z"
      traceContext: {"C", "D"}
      traceGeneration: 2
spec:
status:
  observedGeneration: 2

@dashpole
Copy link
Contributor

For 1, we can use generation and observedGeneration to find out the correct trace contexts for propagation.

In some objects such as Deployment or ReplicaSet, for every spec update, the generation will be incremented by 1, we can make use of this feature to add a similar field "traceGeneration" in managedFieldEntry to track the created/updated trace contexts. When controller reconciles multiple actors' intents, we propagate the trace contexts that are traceGeneration > observedGeneration.

IIUC, this is essentially the generation of that managedFields entry, and isn't necessarily specific to tracing. It would allow you to answer the question: Which fields have been changed, and are not reflected in the status?

The only potential problems I can see are:

  • observedGeneration is a convention, not a requirement for objects. So it may work for the built-in objects, but we can't use it generically.
  • If there are multiple controllers that operate on an object and update status, only one of them generally updates observedGeneration. One example is something like a pod readiness gate, where a controller can perform "setup" for a pod, such as configuring network, and then updates a pod condition to indicate that portion of setup is complete.
  • You say we will "propagate the trace contexts". The w3c spec for trace context only allows for a single one to be propagated. Would we be defining our own standard for this?

@fenggw-fnst
Copy link

IIUC, this is essentially the generation of that managedFields entry, and isn't necessarily specific to tracing. It would allow you to answer the question: Which fields have been changed, and are not reflected in the status?

Yes, this way we can find out the proper managedFields entries, where the trace contexts are what we need to propagate for the controller reconcile.

observedGeneration is a convention, not a requirement for objects. So it may work for the built-in objects, but we can't use it generically.

Yes, this method doesn't work for those custom resources or operators that don't implement observedGeneration, but I'm not sure whether this is an obstacle for advancing this proposal. Without observedGeneration I don't know how to reliably figure out the intents that haven't been reflected...

If there are multiple controllers that operate on an object and update status, only one of them generally updates observedGeneration. One example is something like a pod readiness gate, where a controller can perform "setup" for a pod, such as configuring network, and then updates a pod condition to indicate that portion of setup is complete.

Sorry, maybe I haven’t understood your concern correctly, could you please explain it in more detail?

  • In this idea, before observedGeneration be updated to generation, we use the trace contexts that traceGeneration > observedGeneration.
  • For the reconcile that update observedGeneration to generation, we use the trace contexts that traceGeneration > observedGeneration.
  • After observedGeneration == generation, we use the newest timestamp trace context.

Do you mean that there are other situations conflict with the above?

You say we will "propagate the trace contexts". The w3c spec for trace context only allows for a single one to be propagated. Would we be defining our own standard for this?

No, actually we just use trace-ids as a kind of global request-ids for logging, here by "propagate the trace contexts" I mean that we transfer the trace contexts as a set via baggage to apiserver, implement our own inject and extract methods, and write to the target objects via golang context. Is it acceptable to use opentelemetry's framework without really doing tracing?

@dashpole
Copy link
Contributor

Sorry, maybe I haven’t understood your concern correctly, could you please explain it in more detail?

Many objects have observedGeneration. But there isn't guaranteed to be a single controller operating on an object. If there are two controllers, A and B, and both look at the object's spec, and update different fields in the object's status, then observedGeneration will only be accurate for the controller that writes to observedGeneration. If A is the controller that updates observedGeneration, B might attach a different set of trace contexts depending on whether B operates before or after A updates observedGeneration.

by "propagate the trace contexts" I mean that we transfer the trace contexts as a set via baggage to apiserver, implement our own inject and extract methods, and write to the target objects via golang context.

This is just context propagation, and isn't necessarily about tracing. Are all trace contexts sent via baggage, or is one still sent via the trace context header?

@fenggw-fnst
Copy link

Many objects have observedGeneration. But there isn't guaranteed to be a single controller operating on an object. If there are two controllers, A and B, and both look at the object's spec, and update different fields in the object's status, then observedGeneration will only be accurate for the controller that writes to observedGeneration. If A is the controller that updates observedGeneration, B might attach a different set of trace contexts depending on whether B operates before or after A updates observedGeneration.

Ah, thanks, I see.
IMHO, although it depends on whether B operates before or after A updates observedGeneration, actually the trace contexts are the same.

  1. If B operates before A updates observedGeneration, then we choose the trace contexts that traceGeneration > observedGeneration.
  2. If B operates after A updates observedGeneration, then we choose the newest timestamp one, which is a set consisting of the trace contexts of 1.

This is just context propagation, and isn't necessarily about tracing. Are all trace contexts sent via baggage, or is one still sent via the trace context header?

Sorry, I may have been confused before, now I understand.
Yes, there should be one still sent via the trace context header. We use both the w3c spec trace context and the baggage. Since we have attached the trace contexts we need as a set, for the w3c trace context we just use any one randomly or the latest one.

@fenggw-fnst
Copy link

There are possible concerns about the traceGeneration idea:

  1. The cost could be expensive if requiring an extra write for each reconcile loop.
  2. Previously, we could just search logs for a single traceID string, but in order to solve the multi-writer problem, we decided to use managedFields, which allows multiple traceIDs in an object, now things got more complex, so we should figure out how one would reconstruct the "what happened for this request" flow.

For 1, I heard that there is another idea, which is to create an object to record the linked contexts, it works for the case that multiple traceIDs caused the reconcile, but it may also require extra write for each reconcile loop. In comparison, the traceGeneration idea, which aims for logging, simply propagate the trace contexts as a data set along managedFields mechanism, so I don't think this will cause too much cost.

For 2, in fact, things will not become more complex than before. Although there will be many traceIDs in an object, "which ones should be output to the log" depends on "which root request caused this log", I think the traceGeneration idea also works for this, i.e.: before reconcile we output the traceIDs that are traceGeneration > observedGeneration, after reconcile (means there is no traceGeneration bigger than observedGeneration), we just output the newest timestamp traceIDs. So that we can figure out "what happened for a request" by grepping its traceID, and also we can find out the root request when some errors occur.

@fenggw-fnst
Copy link

Here is an additional introduction on how the traceGeneration idea can be used for logging:

logEnhancementChartP1
logEnhancementChartP2
logEnhancementChartP3

@dashpole
Copy link
Contributor

A note on the traceGeneration idea:

When controller reconciles multiple actors' intents, we propagate the trace contexts that are traceGeneration > observedGeneration.

I'm not sure this always works. For example, a controller may fail to do something, but it still updates the observedGeneration in the status. The next reconcile will still be operating on the previous intent, so it wouldn't be correct to drop it.

Thanks for the diagrams. A few questions:

  1. Is there a way to bound the number of trace contexts attached to a single object? Two seems reasonable, but twenty seems excessive.
  2. Deployment -> replicaset -> pod forms a tree, with a single deployment makes multiple RS, which each make multiple pods. That means each action (e.g. create replicaset) is always the result of one object (e.g. the deployment). That makes your case simpler, because you only have to care about contexts from a single parent object. In cases where objects do not form a tree, it seems like we could easily get an explosion of contexts. E.g. kube-proxy, when it updates iptable rules on a node, does so based on all cluster ip services. In the worst case, you could have an inverse tree, where object A is based on multiple object Bs, each of which is based on multiple object Cs.

@fenggw-fnst
Copy link

Thanks for the review.

I'm not sure this always works. For example, a controller may fail to do something, but it still updates the observedGeneration in the status. The next reconcile will still be operating on the previous intent, so it wouldn't be correct to drop it.

The traceGeneration idea:

  1. If there are traceGeneration > observedGeneration, then we propagate the trace contexts meet this condition.
  2. If 1 is not satisfied, then we propagate the trace contexts with the latest timestamp.

In above example, 2 still works. If the controller fails to do something but still updates the observedGeneration, then this "status update" will propagate the trace contexts to the managedFieldsEntry that contains observedGeneration, which have the latest timestamp. In the next reconcile, those trace contexts can still be propagated correctly by 2.

  1. Is there a way to bound the number of trace contexts attached to a single object? Two seems reasonable, but twenty seems excessive.

Yes, we can improve our method like this:
When controller reconcile propagates trace contexts between different managedFieldsEntry (the managedFields containing meta/spec or status) within a single object, do not copy them but move them. (clean up old ones)

This will ensure that:

  • Before controller reconcile, there are the newest trace contexts attached to those managedFieldsEntry that contain metadata or spec, each managedFieldsEntry only contains one trace context representing one request, the number depends on how many requests act on an object at one time.
  • After controller reconcile, there is only one traceContext slice containing above trace contexts in the managedFieldsEntry that contains status. (the old traceContext will be overwritten if exists)

So the number only depends on how many concurrent requests act on the different part of an object. Sorry for my lack of production-level experience but I guess in most cases there is only one request in a time. Even if multiple writers case happens, the number of requests should not be that many. Even the worst case, the upper limit should be the number of the fields that can be modified in this object.

  1. Deployment -> replicaset -> pod forms a tree, with a single deployment makes multiple RS, which each make multiple pods. That means each action (e.g. create replicaset) is always the result of one object (e.g. the deployment). That makes your case simpler, because you only have to care about contexts from a single parent object. In cases where objects do not form a tree, it seems like we could easily get an explosion of contexts. E.g. kube-proxy, when it updates iptable rules on a node, does so based on all cluster ip services. In the worst case, you could have an inverse tree, where object A is based on multiple object Bs, each of which is based on multiple object Cs.

Thanks for pointing this out, I think the key point you said is that:

  • when propagating trace contexts across different resources, how we deal with the case of "controller reconcile is based on multiple objects" ?

we haven't considered this before, but I feel that this is out of scope of this proposal, since our main purpose is to track root request made by users in order to ease the investigation in log, so we mainly focus on the workload related resources that user usually act on.

@fenggw-fnst
Copy link

  1. Is there a way to bound the number of trace contexts attached to a single object? Two seems reasonable, but twenty seems excessive.

Yes, we can improve our method like this:

When coding PoC, we found that if we can drop the old trace contexts while persisting the new ones into an object, then we don't even need to use traceGeneration or timestamps to determine the trace contexts that need to be propagated, because for every reconcile, all the trace contexts in the object need to be propagated.

@ehashman
Copy link
Member

ehashman commented Sep 1, 2021

/unassign

@dashpole
Copy link
Contributor

dashpole commented Sep 9, 2021

The traceGeneration idea:

  1. If there are traceGeneration > observedGeneration, then we propagate the trace contexts meet this condition.
  2. If 1 is not satisfied, then we propagate the trace contexts with the latest timestamp.

In above example, 2 still works. If the controller fails to do something but still updates the observedGeneration, then this "status update" will propagate the trace contexts to the managedFieldsEntry that contains observedGeneration, which have the latest timestamp. In the next reconcile, those trace contexts can still be propagated correctly by 2.

This essentially "falls back" to only storing a single trace context when something fails. E.g. if two concurrent updates, A and B, are done and fail, any subsequent retries will only be attributed to the last update. When things are failing is probably when it would be most important for me to have these links.

The other thing is that when something is failing to reconcile, it may fail for a while. E.g. if a service has an outage for 10 minutes, it will fail a few times. This increases the probability that something will happen "concurrently", because the time window over which something is reconciled is expanded.

When controller reconcile propagates trace contexts between different managedFieldsEntry (the managedFields containing meta/spec or status) within a single object, do not copy them but move them. (clean up old ones)

IIUC, this essentially amounts to de-duplicating trace contexts (i.e. not having a single trace context present in multiple managedfields entries), which is useful, but i'm primarily concerned about cases where we have a large number of unique trace contexts. This would be the case where a controller action takes multiple unique objects into account.

Even the worst case, the upper limit should be the number of the fields that can be modified in this object.

That is my conclusion as well. But I don't think thats an acceptable upper bound...

we haven't considered this before, but I feel that this is out of scope of this proposal, since our main purpose is to track root request made by users in order to ease the investigation in log, so we mainly focus on the workload related resources that user usually act on.

What is the scope then? Just a few objects (e.g. deployment, replicaset, pod)? A fairly substantial change like this is hard to justify for a small number of objects.

@fenggw-fnst
Copy link

This essentially "falls back" to only storing a single trace context when something fails. E.g. if two concurrent updates, A and B, are done and fail, any subsequent retries will only be attributed to the last update. When things are failing is probably when it would be most important for me to have these links.

Sorry, it seems that I didn’t explain it clearly before, it's not "falls back" to only storing a single trace context.

For example, a controller may fail to do something, but it still updates the observedGeneration in the status.
E.g. if two concurrent updates, A and B, are done and fail

After "updates the observedGeneration in the status", A and B should have already been stored as a slice "A, B" in another managedFields (a status managedFieldsEntry), the "any subsequent retries" will propagate "A, B" rather than B.

The other thing is that when something is failing to reconcile, it may fail for a while. E.g. if a service has an outage for 10 minutes, it will fail a few times. This increases the probability that something will happen "concurrently", because the time window over which something is reconciled is expanded.

Thanks for pointing this out, yes, the possibility of concurrent requests is high in some cases. When something happen "concurrently", we are able to propagated all of them correctly. No matter A and B are done and fail, or not, there will be only 2 cases:

  1. A and B are still in the meta&spec managedFieldsEntry.
  2. A and B have been stored in the status managedFieldsEntry as "A, B" (observedGeneration updated).

For 1, A and B will be propagated.
For 2, "A, B" will be propagated.

IIUC, this essentially amounts to de-duplicating trace contexts (i.e. not having a single trace context present in multiple managedfields entries), which is useful,

Yes, in addition to the dedupe, actually we also remove the stale ones (the trace contexts that has been moved to the status managedFieldsEntry by status update) when there are new intents come in (i.e. new trace contexts be attached to the meta&spec managedFieldsEntry), therefore a long-running object will tend to have only 1 trace context.

but i'm primarily concerned about cases where we have a large number of unique trace contexts. This would be the case where a controller action takes multiple unique objects into account.
That is my conclusion as well. But I don't think thats an acceptable upper bound...

Although the number may reach the upper bound at some time, new requests will flush the old trace contexts, so there will not always be so many trace contexts in an object. This is not a solution for the upper bound, but can mitigate the large number of unique trace contexts. IMHO the upper bound concern is inevitable as long as we attach to managedFields...

@dashpole
Copy link
Contributor

After "updates the observedGeneration in the status", A and B should have already been stored as a slice "A, B" in another managedFields (a status managedFieldsEntry), the "any subsequent retries" will propagate "A, B" rather than B.

new requests will flush the old trace contexts, so there will not always be so many trace contexts in an object

These two seem contradictory to me. Either B will "flush" A (which may or may not be in-progress), or it won't. How do we determine if a new incoming trace ID will be added to the list of trace ids, or if it will replace the current list of trace ids on the object?

@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 19, 2021
@KobayashiD27
Copy link

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 20, 2021
@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 20, 2022
@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Apr 19, 2022
@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Copy link
Contributor

@k8s-triage-robot: Closed this PR.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.