-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Trace context propagation #2312
Conversation
Signed-off-by: Li Zhijian <[email protected]>
Signed-off-by: Li Zhijian <[email protected]>
Signed-off-by: Li Zhijian <[email protected]>
Signed-off-by: Guangwen Feng <[email protected]>
Signed-off-by: Guangwen Feng <[email protected]>
Signed-off-by: Li Zhijian <[email protected]>
Welcome @zhijianli88! |
Hi @zhijianli88. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Related to: #1668 |
/retitle Trace context propagation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll try to take a look at this later this week.
/ok-to-test |
Signed-off-by: Guangwen Feng <[email protected]>
|
||
### Out-of-tree changes | ||
|
||
#### Mutating webhook |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This also needs to address the problem of status updates. Status updates are writes, but don't indicate a change in desired state. For example, if we have the following sequence of events
- Create Deployment
- Update Deployment.Replicas from 1 -> 2
- Status update after creation
Since the status update (3) is done with the trace context from (1), it will overwrite the trace context from (2), which I don't think we want.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Status updates are writes, but don't indicate a change in desired state
Sorry, i cannot understand the problem you mentioned. I'm not very much clear about "status update", may i know which components make the 'status updates' request. I used to think(i hope) the only below 2 events happen.
1. Kubectl create deployment
deployment traceID: AAAA
replicaset traceID: AAAA
pod traceID: AAAA
2. Kunectl edit replicaset:1->2
deployment traceID: BBBB
replicaset traceID: BBBB
pod traceID: AAAA
(remain the same because the old pod is not recreated)
pod2 traceID: BBBB
I don't know when and how (3) happens. could you explain more details.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll try and be more specific--the kube-controller manager does the following (very simplified):
- Observe that a deployment is created
- Create a replicaset based on the deployment
- Observe the status of the replicaset, when it changes... update the deployment's status. For example, it might update the
availableReplicas
field after a replicaset replica becomes available.
There are a few approaches I can think of that we could take to solve this problem:
- Don't propagate context (using WithObject) to status update requests. This means status updates are not linked back to the initial kubectl call.
- Make the webhook NOT write the annotation if the update is a status update. For example, we could exclude all calls to subresources using the resources rule in the admission registration config.
- Propagate baggage in addition to the trace context in annotations. Add a key in baggage, e.g.
k8s.io/nonmutating: true
to indicate that an API call, like a status update, isn't changing the desired state of an object. Since baggage is propagated along with the trace context to the webhook, the webhook can skip writing the annotation when baggage contains our key above.
We could provide a helper function to do this, similar to WithObject():
ctx = WithObject(ctx, obj)
ctx = WithMutating(ctx, false)
client.UpdateFooStatus(ctx, objstatus)
We could go even further, and move the WithMutating function inside UpdateFooStatus...
Its probably best to start with option 1 or 2, as they are simpler.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Understood. Thank you so much for detailed explaination!
I prefer option 2, as it will bring less in-tree changes.
/sig api-machinery |
Here is 2 examples to show how trace contexts changes if we use above method. Trace context propagation example for creation:
apiVersion: apps/v1
kind: Deployment
metadata:
generation: 2
managedFields:
- fieldsV1:
f:metadata:
f:spec:
manager: kubectl-create
time: "2021-07-07T12:00:00Z"
traceContext: {"A"}
traceGeneration: 1
- fieldsV1:
f:spec:
manager: kubectl-patch
time: "2021-07-07T12:00:00Z"
traceContext: {"B"}
traceGeneration: 2
spec:
status:
observedGeneration: 0
apiVersion: apps/v1
kind: ReplicaSet
metadata:
generation: 1
managedFields:
- fieldsV1:
f:metadata:
f:spec:
manager: kube-controller-manager
time: "2021-07-07T12:00:01Z"
traceContext: {"A", "B"}
traceGeneration: 1
spec:
status:
observedGeneration: 0
apiVersion: apps/v1
kind: Deployment
metadata:
generation: 2
managedFields:
- fieldsV1:
f:metadata:
f:spec:
manager: kubectl-create
time: "2021-07-07T12:00:00Z"
traceContext: {"A"}
traceGeneration: 1
- fieldsV1:
f:spec:
manager: kubectl-patch
time: "2021-07-07T12:00:00Z"
traceContext: {"B"}
traceGeneration: 2
- fieldsV1:
f:status:
f:observedGeneration: {}
manager: kube-controller-manager
time: "2021-07-07T12:00:01Z"
traceContext: {"A", "B"}
traceGeneration: 2
spec:
status:
observedGeneration: 2
apiVersion: v1
kind: Pod
metadata:
managedFields:
- fieldsV1:
f:metadata:
f:spec:
manager: kube-controller-manager
time: "2021-07-07T12:00:02Z"
traceContext: {"A", "B"}
- fieldsV1:
f:status:
manager: kubelet
time: "2021-07-07T12:00:03Z"
traceContext: {"A", "B"}
spec:
status:
apiVersion: apps/v1
kind: ReplicaSet
metadata:
generation: 1
managedFields:
- fieldsV1:
f:metadata:
f:spec:
manager: kube-controller-manager
time: "2021-07-07T12:00:01Z"
traceContext: {"A", "B"}
traceGeneration: 1
- fieldsV1:
f:status:
f:observedGeneration: {}
manager: kube-controller-manager
time: "2021-07-07T12:00:02Z"
traceContext: {"A", "B"}
traceGeneration: 1
spec:
status:
observedGeneration: 1
apiVersion: apps/v1
kind: Deployment
metadata:
generation: 2
managedFields:
- fieldsV1:
f:metadata:
f:spec:
manager: kubectl-create
time: "2021-07-07T12:00:00Z"
traceContext: {"A"}
traceGeneration: 1
- fieldsV1:
f:spec:
manager: kubectl-patch
time: "2021-07-07T12:00:00Z"
traceContext: {"B"}
traceGeneration: 2
- fieldsV1:
f:status:
f:observedGeneration: {}
manager: kube-controller-manager
time: "2021-07-07T12:00:01Z"
traceContext: {"A", "B"}
traceGeneration: 2
- fieldsV1:
f:status:
manager: kube-controller-manager
time: "2021-07-07T12:00:03Z"
traceContext: {"A", "B"}
traceGeneration: 2
spec:
status:
observedGeneration: 2 apiVersion: apps/v1
kind: ReplicaSet
metadata:
generation: 1
managedFields:
- fieldsV1:
f:metadata:
f:spec:
manager: kube-controller-manager
time: "2021-07-07T12:00:01Z"
traceContext: {"A", "B"}
traceGeneration: 1
- fieldsV1:
f:status:
f:observedGeneration: {}
manager: kube-controller-manager
time: "2021-07-07T12:00:02Z"
traceContext: {"A", "B"}
traceGeneration: 1
- fieldsV1:
f:status:
manager: kube-controller-manager
time: "2021-07-07T12:00:03Z"
traceContext: {"A", "B"}
traceGeneration: 1
spec:
status:
observedGeneration: 1 Trace context propagation example for update:
apiVersion: apps/v1
kind: Deployment
metadata:
generation: 4
managedFields:
- fieldsV1:
f:metadata:
f:spec:
manager: kubectl-create
time: "2021-07-07T12:00:00Z"
traceContext: {"A"}
traceGeneration: 1
- fieldsV1:
f:spec:
manager: kubectl-patch
time: "2021-07-07T12:00:00Z"
traceContext: {"B"}
traceGeneration: 2
- fieldsV1:
f:status:
f:observedGeneration: {}
manager: kube-controller-manager
time: "2021-07-07T12:00:01Z"
traceContext: {"A", "B"}
traceGeneration: 2
- fieldsV1:
f:status:
manager: kube-controller-manager
time: "2021-07-07T12:00:03Z"
traceContext: {"A", "B"}
traceGeneration: 2
- fieldsV1:
f:spec:
f:replicas: {}
manager: kubectl-scale
time: "2021-07-07T12:01:00Z"
traceContext: {"C"}
traceGeneration: 3
- fieldsV1:
f:spec:
f:template:
f:spec:
f:containers:
k:{}:
f:image: {}
manager: kubectl-edit
time: "2021-07-07T12:01:00Z"
traceContext: {"D"}
traceGeneration: 4
spec:
status:
observedGeneration: 2
apiVersion: apps/v1
kind: ReplicaSet
metadata:
generation: 1
managedFields:
- fieldsV1:
f:metadata:
f:spec:
manager: kube-controller-manager
time: "2021-07-07T12:01:01Z"
traceContext: {"C", "D"}
traceGeneration: 1
spec:
status:
observedGeneration: 0
apiVersion: apps/v1
kind: ReplicaSet
metadata:
generation: 2
managedFields:
- fieldsV1:
f:metadata:
f:spec:
manager: kube-controller-manager
time: "2021-07-07T12:00:01Z"
traceContext: {"A", "B"}
traceGeneration: 1
- fieldsV1:
f:status:
f:observedGeneration: {}
manager: kube-controller-manager
time: "2021-07-07T12:00:02Z"
traceContext: {"A", "B"}
traceGeneration: 1
- fieldsV1:
f:status:
manager: kube-controller-manager
time: "2021-07-07T12:00:03Z"
traceContext: {"A", "B"}
traceGeneration: 1
- fieldsV1:
f:spec:
f:replicas: {}
manager: kube-controller-manager
time: "2021-07-07T12:01:01Z"
traceContext: {"C", "D"}
traceGeneration: 2
spec:
status:
observedGeneration: 1
apiVersion: apps/v1
kind: Deployment
metadata:
generation: 4
managedFields:
- fieldsV1:
f:metadata:
f:spec:
manager: kubectl-create
time: "2021-07-07T12:00:00Z"
traceContext: {"A"}
traceGeneration: 1
- fieldsV1:
f:spec:
manager: kubectl-patch
time: "2021-07-07T12:00:00Z"
traceContext: {"B"}
traceGeneration: 2
- fieldsV1:
f:status:
manager: kube-controller-manager
time: "2021-07-07T12:00:03Z"
traceContext: {"A", "B"}
traceGeneration: 2
- fieldsV1:
f:spec:
f:replicas: {}
manager: kubectl-scale
time: "2021-07-07T12:01:00Z"
traceContext: {"C"}
traceGeneration: 3
- fieldsV1:
f:spec:
f:template:
f:spec:
f:containers:
k:{}:
f:image: {}
manager: kubectl-edit
time: "2021-07-07T12:01:00Z"
traceContext: {"D"}
traceGeneration: 4
- fieldsV1:
f:status:
f:observedGeneration: {}
manager: kube-controller-manager
time: "2021-07-07T12:01:01Z"
traceContext: {"C", "D"}
traceGeneration: 4
spec:
status:
observedGeneration: 4
apiVersion: v1
kind: Pod
metadata:
managedFields:
- fieldsV1:
f:metadata:
f:spec:
manager: kube-controller-manager
time: "2021-07-07T12:01:02Z"
traceContext: {"C", "D"}
- fieldsV1:
f:status:
manager: kubelet
time: "2021-07-07T12:01:03Z"
traceContext: {"C", "D"}
spec:
status:
apiVersion: apps/v1
kind: ReplicaSet
metadata:
generation: 1
managedFields:
- fieldsV1:
f:metadata:
f:spec:
manager: kube-controller-manager
time: "2021-07-07T12:01:01Z"
traceContext: {"C", "D"}
traceGeneration: 1
- fieldsV1:
f:status:
f:observedGeneration: {}
manager: kube-controller-manager
time: "2021-07-07T12:01:02Z"
traceContext: {"C", "D"}
traceGeneration: 1
spec:
status:
observedGeneration: 1
apiVersion: v1
kind: Pod
metadata:
managedFields:
- fieldsV1:
f:metadata:
f:spec:
manager: kube-controller-manager
time: "2021-07-07T12:01:02Z"
traceContext: {"C", "D"}
- fieldsV1:
f:status:
manager: kubelet
time: "2021-07-07T12:01:03Z"
traceContext: {"C", "D"}
spec:
status:
apiVersion: apps/v1
kind: ReplicaSet
metadata:
generation: 2
managedFields:
- fieldsV1:
f:metadata:
f:spec:
manager: kube-controller-manager
time: "2021-07-07T12:00:01Z"
traceContext: {"A", "B"}
traceGeneration: 1
- fieldsV1:
f:status:
manager: kube-controller-manager
time: "2021-07-07T12:00:03Z"
traceContext: {"A", "B"}
traceGeneration: 1
- fieldsV1:
f:spec:
f:replicas: {}
manager: kube-controller-manager
time: "2021-07-07T12:01:01Z"
traceContext: {"C", "D"}
traceGeneration: 2
- fieldsV1:
f:status:
f:observedGeneration: {}
manager: kube-controller-manager
time: "2021-07-07T12:01:02Z"
traceContext: {"C", "D"}
traceGeneration: 2
spec:
status:
observedGeneration: 2
apiVersion: apps/v1
kind: Deployment
metadata:
generation: 4
managedFields:
- fieldsV1:
f:metadata:
f:spec:
manager: kubectl-create
time: "2021-07-07T12:00:00Z"
traceContext: {"A"}
traceGeneration: 1
- fieldsV1:
f:spec:
manager: kubectl-patch
time: "2021-07-07T12:00:00Z"
traceContext: {"B"}
traceGeneration: 2
- fieldsV1:
f:spec:
f:replicas: {}
manager: kubectl-scale
time: "2021-07-07T12:01:00Z"
traceContext: {"C"}
traceGeneration: 3
- fieldsV1:
f:spec:
f:template:
f:spec:
f:containers:
k:{}:
f:image: {}
manager: kubectl-edit
time: "2021-07-07T12:01:00Z"
traceContext: {"D"}
traceGeneration: 4
- fieldsV1:
f:status:
f:observedGeneration: {}
manager: kube-controller-manager
time: "2021-07-07T12:01:01Z"
traceContext: {"C", "D"}
traceGeneration: 4
- fieldsV1:
f:status:
manager: kube-controller-manager
time: "2021-07-07T12:01:03Z"
traceContext: {"C", "D"}
traceGeneration: 4
spec:
status:
observedGeneration: 4 The new ReplicaSet: apiVersion: apps/v1
kind: ReplicaSet
metadata:
generation: 1
managedFields:
- fieldsV1:
f:metadata:
f:spec:
manager: kube-controller-manager
time: "2021-07-07T12:01:01Z"
traceContext: {"C", "D"}
traceGeneration: 1
- fieldsV1:
f:status:
f:observedGeneration: {}
manager: kube-controller-manager
time: "2021-07-07T12:01:02Z"
traceContext: {"C", "D"}
traceGeneration: 1
- fieldsV1:
f:status:
manager: kube-controller-manager
time: "2021-07-07T12:01:03Z"
traceContext: {"C", "D"}
traceGeneration: 1
spec:
status:
observedGeneration: 1 The old ReplicaSet: apiVersion: apps/v1
kind: ReplicaSet
metadata:
generation: 2
managedFields:
- fieldsV1:
f:metadata:
f:spec:
manager: kube-controller-manager
time: "2021-07-07T12:00:01Z"
traceContext: {"A", "B"}
traceGeneration: 1
- fieldsV1:
f:spec:
f:replicas: {}
manager: kube-controller-manager
time: "2021-07-07T12:01:01Z"
traceContext: {"C", "D"}
traceGeneration: 2
- fieldsV1:
f:status:
f:observedGeneration: {}
manager: kube-controller-manager
time: "2021-07-07T12:01:02Z"
traceContext: {"C", "D"}
traceGeneration: 2
- fieldsV1:
f:status:
manager: kube-controller-manager
time: "2021-07-07T12:01:03Z"
traceContext: {"C", "D"}
traceGeneration: 2
spec:
status:
observedGeneration: 2 |
IIUC, this is essentially the generation of that managedFields entry, and isn't necessarily specific to tracing. It would allow you to answer the question: Which fields have been changed, and are not reflected in the status? The only potential problems I can see are:
|
Yes, this way we can find out the proper managedFields entries, where the trace contexts are what we need to propagate for the controller reconcile.
Yes, this method doesn't work for those custom resources or operators that don't implement observedGeneration, but I'm not sure whether this is an obstacle for advancing this proposal. Without observedGeneration I don't know how to reliably figure out the intents that haven't been reflected...
Sorry, maybe I haven’t understood your concern correctly, could you please explain it in more detail?
Do you mean that there are other situations conflict with the above?
No, actually we just use trace-ids as a kind of global request-ids for logging, here by "propagate the trace contexts" I mean that we transfer the trace contexts as a set via baggage to apiserver, implement our own inject and extract methods, and write to the target objects via golang context. Is it acceptable to use opentelemetry's framework without really doing tracing? |
Many objects have observedGeneration. But there isn't guaranteed to be a single controller operating on an object. If there are two controllers, A and B, and both look at the object's spec, and update different fields in the object's status, then observedGeneration will only be accurate for the controller that writes to observedGeneration. If A is the controller that updates observedGeneration, B might attach a different set of trace contexts depending on whether B operates before or after A updates observedGeneration.
This is just context propagation, and isn't necessarily about tracing. Are all trace contexts sent via baggage, or is one still sent via the trace context header? |
Ah, thanks, I see.
Sorry, I may have been confused before, now I understand. |
There are possible concerns about the traceGeneration idea:
For 1, I heard that there is another idea, which is to create an object to record the linked contexts, it works for the case that multiple traceIDs caused the reconcile, but it may also require extra write for each reconcile loop. In comparison, the traceGeneration idea, which aims for logging, simply propagate the trace contexts as a data set along managedFields mechanism, so I don't think this will cause too much cost. For 2, in fact, things will not become more complex than before. Although there will be many traceIDs in an object, "which ones should be output to the log" depends on "which root request caused this log", I think the traceGeneration idea also works for this, i.e.: before reconcile we output the traceIDs that are traceGeneration > observedGeneration, after reconcile (means there is no traceGeneration bigger than observedGeneration), we just output the newest timestamp traceIDs. So that we can figure out "what happened for a request" by grepping its traceID, and also we can find out the root request when some errors occur. |
A note on the traceGeneration idea:
I'm not sure this always works. For example, a controller may fail to do something, but it still updates the observedGeneration in the status. The next reconcile will still be operating on the previous intent, so it wouldn't be correct to drop it. Thanks for the diagrams. A few questions:
|
Thanks for the review.
The traceGeneration idea:
In above example, 2 still works. If the controller fails to do something but still updates the observedGeneration, then this "status update" will propagate the trace contexts to the managedFieldsEntry that contains observedGeneration, which have the latest timestamp. In the next reconcile, those trace contexts can still be propagated correctly by 2.
Yes, we can improve our method like this: This will ensure that:
So the number only depends on how many concurrent requests act on the different part of an object. Sorry for my lack of production-level experience but I guess in most cases there is only one request in a time. Even if multiple writers case happens, the number of requests should not be that many. Even the worst case, the upper limit should be the number of the fields that can be modified in this object.
Thanks for pointing this out, I think the key point you said is that:
we haven't considered this before, but I feel that this is out of scope of this proposal, since our main purpose is to track root request made by users in order to ease the investigation in log, so we mainly focus on the workload related resources that user usually act on. |
When coding PoC, we found that if we can drop the old trace contexts while persisting the new ones into an object, then we don't even need to use traceGeneration or timestamps to determine the trace contexts that need to be propagated, because for every reconcile, all the trace contexts in the object need to be propagated. |
/unassign |
This essentially "falls back" to only storing a single trace context when something fails. E.g. if two concurrent updates, A and B, are done and fail, any subsequent retries will only be attributed to the last update. When things are failing is probably when it would be most important for me to have these links. The other thing is that when something is failing to reconcile, it may fail for a while. E.g. if a service has an outage for 10 minutes, it will fail a few times. This increases the probability that something will happen "concurrently", because the time window over which something is reconciled is expanded.
IIUC, this essentially amounts to de-duplicating trace contexts (i.e. not having a single trace context present in multiple managedfields entries), which is useful, but i'm primarily concerned about cases where we have a large number of unique trace contexts. This would be the case where a controller action takes multiple unique objects into account.
That is my conclusion as well. But I don't think thats an acceptable upper bound...
What is the scope then? Just a few objects (e.g. deployment, replicaset, pod)? A fairly substantial change like this is hard to justify for a small number of objects. |
Sorry, it seems that I didn’t explain it clearly before, it's not "falls back" to only storing a single trace context.
After "updates the observedGeneration in the status", A and B should have already been stored as a slice "A, B" in another managedFields (a status managedFieldsEntry), the "any subsequent retries" will propagate "A, B" rather than B.
Thanks for pointing this out, yes, the possibility of concurrent requests is high in some cases. When something happen "concurrently", we are able to propagated all of them correctly. No matter A and B are done and fail, or not, there will be only 2 cases:
For 1, A and B will be propagated.
Yes, in addition to the dedupe, actually we also remove the stale ones (the trace contexts that has been moved to the status managedFieldsEntry by status update) when there are new intents come in (i.e. new trace contexts be attached to the meta&spec managedFieldsEntry), therefore a long-running object will tend to have only 1 trace context.
Although the number may reach the upper bound at some time, new requests will flush the old trace contexts, so there will not always be so many trace contexts in an object. This is not a solution for the upper bound, but can mitigate the large number of unique trace contexts. IMHO the upper bound concern is inevitable as long as we attach to managedFields... |
These two seem contradictory to me. Either B will "flush" A (which may or may not be in-progress), or it won't. How do we determine if a new incoming trace ID will be added to the list of trace ids, or if it will replace the current list of trace ids on the object? |
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale |
/remove-lifecycle stale |
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale |
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten |
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /close |
@k8s-triage-robot: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
This KEP is a rewrite of Propose log tracking KEP based on this suggestion.
In brief, the previous KEP contains 2 topics: propagating trace context and adding context information to logs, which can lead to lot of discussion and making it hard to move forward, so rewrite the KEP to only focuses on a generic trace context propagating proposal as suggested.
Old PR #1961
Issue: #1668