
Propose KEP: Leveraging Distributed Tracing to Understand Kubernetes Object Lifecycles #650

Merged (14 commits) on Jan 8, 2020

Conversation

Monkeyanator
Contributor

@k8s-ci-robot
Contributor

Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA.

It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.


  • If you've already signed a CLA, it's possible we don't have your GitHub username or you're using a different email address. Check your existing CLA data and verify that your email is set on your git commits.
  • If you signed the CLA as a corporation, please sign in with your organization's credentials at https://identity.linuxfoundation.org/projects/cncf to be authorized.
  • If you have done the above and are still having issues with the CLA being reported as unsigned, please email the CNCF helpdesk: [email protected]

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Dec 7, 2018
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. sig/pm cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Dec 7, 2018
@Monkeyanator
Contributor Author

Monkeyanator commented Dec 7, 2018

@kubernetes/sig-instrumentation-feature-requests

@k8s-ci-robot k8s-ci-robot added sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. kind/feature Categorizes issue or PR as related to a new feature. labels Dec 7, 2018
@k8s-ci-robot
Contributor

@Monkeyanator: Reiterating the mentions to trigger a notification:
@kubernetes/sig-instrumentation-feature-requests

In response to this:

@kubernetes/sig-instrumentation-feature-requests

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Monkeyanator
Contributor Author

/assign @brancz

Member

@brancz brancz left a comment

First run. I think before I'm comfortable deciding on this architecture I'd like us to do some research and reflect on different possible solutions and dependencies. Generally super excited about this though!


Distributed tracing, on the other hand, provides a single window into latency information from across many components and plugins. Trace data is structured, and there are numerous established backends for visualizing and querying over it. This KEP would make it possible to, for instance, retrieve and visualize all pod startups that took more than 30 seconds, involved an `nginx` container, and which mounted more than two volumes.

In addition, due to the self-healing nature of Kubernetes, regressions wherein latencies are affected but the overall task is eventually accomplished are not uncommon. With our current monitoring architecture, these "soft regressions" are often difficult to observe and diagnose. Collecting structured trace data on per-object latencies would enable us to detect these long-term regressions automatically, and quickly determine their root causes.
Member

Can you be more specific in what you mean by "soft regressions" and how the monitoring architecture is not sufficient?

For this specific use case it sounds like to me that a combination of both improving the metrics instrumentation (which is indeed not good enough today) plus sampling "bad" traces would significantly improve the current debugging process.

Contributor Author

By "soft regression", I mean an issue that doesn't result in a definitive failure, but rather in degraded performance.

You are definitely correct in that even just improving the metrics and sampling bad traces would improve the current process. I think what I was trying to highlight here was that there is potential to plug into existing trace analysis tools to perform automatic root-cause-analysis.

This could make it possible to, for example, detect a latency regression in pod startup and then attribute that regression to a change in some metadata (such as a container version), or notice that the regression only shows up when a pod mounts a certain volume, and so on. Latency metrics lack the structure and context required to perform this kind of analysis.

Will clarify the KEP on this point.

Contributor Author

@dashpole on this as well, who might have a better idea on how this will fit in with existing latency metrics

Contributor

The general point here is that in addition to identifying that a regression has occurred, tracing also helps identify the root cause of the issue.

Contributor

I updated this section to be specific on the problems we are solving.

This KEP proposes the use of the [OpenCensus tracing framework](https://opencensus.io/) to create and export spans to configured backends. The OpenCensus framework was chosen for various reasons:

1) Provides concrete, tested implementations for creating and exporting spans to diverse backends, rather than providing an API specification, as is the case with [OpenTracing](https://opentracing.io/specification/)
2) [Provides an agent](https://github.com/census-instrumentation/opencensus-service) which enables lazy configuration for exporters, batching of spans, and other features
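
As an illustration of what such instrumentation could look like (not prescribed by the KEP), here is a minimal OpenCensus sketch in Go; the Jaeger exporter, the service name, and the span names are assumptions chosen only for this example:

```go
package main

import (
	"context"
	"log"

	"contrib.go.opencensus.io/exporter/jaeger"
	"go.opencensus.io/trace"
)

func main() {
	// Register an exporter; the OpenCensus agent or any other exporter could be used instead.
	exporter, err := jaeger.NewExporter(jaeger.Options{
		CollectorEndpoint: "http://localhost:14268/api/traces",
		Process:           jaeger.Process{ServiceName: "kubelet"},
	})
	if err != nil {
		log.Fatal(err)
	}
	trace.RegisterExporter(exporter)
	defer exporter.Flush()

	// Wrap a unit of work in a span; child spans started from ctx join the same trace.
	ctx, span := trace.StartSpan(context.Background(), "kubelet.SyncPod")
	defer span.End()
	startContainer(ctx)
}

func startContainer(ctx context.Context) {
	_, child := trace.StartSpan(ctx, "kubelet.StartContainer")
	defer child.End()
	// ... call the container runtime here ...
}
```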
Member

This is likely going to need a sig-architecture discussion, as I'm not sure this heavy of a dependency is something we want to carry long term. I don't know enough about OpenCensus, is this really a required component?

Contributor Author

@Monkeyanator Monkeyanator Dec 10, 2018

The OpenCensus agent is not required, but it is the solution we're leaning towards for the initial version. The attractive feature of the agent is that it allows us to configure the destination for our exported traces on-the-fly, and in an out-of-tree component (fewer in-tree changes).

The main alternative to using the agent would be to export spans from the instrumented components themselves directly to the tracing backends (which is what our current implementation work has been doing). This is a valid alternative, and I will update this section in the KEP to discuss it.

Member

I see the reason and benefit of extracting this into the sidecar, but I don't see this feature ever leaving preview or alpha state without this issue being resolved. I'm ok with it at this stage, but I want to have mentioned it upfront, as I have doubts about sig-architecture approving this even as an optional feature, since it's a significant change to how Kubernetes is used/deployed/operated. The OpenCensus team even encountered problems when suggesting that the agent be deployed as a DaemonSet. See "Open Questions" here: https://docs.google.com/document/d/1U2McyGwPIm0win_0uNQqUlPJrrQh1WH5J4m8q8KQyv4/edit#heading=h.rgbw704usq10

Contributor

Added review from sig-instrumentation and sig-architecture on this for beta


To correlate work done between components as belonging to the same trace, we must pass span context across process boundaries. In traditional distributed systems, this context can be passed down through RPC metadata or HTTP headers. Kubernetes, however, due to its watch-based nature, requires us to attach trace context directly to the target object.

In this proposal, we choose to propagate this span context as an encoded string in an object annotation called `trace.kubernetes.io/context`. This annotation value is regenerated and replaced when an object's trace ends, to achieve the desired behavior from [section one](#trace-lifecycle).
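
For illustration only, a rough Go sketch of how a component could read and write such an annotation with OpenCensus; the use of base64 over the OpenCensus binary wire format is an assumption, not something the KEP mandates:

```go
package tracing

import (
	"encoding/base64"

	"go.opencensus.io/trace"
	"go.opencensus.io/trace/propagation"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

const traceAnnotationKey = "trace.kubernetes.io/context"

// SetSpanContext stores a span context on an object so other components can continue the trace.
func SetSpanContext(obj metav1.Object, sc trace.SpanContext) {
	annotations := obj.GetAnnotations()
	if annotations == nil {
		annotations = map[string]string{}
	}
	annotations[traceAnnotationKey] = base64.StdEncoding.EncodeToString(propagation.Binary(sc))
	obj.SetAnnotations(annotations)
}

// SpanContextFromObject recovers the span context, if any, so new spans can be parented to it.
func SpanContextFromObject(obj metav1.Object) (trace.SpanContext, bool) {
	raw, ok := obj.GetAnnotations()[traceAnnotationKey]
	if !ok {
		return trace.SpanContext{}, false
	}
	b, err := base64.StdEncoding.DecodeString(raw)
	if err != nil {
		return trace.SpanContext{}, false
	}
	return propagation.FromBinary(b)
}
```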
Member

I might be missing something, but this seems like it's prone to multiple "traces being started" concurrently causing race conditions where trace contexts are concurrently overwritten.

Contributor Author

As long as we ensure that there's a single state transition that we consider the beginning of a trace, and a single state transition that marks its end, I believe we should be able to avoid any race conditions here.

@dashpole on this as well.

Contributor

It is worth noting that updates to an object's trace annotation should only be done by a single component, usually the controller responsible for updating the status of the object. For example, the kubelet updates the annotation after updating the pod from pending -> running.

Contributor

On further thought, I think I understand where this is coming from. Concurrent updates shouldn't be an issue, as the last update should be the trace context used, but there could be a race between "ending a trace" by replacing the trace context, and "starting a trace" from an update, for example.


This KEP proposes the use of the [OpenCensus tracing framework](https://opencensus.io/) to create and export spans to configured backends. The OpenCensus framework was chosen for various reasons:

1) Provides concrete, tested implementations for creating and exporting spans to diverse backends, rather than providing an API specification, as is the case with [OpenTracing](https://opentracing.io/specification/)
Member

I'm generally a big fan of the motivations and intentions of the OpenCensus project, but I'm a little concerned about it being a rather young project.

Contributor Author

Agreed, the OC project is still quite young. However, based on the fact that this would be an experimental, opt-in alpha feature, it might be acceptable for us to bring in for use provided we stick to its stable features (starting, ending, and exporting spans).


#### Context propagation

To correlate work done between components as belonging to the same trace, we must pass span context across process boundaries. In traditional distributed systems, this context can be passed down through RPC metadata or HTTP headers. Kubernetes, however, due to its watch-based nature, requires us to attach trace context directly to the target object.
Member

I've thought about this before, and I'm not entirely sure this is 100% true; properly solving this just sounds like a larger effort, namely making etcd context/tracing-aware, where any modification call to etcd is carried through etcd and published in the watch event.

Contributor Author

Since the proposal suggests attaching span context to the object metadata, as an annotation, it shouldn't introduce any additional complexity to etcd.

While some of the previous discussion around tracing has called for adding trace awareness to etcd, and hooking into writes for trace points, our proposal doesn't suggest this route. Is this what you mean by "making etcd context/tracing aware?"

Member

I meant that we should technically be able to trace everything despite Kubernetes and its "watch-based nature". Any event from a watch could carry the trace ID of the original change made against the API.

Contributor

added this to the KEP

@wojtek-t wojtek-t self-requested a review December 10, 2018 07:30

* **Logs**: are fragmented, and finding out which process was the bottleneck involves digging through troves of unstructured text. In addition, logs do not offer higher-level insight into overall system behavior without an extensive background on the process of interest.
* **Events**: in Kubernetes are only kept for an hour by default, and don't integrate with visualization or analysis tools. To gain trace-like insights would require a large investment in custom tooling.
* **Latency metrics**: are gathered in some places, but these don't provide understanding into _why_ a given process was slow.
Contributor

Part of the reason why latency metrics aren't a great way to determine why a process was slow is cardinality. You wouldn't, for example, want to attach the container ID to a hypothetical container_start_latency metric, because you would be creating a new metric stream for each container, each with only a single sample taken.
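
A small sketch of that cardinality point, using hypothetical Prometheus metrics (neither metric exists in Kubernetes; the names are assumptions for illustration):

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// Bounded cardinality: one histogram per process, regardless of how many containers start.
var containerStartLatency = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name: "container_start_latency_seconds",
	Help: "Time taken to start a container.",
})

// Unbounded cardinality: labeling by container ID creates a new time series per container,
// each of which typically receives a single sample. Per-object detail like this fits traces
// better than metrics.
var containerStartLatencyByID = prometheus.NewHistogramVec(prometheus.HistogramOpts{
	Name: "container_start_latency_by_id_seconds",
	Help: "Time taken to start a container, labeled by container ID (anti-pattern).",
}, []string{"container_id"})
```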

Member

@wojtek-t wojtek-t left a comment

@gmarek - FYI


Kubernetes is unique in that it is constantly reconciling its actual state towards some desired state. As a result, it has no definitive concept of an "operation", which breaks the traditional model for distributed tracing. This raises the question of when to begin traces, and when to end them.

In this proposal, we choose to _only_ trace phases of an object's lifecycle wherein it's correcting from an undesired state to its desired state, and to end the trace when it enters this desired state. This means that the same object will export traces for each reconciliation it undergoes. This decision was made because:
Member

What if in the meantime desired state changes and we will never reach the original desired one?

Contributor

@dashpole dashpole Jan 2, 2019

The original trace will end prematurely, and subsequent traced actions are attributed to the new desired state. Since we generally care about the slowest reconciliations, ending a trace before the process is complete should be fine.

Member

Assuming that it's the component that has the knowledge about that previous trace...
Anyway - I think any option here is potentially fine, but I would like to see that written down in the KEP to give people a chance to discuss it.

Contributor

I added a pretty lengthy example of how this should work, and an explanation.


In the standard model for distributed tracing, there exists a span in each trace that all other spans are descendants of and which extends the length of the entire trace, called the `root span`.

The Kubernetes component that kicks off an operation might not be the same component that ends it. In this proposal, when we are at the point where we want to end a root span, we craft a span to export which acts as the root span for the trace. For example, when the kubelet updates a pod from `Pending` to `Running`, it creates a root span using the start time of the pod as the start, and the current time as the end.
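
As a rough sketch of retroactively crafting such a root span, here is how it could look with the OpenTelemetry Go API (which later superseded OpenCensus and allows explicit start and end timestamps); the span name `k8s.RunPod` and the choice of the pod's creation timestamp as the start are assumptions for illustration:

```go
package kubelet

import (
	"context"
	"time"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/trace"
	corev1 "k8s.io/api/core/v1"
)

// emitRootSpan retroactively crafts the root span for a pod's startup trace, using the
// pod's creation timestamp as the span start and the moment it became Running as the end.
// ctx is expected to carry the trace context decoded from the pod's annotation so the
// crafted span joins the same trace as the other spans.
func emitRootSpan(ctx context.Context, pod *corev1.Pod) {
	tracer := otel.Tracer("kubelet")
	_, span := tracer.Start(ctx, "k8s.RunPod",
		trace.WithTimestamp(pod.CreationTimestamp.Time))
	span.End(trace.WithTimestamp(time.Now()))
}
```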
Member

What about cases where the component that finishes the operation doesn't know when it was started?
As an example, the action may be triggered by updating an object, and we generally don't persist anywhere information about when the object was updated (not even the last update).

Contributor

Root spans are useful to have, but not critical for using tracing. Essentially what you get by adding a root span is being able to collapse the entire trace during visualizations, as all spans have a common parent. Tracing backends still calculate the total duration of all spans.

The current plan for alpha is to add root spans where possible (creation, deletion), and not where it isn't (update, reconcile).

Contributor

I made a note of this.


In this proposal, we choose to propagate this span context as an encoded string in an object annotation called `trace.kubernetes.io/context`. This annotation value is regenerated and replaced when an object's trace ends, to achieve the desired behavior from [section one](#trace-lifecycle).

This proposal chooses to use annotations as a less invasive alternative to adding a field to object metadata, but as this proposal matures, adding trace context to the official API should be considered.
Member

I would like to see it mentioned very clearly that adding tracing will not result in an increased number of requests to the apiserver - otherwise it may visibly impact the performance of the system, which we definitely don't want.

[That also implicitly means, that "end of an operation" has to be associated with some write request to apiserver, which I'm not 100% convinced will always be the case in cases that we're interested about].

@kubernetes/sig-scalability-api-reviews

Contributor

That's a great point. It definitely will have an impact during alpha, as we are using annotations. We can remove the extra write, at least in theory, if we move context propagation in-tree by adding the ability to update/regenerate the trace context during a status update.

The "end of an operation" always coincides with a status update from a non-desired state to the desired state in the current proposal. This implicitly means objects without a status don't receive new trace contexts outside of creation/update/deletion (I'm not convinced tracing is applicable to such objects). Do you have a case in mind you are not sure about?

Member

It definitely will have an impact during alpha, as we are using annotations.

I think it depends. If e.g. we say that pod creation should start a span, then we can build that into the machinery so that it will be done automatically. So I actually don't agree it has to be the case.

And just to be clear on that: I can live with this requirement not being satisfied in Alpha state, but I'm not going to approve it for beta+ if it will generally be creating higher load on apiserver (there can be some exceptions for some rare flows, but in general it cannot cause additional writes).

Contributor

I added a note, and a graduation requirement for this

@k8s-ci-robot k8s-ci-robot added sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API labels Dec 28, 2018
@dashpole
Contributor

dashpole commented Jan 2, 2019

@wojtek-t I will update the KEP once @Monkeyanator gives me write access.

@dashpole dashpole mentioned this pull request Jan 16, 2019
Member

@justaugustus justaugustus left a comment

Please remove any references to NEXT_KEP_NUMBER and rename the KEP to just be the draft date and KEP title.
KEP numbers will be obsolete once #703 merges.

- "@Random-Liu"
- "@bogdandrutu"
approvers:
- "@brancz"
Contributor

Since @piosz is also a chair of the SIG should @piosz be listed as an approver?

Contributor

I added him, although I believe only one approver is required.

- "@Monkeyanator"
editors:
- "@dashpole"
owning-sig: sig-instrumentation
Contributor

Can you use participating-sigs: to add sig-architecture?

Contributor

done

@MikeSpreitzer
Member

The Kubernetes control plane, and other distributed systems built like it, is indeed lacking an important form of performance observability. In systems built out of procedure calls (local and remote), this need is often addressed by a concept of "tracing" that is built around "spans", where a span corresponds to a procedure call. The Kubernetes control plane, and systems like it, are not primarily built out of procedure calls, and observability based on a concept of spans is not a good fit. This is not to deny that "latency from point A to point B" is a very relevant concept. What I am denying is that the original data should look like procedure calls with relationships among them. Rather, the original data should look like those individual points and relationships between them, because those relationships are much richer than can reasonably be captured by a collection of non-degenerate spans (by "degenerate span" I mean one of zero length, essentially representing an individual point).

In the Kubernetes control plane, work on an object is not done by a tree of procedure calls. The Kubernetes control plane is built out of controllers that monitor the state of various objects and occasionally write part of the state of certain objects. Each write is based on what was revealed by certain earlier reads --- which in turn are simply conveying what was written earlier. In short, the fundamental stuff of control plane activity is partial state writes based on other partial state writes.

For example, consider a pod. We could try to characterize what happens to a pod as a sequence of spans, where each span starts with some client requesting a change (i.e., a create, update, or delete) and ends with the implementation --- the relevant kubelet --- satisfying that request. But that is not even a good explanation of the events at the start of the life a pod. The first major state-setting event of a pod's lifecycle is a client creating the pod API object. That initial state typically does not include a binding to a particular node. The next major event is typically a scheduler doing another state write that binds the pod to a node. The final major event in the startup of a pod is the relevant kubelet doing a state write that indicates that the pod is running.

We could try to model this with spans by building into the model the idea that a pod's startup has a sequence of two spans: one from creation of API object to node binding, another from node binding to running state. We could say that the primary performance data for pod startup is built out of these two kinds of spans.

A pod is a relatively low-level API object in Kubernetes. There are many higher level objects of interest. Analysts whose concern with pods is only about the full startup latency of a pod --- from API object create to running state --- could write queries or code that synthesizes the full startup latency out of the two constituent spans.

But it is not always that way: it is allowed for a pod to be created in a bound state. So a given pod will not necessarily have both spans. The aforementioned analysts could write more complicated queries or code to handle both scenarios.

Perhaps more likely, we could make it "the implementation's" responsibility to create the single span that represents the full startup, and identify the one or two constituent spans as children of the full startup span. What would that implementation code look like? In both OpenTracing and OpenCensus, the parent has to exist before the child is created. So a scheduler would have to create the full-startup span as well as the scheduler-work span. The kubelet would have to be prepared to create the full-startup span if it has not already been created, as well as create the kubelet-work span.

Where are those three spans stored? If the scheduler-work span and the kubelet-work span are sinks in the DAG of spans then they can simply be created when completed and emitted into the span collection framework, leaving only the full-startup span as something that needs to be stored with the pod API object. This also requires the time of the binding write (or create, whichever is appropriate for the pod at hand) to be stored in the API object, so that it is available when the kubelet opens its leaf span. So now we are also storing a state write timestamp in addition to a span. Alternatively, we can say that as soon as the binding is determined for a pod the kubelet-work span is started. This means that we are storing two spans with the API object: the full-startup span and the kubelet-work span. But we will not really be satisfied with requiring the scheduler-work span and the kubelet-work span to be leaves. In both the scheduler and the kubelet there may be a sequence of spans wherein a queue worker works on a given pod, and the parent of those spans (i.e., the full scheduler-work span or the full kubelet-work span) has to be stored with the API object. So we need the API object to hold onto multiple spans: at least the full-startup span plus one for scheduler work or one for kubelet work.

If every object alternated between an idle period, in which the desired state is fully implemented, and an active period, in which "the implementation" is working through a linear sequence of intermediate states (which always occur in the same order, and we allow an intermediate state to take zero time for some objects) along the lines discussed above, then we could always impose a span-based model as discussed above. If the implementation can follow a more general state machine during an active period then it gets more complicated. Each state transition could be modeled as a span, but an analyst interested in anything other than individual state transitions, or code trying to synthesize higher level spans, has a fair amount of complexity to cope with.

The idea of defining a state machine for an object is explicitly rejected as a good general design pattern. See the remarks about "phase" at https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md . Instead of a single monolithic phase it is recommended to have more a granular concept of status (and thus state as a whole). With state divided into several independent parts, what defines the spans for an object? I think someone or something analyzing performance may query some specific intervals, but asking the code to know beforehand exactly which pairs of events will be queried is problematic in general.

Even if we could set aside the generality problems discussed above, we are not done with pods. Consider the case of a StatefulSet, which creates and deletes pods. A StatefulSet also creates and deletes PVC objects to be used in volumes of those pods. The normal scheduler does not attempt to schedule a pod until all its volumes are ready for use; this includes waiting until referenced PVCs are bound to PVs. The controller that binds PVCs to PVs does not know or care whether or not a particular PVC was created by a StatefulSet. The scheduler does not know or care whether a particular pod is a member of a StatefulSet. The relationship between the binding of such a PVC and the scheduling of such a pod is a critical part of the performance story, but this pair of events does not look like a <update desired state of an object, indicate completion of implementation of that object> pair. That is, it does not look like what we have been talking about a span representing. It is an interesting interval, so we could require the scheduler to make a span for every <PVC used in pod volume got bound, pod-got-scheduled> interval. Note that such a span is not all about a single object, which violates the mental model we started with. Note also that a PVC's status already has "conditions", which can be used to record when the PVC got bound. But not every PVC is created by a StatefulSet for one of its pods; a PVC can be created and bound independently of a pod. Even for a PVC created for a pod in a StatefulSet, the PVC could get bound before the pod API object is created. We can not generally make a <PVC got bound, pod got scheduled> span a child of the pod's scheduling span because the former may start before the latter. Similarly, we can not generally use the parent/child relationship in the other direction either. The "FollowsFrom" relation in OpenTracing has the same problem. Actually, I do not see an explicit absolute requirement between start times of related spans in either OpenTracing nor OpenCensus, but I think that there is an intended constraint. OpenCensus also presents the additional difficulty that a given span can have at most one parent.

A more natural model would be to define a span for the PVC controller's work on binding a PVC to a PV and then ask the scheduler to establish a relationship between the PVC binding span and the scheduler's pod scheduling span. This requires the PVC controller's span to persist on the PVC object after the span is finished. This also has the problem that there is no fixed relationship in familiar tracing terms, because, again, either of the two spans in question could start before the other.

In short, the relationship between work on a PVC and work on a pod does not fit into the existing models for relationships between spans.

There are many other examples in Kubernetes of relationships between different kinds of objects. And we can not put API objects into a containment tree. For example, the pods of one ReplicaSet may also contribute to an Endpoints object --- and also that Endpoints object may draw additional content from pods not in that ReplicaSet.

As we have already seen with pods, it is not a given that an object's implementation lies entirely in one controller; even forgetting about PVCs and such, a pod's implementation is divided between scheduler and kubelet. With general granular state, it is not necessarily true that implementation work is handed off along a sequence of controllers.

With a web of relationships between objects with granular state with concurrent bits of implementation in progress, I do not see a clearly good way to model this with spans.

What I do see is that each state write done by a controller is based on some state that controller got in earlier reads (either explicit requests or watch notifications), where each part of that state was, in turn, set by an earlier such write. It is these state writes that are the primitive performance data, and the relationships just stated are the primitive relationships. In addition to drawing what is relevant to a given individual we may want --- just as in Prometheus, or in an SQL database --- to allow an analyst to make various queries against this primitive data and its relationships.

@dashpole
Contributor

dashpole commented Feb 13, 2019

@MikeSpreitzer thanks for the feedback. I've had time to digest it, and think I understand your perspective slightly better now.

The Kubernetes control plane, and other distributed systems built like it, is indeed lacking an important form of performance observability. In systems built out of procedure calls (local and remote), this need is often addressed by a concept of "tracing" that is built around "spans", where a span corresponds to a procedure call. The Kubernetes control plane, and systems like it, are not primarily built out of procedure calls, and observability based on a concept of spans is not a good fit. This is not to deny that "latency from point A to point B" is a very relevant concept. What I am denying is that the original data should look like procedure calls with relationships among them. Rather, the original data should look like those individual points and relationships between them, because those relationships are much richer than can reasonably be captured by a collection of non-degenerate spans (by "degenerate span" I mean one of zero length, essentially representing an individual point).

Distributed tracing is context-aware, structured, distributed latency logging. Though it is mainly used with procedure calls, it isn't limited to procedure calls. The only requirements I can see for using tracing in any system is being able to attach a context to a description of user intent, and propagate it to all components that act on that intent. That is fundamentally why associating a given trace context with an object's desired state is a good way to adopt the tracing model to the watch-based k8s model.

For example, consider a pod. We could try to characterize what happens to a pod as a sequence of spans, where each span starts with some client requesting a change (i.e., a create, update, or delete) and ends with the implementation --- the relevant kubelet --- satisfying that request. But that is not even a good explanation of the events at the start of the life a pod. The first major state-setting event of a pod's lifecycle is a client creating the pod API object. That initial state typically does not include a binding to a particular node. The next major event is typically a scheduler doing another state write that binds the pod to a node. The final major event in the startup of a pod is the relevant kubelet doing a state write that indicates that the pod is running.

We could try to model this with spans by building into the model the idea that a pod's startup has a sequence of two spans: one from creation of API object to node binding, another from node binding to running state. We could say that the primary performance data for pod startup is built out of these two kinds of spans.

A pod is a relatively low-level API object in Kubernetes. There are many higher level objects of interest. Analysts whose concern with pods is only about the full startup latency of a pod --- from API object create to running state --- could write queries or code that synthesizes the full startup latency out of the two constituent spans.

But it is not always that way: it is allowed for a pod to be created in a bound state. So a given pod will not necessarily have both spans. The aforementioned analysts could write more complicated queries or code to handle both scenarios.

Tracing tools already handle absent spans gracefully. For viewing the single trace, the span would simply be absent. Analysis tools aggregate spans with a single span name. So if we had a parent span k8s.CreatePod, and child spans scheduler.SchedulePod and kubelet.StartPod, we can already query over any of the three, regardless of whether scheduler.SchedulePod is present in all traces.

Perhaps more likely, we could make it "the implementation's" responsibility to create the single span that represents the full startup, and identify the one or two constituent spans as children of the full startup span. What would that implementation code look like? In both OpenTracing and OpenCensus, the parent has to exist before the child is created. So a scheduler would have to create the full-startup span as well as the scheduler-work span. The kubelet would have to be prepared to create the full-startup span if it has not already been created, as well as create the kubelet-work span.

Where are those three spans stored? If the scheduler-work span and the kubelet-work span are sinks in the DAG of spans then they can simply be created when completed and emitted into the span collection framework, leaving only the full-startup span as something that needs to be stored with the pod API object. This also requires the time of the binding write (or create, whichever is appropriate for the pod at hand) to be stored in the API object, so that it is available when the kubelet opens its leaf span. So now we are also storing a state write timestamp in addition to a span. Alternatively, we can say that as soon as the binding is determined for a pod the kubelet-work span is started. This means that we are storing two spans with the API object: the full-startup span and the kubelet-work span. But we will not really be satisfied with requiring the scheduler-work span and the kubelet-work span to be leaves. In both the scheduler and the kubelet there may be a sequence of spans wherein a queue worker works on a given pod, and the parent of those spans (i.e., the full scheduler-work span or the full kubelet-work span) has to be stored with the API object. So we need the API object to hold onto multiple spans: at least the full-startup span plus one for scheduler work or one for kubelet work.

We don't actually have to store any spans with the API object to accomplish this. As long as you have the timestamp of the start of a process, you can retroactively construct the parent span. You are correct, that by storing a few more timestamps, we could get a few more traces to wrap, for example, all of the kubelet work in a single span. But the nice thing for now is that we can just skip adding those spans when we don't have the start time, and add them in if/when we add those timestamps. Tracing tools still function even when we are missing parent spans, and just have a collection of child spans. For example, we can have scheduler.SchedulePod and kubelet.StartPod, but not have k8s.CreatePod, and things work just fine. You just can't answer queries about the distribution of k8s.CreatePod latencies.

If every object alternated between an idle period, in which the desired state is fully implemented, and an active period, in which "the implementation" is working through a linear sequence of intermediate states (which always occur in the same order, and we allow an intermediate state to take zero time for some objects) along the lines discussed above, then we could always impose a span-based model as discussed above. If the implementation can follow a more general state machine during an active period then it gets more complicated. Each state transition could be modeled as a span, but an analyst interested in anything other than individual state transitions, or code trying to synthesize higher level spans, has a fair amount of complexity to cope with.

When a kubernetes controller attempts to reconcile desired and actual state for an object, it does at least two steps:

  1. Take some action(s)
  2. Update state

For example, the scheduler does these two steps:

  1. Run algorithm to find the node on which it can place the pod.
  2. Bind the pod to the node.

While (2) is an important part of the reconciliation process, as you point out, it isn't that interesting on its own. Wrapping (1) in a span is far more interesting and useful. As we would expect from the example, the scheduler folks care immensely about how long the schedule-pod algorithm takes, and not at all about how long (2) takes.
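
A minimal sketch, in Go with OpenCensus, of what wrapping (1) in a span could look like inside a controller's reconcile function; `SpanContextFromObject` refers to the hypothetical annotation helpers sketched earlier, and the method names are illustrative placeholders, not part of the proposal:

```go
package scheduler

import (
	"context"

	"go.opencensus.io/trace"
	corev1 "k8s.io/api/core/v1"
)

type scheduler struct {
	// clients, caches, queues ...
}

func (s *scheduler) reconcile(ctx context.Context, pod *corev1.Pod) error {
	// Parent this reconciliation to the trace context carried on the object, if present.
	// SpanContextFromObject is the hypothetical helper from the earlier annotation sketch.
	if sc, ok := SpanContextFromObject(pod); ok {
		var span *trace.Span
		ctx, span = trace.StartSpanWithRemoteParent(ctx, "scheduler.SchedulePod", sc)
		defer span.End()
	}

	// (1) The interesting work: run the scheduling algorithm.
	node, err := s.findNodeFor(ctx, pod)
	if err != nil {
		return err
	}

	// (2) Update state: bind the pod to the chosen node (not usually worth its own span).
	return s.bind(ctx, pod, node)
}

func (s *scheduler) findNodeFor(ctx context.Context, pod *corev1.Pod) (string, error) {
	// Placeholder for the real scheduling algorithm.
	return "node-1", nil
}

func (s *scheduler) bind(ctx context.Context, pod *corev1.Pod, node string) error {
	// Placeholder for the Bind API call.
	return nil
}
```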

The idea of defining a state machine for an object is explicitly rejected as a good general design pattern. See the remarks about "phase" at https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md . Instead of a single monolithic phase it is recommended to have more a granular concept of status (and thus state as a whole). With state divided into several independent parts, what defines the spans for an object? I think someone or something analyzing performance may query some specific intervals, but asking the code to know beforehand exactly which pairs of events will be queried is problematic in general.

I am not suggesting we model anything after a state machine. Let me know if something in the proposal led you to think that, and I can update it to make it clearer.

The spans for an object are any operation which advances the actual state toward the desired state. It doesn't need to be linear, and often many steps happen in parallel. This includes actions like:

  • Running an algorithm (e.g. schedule pod algorithm)
  • Calling out to another service (e.g. create a Persistent Disk from a cloud provider or calling the container runtime to create a container).
  • Updating the actual state
  • Creating/updating another object

Even if we could set aside the generality problems discussed above, we are not done with pods. Consider the case of a StatefulSet, which creates and deletes pods. A StatefulSet also creates and deletes PVC objects to be used in volumes of those pods. The normal scheduler does not attempt to schedule a pod until all its volumes are ready for use; this includes waiting until referenced PVCs are bound to PVs. The controller that binds PVCs to PVs does not know or care whether or not a particular PVC was created by a StatefulSet. The scheduler does not know or care whether a particular pod is a member of a StatefulSet. The relationship between the binding of such a PVC and the scheduling of such a pod is a critical part of the performance story, but this pair of events does not look like a <update desired state of an object, indicate completion of implementation of that object> pair. That is, it does not look like what we have been talking about a span representing. It is an interesting interval, so we could require the scheduler to make a span for every <PVC used in pod volume got bound, pod-got-scheduled> interval. Note that such a span is not all about a single object, which violates the mental model we started with. Note also that a PVC's status already has "conditions", which can be used to record when the PVC got bound. But not every PVC is created by a StatefulSet for one of its pods; a PVC can be created and bound independently of a pod. Even for a PVC created for a pod in a StatefulSet, the PVC could get bound before the pod API object is created. We can not generally make a <PVC got bound, pod got scheduled> span a child of the pod's scheduling span because the former may start before the latter. Similarly, we can not generally use the parent/child relationship in the other direction either. The "FollowsFrom" relation in OpenTracing has the same problem. Actually, I do not see an explicit absolute requirement between start times of related spans in either OpenTracing nor OpenCensus, but I think that there is an intended constraint. OpenCensus also presents the additional difficulty that a given span can have at most one parent.

A more natural model would be to define a span for the PVC controller's work on binding a PVC to a PV and then ask the scheduler to establish a relationship between the PVC binding span and the scheduler's pod scheduling span. This requires the PVC controller's span to persist on the PVC object after the span is finished. This also has the problem that there is no fixed relationship in familiar tracing terms, because, again, either of the two spans in question could start before the other.

In short, the relationship between work on a PVC and work on a pod does not fit into the existing models for relationships between spans.

Ok, I think I owe you at least a hypothetical way we could handle hierarchies in kubernetes... I haven't implemented this, but I hope it shows that it is possible to handle such object relationships relatively elegantly.
There are a couple of key observations I want to start out with:

  • Our goal is to attach the context to user intent, not necessarily a specific object's spec per-se.
  • While a controller is reconciling the desired and actual state of object A, creating or updating object B is an expression of the same user intent as object A.
    • For example, when the StatefulSet controller creates a pod object, that pod represents the same user intent as the StatefulSet.

Therefore, I propose that when a controller, acting in the context of object A, modifies the desired state of object B, it should propagate that context to object B. This means each user-initiated object creation results in a single trace, since all objects created as a result of this have the trace context propagated to them. This captures the relationship between multiple objects created on behalf of a higher-level object, such as a StatefulSet, which creates both PVCs and Pods, as they are connected by the fact that they both are associated with the same StatefulSet. Parent-child relationships mirror kubernetes object owner relationships. This is similar, though not identical, to owner relationships. The owner of an object never changes, whereas the context of a given object is determined by the last controller to modify it, which may not be the same one that created it.
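
A rough sketch of how a controller could do that propagation when creating or updating a child object, reusing the hypothetical annotation key from the earlier sketch; nothing here is part of the KEP itself:

```go
package tracing

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

const traceAnnotationKey = "trace.kubernetes.io/context"

// PropagateTraceContext copies the trace context annotation from the object being reconciled
// (e.g. a StatefulSet) onto an object it is about to create or update (e.g. a Pod or PVC),
// so both are recorded as part of the same trace, mirroring (but not identical to) ownership.
func PropagateTraceContext(from, to metav1.Object) {
	sc, ok := from.GetAnnotations()[traceAnnotationKey]
	if !ok {
		return
	}
	annotations := to.GetAnnotations()
	if annotations == nil {
		annotations = map[string]string{}
	}
	annotations[traceAnnotationKey] = sc
	to.SetAnnotations(annotations)
}
```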

There are many other examples in Kubernetes of relationships between different kinds of objects. And we can not put API objects into a containment tree. For example, the pods of one ReplicaSet may also contribute to an Endpoints object --- and also that Endpoints object may draw additional content from pods not in that ReplicaSet.

The Endpoints object does not have a desired and actual state to reconcile. It is simply a statement of fact.

There is generally a class of "selector" objects, such as a Service, which do not "own" other objects, but rather select over them. We have started moving to a model where such objects, rather than having their own state, inject their state into other objects. See the Pod Ready++ KEP, where the "readiness" of other objects, such as endpoints, is included in the pod status, rather than a separate service status, for example. In this case, setting up a service or setting endpoints actually becomes part of the process of reconciling the pod's actual state, and thus the action of setting up a service or endpoint should use the context of the pod it is acting on when performing the actions/updates required.

As we have already seen with pods, it is not a given that an object's implementation lies entirely in one controller; even forgetting about PVCs and such, a pod's implementation is divided between scheduler and kubelet. With general granular state, it is not necessarily true that implementation work is handed off along a sequence of controllers.

With a web of relationships between objects with granular state with concurrent bits of implementation in progress, I do not see a clearly good way to model this with spans.

I have done my best to answer this above.

What I do see is that each state write done by a controller is based on some state that controller got in earlier reads (either explicit requests or watch notifications), where each part of that state was, in turn, set by an earlier such write. It is these state writes that are the primitive performance data, and the relationships just stated are the primitive relationships. In addition to drawing what is relevant to a given individual we may want --- just as in Prometheus, or in an SQL database --- to allow an analyst to make various queries against this primitive data and its relationships.

I think we should be just as interested in the actual work done by components as the status updates that reflect this work.

owning-sig: sig-instrumentation
participating-sigs:
- sig-architecture
- sig-node
Member

If this proposes changes to api calls (parameters, headers, etc) or api objects (storing new interesting things) then api machinery probably needs to be involved...

Contributor

It doesn't currently, as I plan to use annotations for the alpha stage, but it will if it moves beyond that stage. I'll add api-machinery.

@lavalamp
Member

Therefore, I propose that when a controller, acting in the context of object A, modifies the desired state of object B, it should propagate that context to object B. This means each user-initiated object creation results in a single trace, since all objects created as a result of this have the trace context propagated to them. This captures the relationship between multiple objects created on behalf of a higher-level object, such as a StatefulSet, which creates both PVCs and Pods, as they are connected by the fact that they both are associated with the same StatefulSet. Parent-child relationships mirror kubernetes object owner relationships.

(disclaimer: I haven't read anything but the prior comment)

Kubernetes doesn't make a clear distinction between users and system components.

If the information you want really does form trees, then what is missing from the existing owner references? (Also note that they are not guaranteed to be trees!)

If the information does not form trees (as I expect) then I think it is not a good idea to propagate everything. I do think it would be useful and interesting to store exactly one level of this information (e.g., list the immediate objects that caused the update, but NOT the full context that caused those objects to be last updated).

This was discussed a small amount in today's api machinery SIG. (which I haven't uploaded yet, sorry)

@wojtek-t
Member

@dashpole - thanks a lot for bearing with me with the PRR - it's extremely useful for us. The answers look reasonable to me now, so I will be working on refining the questions so that it will be more obvious for others what we actually expect.

1. Send the trace context stored in `Foo` in the http request context for all API requests. See [Tracing API Requests](#tracing-api-requests)
1. Store the trace context of `Foo` in object `Bar` when updating the Spec of `Bar`. See [Propagating Context Through Objects](#propagating-context-through-objects)
1. Export a span around work that attempts to drive the actual state of an object towards its desired state
1. Replace the trace context of `Foo` when updating `Foo`'s status to the desired state
Contributor

There's one caveat in the controllers that was not covered here. Since controllers are edge driven, if users modify an object twice, one after another, this will generate two update events, both noticed by the controller code. But upon processing, the controller will act on the latest state of the object. For example:

  1. user A sets replicas 2
  2. user B sets replicas 3
  3. controller processes the update and reads replicas 3 (the last state)
  4. controller processes the 2nd event, notices it's already fulfilled, and does nothing.

Based on the above, your trace context from event 1 will be passed down to newly created pods, but you might lose event 2. It's not always the case, but it is something that should be accounted for.

Contributor

See the paragraph that starts with "Components should plumb..." a few paragraphs down. Controllers should use the context associated with the desired state to which they are updating. In your example, Step 3 would actually use the context from the second update, since that is the state to which it is updating.

This does seem to be another case where linking the previous "unfinished" trace to the new one would be helpful. That isn't currently part of the proposal, but is something that has come up as a potential solution to scenarios in which an update occurs before an object reaches its desired state.


We've dealt with a similar issue when modeling state machines. When using tracing APIs like OpenTracing or OpenTelemetry, it is possible to record links to other traces when starting a new span. So if the trace context is stored in an "object" and is later updated to another trace context, they can be linked at that point, capturing the information in the trace about all events, even when replicas=2 was not actually executed.
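
For reference, a small sketch of recording such a link with the OpenTelemetry Go API; the tracer and span names are placeholders, and `oldSC` would be the span context decoded from the annotation that is being replaced:

```go
package tracing

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/trace"
)

// StartLinkedSpan begins a span for work on the new desired state while recording a link
// to the trace of the previous, superseded desired state.
func StartLinkedSpan(ctx context.Context, oldSC trace.SpanContext) (context.Context, trace.Span) {
	return otel.Tracer("controller").Start(ctx, "controller.Reconcile",
		trace.WithLinks(trace.Link{SpanContext: oldSC}))
}
```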

@wojtek-t
Member

@brancz @piosz - although this KEP is probably not yet in implementable shape, I think it has a lot of useful data and is worth merging in "provisional" state. The whole idea behind KEPs was to "merge fast and iterate", and this one has already been hanging around for a year.

@brancz
Member

brancz commented Jan 7, 2020

Yeah I'm happy with merging this in provisional state, I think there are still somewhat contentious points, but we're in agreement that we want this.

/lgtm
/approve

I'm still not entirely convinced that what @soltysh mentioned is alleviated (as in concurrent actions on objects creating new contexts racing with "old in progress" ones), but I think if not then that will show in the implementation.

@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Jan 7, 2020
@dashpole dashpole force-pushed the distributed-tracing-kep branch from 2cb4b71 to 9bd1cd0 Compare January 8, 2020 19:11
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 8, 2020
@dashpole
Contributor

dashpole commented Jan 8, 2020

pushed updates to fix the verify test.

@brancz
Member

brancz commented Jan 8, 2020

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 8, 2020
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: brancz, Monkeyanator

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 82f7787 into kubernetes:master Jan 8, 2020
@k8s-ci-robot k8s-ci-robot added this to the v1.18 milestone Jan 8, 2020