Distributed request tracing #4718
Certainly, when we started adding more structured logging, like the transitions log, debuggability went way up and development became smoother.
Also, my recollection is that cause= in some cases refers to another key. I could be wrong about this, though, or I could be associating this memory with some other method.
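To illustrate the point about structured logging, here is a minimal sketch of what a structured transition record can look like. This is hypothetical code, not the actual transitions log in distributed; the field names are illustrative:

```python
import json
import logging

logger = logging.getLogger("worker.transitions")


def log_transition(key, start, finish, cause=None):
    """Emit one machine-parseable record per state transition.

    Hypothetical helper: because every field is structured rather than
    free text, questions like "why did this key move?" can be answered
    by filtering the log instead of re-running with a debugger.
    """
    record = {"key": key, "start": start, "finish": finish, "cause": cause}
    logger.info(json.dumps(record))
    return record


log_transition("inc-123", "waiting", "ready", cause="dependency-arrived")
```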
On Tue, Apr 20, 2021 at 8:06 AM, Florian Jetter wrote:
I frequently find myself in a position where I need to answer questions like: "Who triggered this coroutine, and why?", "Was this triggered by a remote call via a stream handler or by some internal mechanism? If it was remote, who was the caller?", or "Is this the result of a cascade of tasks, or was it an isolated request?"
Most, if not all, of these questions can be answered with different debugging techniques, ranging from print statements, to logging with some context information, to using a debugger. With the heavy use of asyncio, however, these techniques hit their limits and are also not straightforward to set up.
We started adding keywords like cause or reason to some functions, e.g.
Worker.release_key
<https://github.com/dask/distributed/blob/d5fc324bdef22f19d77c3d36e63a3778ceaac0b0/distributed/worker.py#L2410>
which adds a free-text field that can optionally be used to attach some context information. This hasn't been used consistently throughout the code base (as you can see, this method even has two different kwargs for the same thing, and one of them is unused; guess which?)
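To make the pattern concrete, here is a hypothetical sketch of the kind of signature described above. The names and body are illustrative, not the real Worker.release_key; the point is that nothing enforces that either free-text kwarg is actually used:

```python
# Hypothetical sketch of the anti-pattern described above: two free-text
# context kwargs coexist, and only one of them ever reaches the record.
def release_key(key, cause=None, reason=None):
    # Only ``reason`` makes it into the record; ``cause`` is silently
    # dropped, which is exactly the drift that free-text kwargs invite.
    return {"key": key, "reason": reason}


# A caller passing ``cause`` loses the context without any warning:
release_key("x", cause="worker-restart")  # → {'key': 'x', 'reason': None}
```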
In the microservice world, this problem has been *solved* by distributed tracing. Oversimplified, it works by attaching a unique ID to every request and including that ID in subsequently spawned requests, and so on. This makes it possible to correlate a cascade of events with its initial trigger and helps one understand why and how certain things are happening.
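As a sketch of the idea in Python terms (hypothetical code, not an existing distributed API): a contextvars.ContextVar is copied into every task spawned while it is set, so a single trace ID can follow a whole cascade of coroutines without being threaded through every function signature.

```python
import asyncio
import contextvars
import uuid

# The current trace ID; ``None`` means "no request in flight".
trace_id = contextvars.ContextVar("trace_id", default=None)


async def handle_request():
    # Entry point of a cascade: mint a fresh ID for this request.
    trace_id.set(uuid.uuid4().hex)
    # create_task copies the current context, so the child coroutine
    # inherits the same trace ID automatically.
    child = asyncio.create_task(inner_work())
    return trace_id.get(), await child


async def inner_work():
    # Downstream work can log or forward the ID it inherited.
    return trace_id.get()


parent_id, child_id = asyncio.run(handle_request())
assert parent_id == child_id  # the whole cascade shares one ID
```

Each call to handle_request mints a distinct ID, so events from independent requests remain distinguishable in the logs.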
I'm wondering whether we could benefit from something like this in our distributed context as well, in particular for the administrative work that the scheduler, workers, nannies, etc. perform.
If this is considered useful, I'm wondering whether we should come up with our own definition for such metadata or adopt some existing tooling. If there is tooling, what would be appropriate?
Any thoughts or previous experience on this topic?
Indeed. That makes things much worse, since it is sometimes used as a task key with actual semantic meaning and sometimes just as added context information.
Yes, I think the important bit here is the structured logging part. Structuring this information and storing it somewhere would allow something similar to …