Transition tracing for scheduler task transitions #5849

fjetter · 2022-02-22T14:05:30Z

The worker currently implements a tracing system to link cause and effect and follow all transitions that were triggered by a given stimulus. This trace ID is usually referred to as stimulus_id.

The scheduler generates some of these stimulus_ids and includes them in RPC calls to the worker in a few places. However, it does not trace its own transitions, which makes it very hard to infer why such a stimulus was generated. Introducing the same system on the scheduler side and including the appropriate IDs in requests to the worker would allow us to close the circle, reconstruct a cluster-wide history, and link all transitions that were caused by an event.

The most difficult thing to figure out is where to generate new, unique stimulus_ids: if we simply keep passing the same ID through every call, every transition ends up linked by that one ID.

My thinking is that new stimulus IDs should be generated on the following events (please correct me if I missed anything):

(Scheduler) Update graph
(Scheduler) Remove worker
(Scheduler) Remove client
(Scheduler) steal-request
(Scheduler) delete-worker-data
(Scheduler) everything AMM does
(Client/Scheduler) cancel key
(Worker) task-finished
(Worker) task-erred
(Worker) add_keys (new replica)

All other state-modifying handlers should accept a stimulus ID and forward it accordingly through the transition engine.
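The split between events that generate an ID and handlers that only forward one can be sketched roughly as follows. This is a toy model, not the distributed implementation: `ToyScheduler`, `new_stimulus_id`, the handler names, and the log layout are all hypothetical and exist only to illustrate the idea.

```python
import itertools
from dataclasses import dataclass, field

_counter = itertools.count()


def new_stimulus_id(event: str) -> str:
    """Hypothetical helper: build a unique ID for one originating event."""
    return f"{event}-{next(_counter)}"


@dataclass
class ToyScheduler:
    # Each log entry is (key, start_state, finish_state, stimulus_id)
    transition_log: list = field(default_factory=list)
    states: dict = field(default_factory=dict)

    def transition(self, key, finish, *, stimulus_id):
        """Record a state change together with the stimulus that caused it."""
        start = self.states.get(key, "released")
        self.states[key] = finish
        self.transition_log.append((key, start, finish, stimulus_id))

    def update_graph(self, keys):
        """Originating event: a fresh ID is generated here."""
        stimulus_id = new_stimulus_id("update-graph")
        for key in keys:
            self.transition(key, "waiting", stimulus_id=stimulus_id)
        return stimulus_id

    def handle_task_finished(self, key, *, stimulus_id):
        """Forwarding handler: reuses the ID generated by the worker."""
        self.transition(key, "memory", stimulus_id=stimulus_id)


scheduler = ToyScheduler()
graph_id = scheduler.update_graph(["x", "y"])
worker_id = new_stimulus_id("task-finished")  # would be generated on the worker
scheduler.handle_task_finished("x", stimulus_id=worker_id)
```

In this sketch both transitions triggered by the graph update share one ID, while the transition caused by the worker's task-finished message carries the worker's ID, so the two causes stay distinguishable in the log.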

Similar to the worker, the story should filter not only on keys but also on stimulus IDs.
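A minimal sketch of such a filter, assuming log entries are flat tuples of strings shaped like `(key, start, finish, stimulus_id)`. The `story` function and the sample log here are illustrative only and are not the existing worker or scheduler API.

```python
def story(log, *keys_or_stimuli):
    """Return every log entry that mentions any given key or stimulus ID."""
    wanted = set(keys_or_stimuli)
    return [entry for entry in log if any(item in wanted for item in entry)]


# Hypothetical transition log: (key, start, finish, stimulus_id)
log = [
    ("x", "released", "waiting", "update-graph-0"),
    ("y", "released", "waiting", "update-graph-0"),
    ("x", "waiting", "memory", "task-finished-1"),
]

by_key = story(log, "x")
by_stimulus = story(log, "update-graph-0")
```

Filtering by key gives the full history of one task, while filtering by stimulus ID gives every transition that a single event caused, across keys.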


sjperkins · 2022-03-18T13:52:49Z

> (Scheduler) everything AMM does

@crusaderky, it looks like there are already stimuli generated by ActiveMemoryManagerExtension._enact_suggestions. The rest of the code seems to populate the suggestions. Do you think the existing stimuli in ActiveMemoryManagerExtension._enact_suggestions are sufficiently general for capturing AMM behaviour?

crusaderky · 2022-03-18T16:12:58Z

> > (Scheduler) everything AMM does
>
> @crusaderky, it looks like there are already stimuli generated by ActiveMemoryManagerExtension._enact_suggestions. The rest of the code seems to populate the suggestions. Do you think the existing stimuli in ActiveMemoryManagerExtension._enact_suggestions are sufficiently general for capturing AMM behaviour?

They capture what the AMM decided to do, but not why. The why is currently captured by enabling the (extremely verbose) task_logger. There may be more structured ways.
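For reference, a rough sketch of turning on that logger with the standard library. The logger name below is an assumption based on the module layout and may differ between versions, so check `distributed/active_memory_manager.py` before relying on it. The configuration also has to run in the scheduler process to have any effect.

```python
import logging

# Assumed logger name; verify against your installed version of distributed
AMM_TASK_LOGGER = "distributed.active_memory_manager.tasks"

amm_task_logger = logging.getLogger(AMM_TASK_LOGGER)
amm_task_logger.setLevel(logging.DEBUG)

# Send the (very verbose) records to stderr with a timestamp
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(asctime)s %(name)s %(message)s"))
amm_task_logger.addHandler(handler)
```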

fjetter changed the title ~~Issue Title Transition tracing for scheduler task transitions~~ Transition tracing for scheduler task transitions Feb 22, 2022

This was referenced Feb 28, 2022

Client.story - Support collecting cluster-wide story for a key or stimulus ID #5872

Closed

(Worker) State Machine determinism and replayability #5736

Closed

fjetter assigned sjperkins Mar 15, 2022

sjperkins mentioned this issue Mar 17, 2022

Scheduler task transition tracing #5954

Closed

3 tasks

This was referenced Mar 28, 2022

Support stimulus_id's via Wrapped RPC handlers and ContextVars #6010

Closed

Support Stimulus ID's in Scheduler with ContextVars #6046

Closed

This was referenced Apr 7, 2022

Stimulus id contextvars explicit handlers #6083

Closed

Support Stimulus ID's via argument passing #6095

Closed

[Retrospective] Use of ContextVars for passing stimulus_id's within the Scheduler #6107

Open

mrocklin closed this as completed in 3e0f702 Apr 26, 2022

fjetter mentioned this issue Jan 18, 2023

DNM: Tracing Dask with OpenTelemetry #7484

Draft

3 tasks
