Transition tracing for scheduler task transitions #5849

Closed
Tracked by #5736
fjetter opened this issue Feb 22, 2022 · 2 comments

@fjetter
Member

fjetter commented Feb 22, 2022

The worker currently implements a tracing system to link cause and effect and follow all transitions that were triggered by a given stimulus. This trace ID is usually referred to as stimulus_id.

The scheduler generates some of these stimulus_ids and includes them in RPC calls to the worker in a few places. However, it does not trace its own transitions, making it very hard to infer why such a stimulus was generated. Introducing the same system on the scheduler side and including the appropriate IDs in requests to the worker would allow us to close the circle, reconstruct a cluster-wide history, and link all transitions that were caused by an event.
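To illustrate what a cluster-wide history could look like, here is a minimal sketch that merges per-node transition logs and keeps the entries belonging to one stimulus. The `Transition` record, its field names, and `cluster_history` are hypothetical and only assume that every node logs the stimulus_id next to each transition; none of this is existing distributed API.

```python
import heapq
from typing import NamedTuple


class Transition(NamedTuple):
    # Hypothetical log record; the field names are assumptions.
    timestamp: float
    source: str  # "scheduler" or a worker name
    key: str
    start: str
    finish: str
    stimulus_id: str


def cluster_history(stimulus_id, *logs):
    """Merge per-node logs (each already sorted by timestamp) and
    keep only the transitions caused by the given stimulus."""
    merged = heapq.merge(*logs, key=lambda t: t.timestamp)
    return [t for t in merged if t.stimulus_id == stimulus_id]


scheduler_log = [
    Transition(1.0, "scheduler", "x", "released", "waiting", "update-graph-1"),
    Transition(2.0, "scheduler", "x", "waiting", "processing", "update-graph-1"),
    Transition(5.0, "scheduler", "y", "released", "waiting", "update-graph-2"),
]
worker_log = [
    Transition(3.0, "worker-a", "x", "released", "waiting", "update-graph-1"),
    Transition(4.0, "worker-a", "x", "waiting", "ready", "update-graph-1"),
]

history = cluster_history("update-graph-1", scheduler_log, worker_log)
```

This only works if the scheduler forwards the same ID in its requests to the worker, which is the point of the proposal.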

The hardest part is deciding where to generate the unique stimulus_ids: if we simply keep passing the IDs through every call, every transition ends up linked by the same ID.

My thinking is that new stimulus IDs should be generated on the following events (please correct me if I missed anything):

  • (Scheduler) Update graph
  • (Scheduler) Remove worker
  • (Scheduler) Remove client
  • (Scheduler) steal-request
  • (Scheduler) delete-worker-data
  • (Scheduler) everything AMM does
  • (Client/Scheduler) cancel key
  • (Worker) task-finished
  • (Worker) task-erred
  • (Worker) add_keys (new replica)

All other state-modifying handlers should accept a stimulus ID and forward it accordingly through the transition engine.
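As a toy illustration of forwarding, the sketch below threads one stimulus_id from a handler through a simplified transition loop, so every follow-up transition is logged with the ID of the event that caused it. `ToyState`, its methods, and the rule that dependents start processing as soon as one dependency is in memory are simplifying assumptions, not the scheduler's actual logic.

```python
class ToyState:
    def __init__(self, dependents):
        self.dependents = dependents  # key -> keys that depend on it
        self.state = {}               # key -> current state
        self.transition_log = []      # (key, start, finish, stimulus_id)

    def _transition(self, key, finish, stimulus_id):
        """Apply one transition, log it, and return follow-up recommendations."""
        start = self.state.get(key, "released")
        self.state[key] = finish
        self.transition_log.append((key, start, finish, stimulus_id))
        if finish == "memory":
            # Simplification: a finished task recommends all its dependents.
            return {dep: "processing" for dep in self.dependents.get(key, [])}
        return {}

    def transitions(self, recommendations, stimulus_id):
        """Process recommendations until none are left, reusing the same ID."""
        recommendations = dict(recommendations)
        while recommendations:
            key, finish = recommendations.popitem()
            recommendations.update(self._transition(key, finish, stimulus_id))

    def handle_task_finished(self, key, stimulus_id):
        # The handler does not mint an ID; it forwards the one it received.
        self.transitions({key: "memory"}, stimulus_id)


state = ToyState(dependents={"x": ["y", "z"]})
state.handle_task_finished("x", stimulus_id="task-finished-123")
```

After the call, `state.transition_log` holds three entries (for `x`, `y`, and `z`), all tagged with `"task-finished-123"`.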

Similar to the worker, the story should filter not only on keys but also on stimulus IDs.
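A minimal sketch of such a filter, assuming log entries are `(key, start, finish, stimulus_id)` tuples. The `story` function here is a hypothetical stand-in and does not reproduce the signature of the worker's existing method.

```python
def story(log, *keys_or_stimulus_ids):
    """Return the log entries whose key or stimulus_id matches any argument."""
    wanted = set(keys_or_stimulus_ids)
    return [
        entry
        for entry in log
        if entry[0] in wanted or entry[3] in wanted
    ]


log = [
    ("x", "released", "waiting", "update-graph-1"),
    ("y", "released", "waiting", "update-graph-1"),
    ("x", "waiting", "processing", "update-graph-1"),
    ("x", "processing", "memory", "task-finished-2"),
    ("y", "waiting", "processing", "task-finished-2"),
]

by_key = story(log, "y")
by_stimulus = story(log, "task-finished-2")
combined = story(log, "y", "update-graph-1")
```

Filtering by `"task-finished-2"` returns both the transition of `x` into memory and the follow-up transition of `y`, which a key-only filter on `x` would miss.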

@sjperkins
Member

> (Scheduler) everything AMM does

@crusaderky, it looks like there are already stimuli generated by ActiveMemoryManagerExtension._enact_suggestions. The rest of the code seems to populate the suggestions. Do you think the existing stimuli in ActiveMemoryManagerExtension._enact_suggestions are sufficiently general for capturing AMM behaviour?

@crusaderky
Collaborator

> (Scheduler) everything AMM does
>
> @crusaderky, it looks like there are already stimuli generated by ActiveMemoryManagerExtension._enact_suggestions. The rest of the code seems to populate the suggestions. Do you think the existing stimuli in ActiveMemoryManagerExtension._enact_suggestions are sufficiently general for capturing AMM behaviour?

They capture what the AMM decided to do, but not why. The why is currently captured by enabling the (extremely verbose) task_logger. There may be more structured ways.
