Scheduler task transition tracing #5954

sjperkins · 2022-03-17T11:09:02Z

Closes Transition tracing for scheduler task transitions #5849
Tests added / passed
Passes pre-commit run --all-files

…| None'

github-actions · 2022-03-17T11:48:08Z

Unit Test Results

      18 files ±0       18 suites ±0 10h 15m 31s ⏱️ + 20m 12s
  2 703 tests ±0   2 616 ✔️ - 5     83 💤 +2 4 ❌ +3
24 163 runs ±0 22 869 ✔️ - 5 1 290 💤 +3 4 ❌ +2

For more details on these failures, see this check.

Results for commit 7bf0117. ± Comparison against base commit 2ff681c.

♻️ This comment has been updated with latest results.

distributed/client.py

distributed/utils_comm.py

distributed/worker.py

distributed/scheduler.py

sjperkins · 2022-03-22T14:25:08Z

Thanks for the comments @fjetter. I think this is ready for further review.

One observation I'd like to make is that there are a number of members that follow the following pattern:

def do_something(..., stimulus_id=None):
  stimulus_id = stimulus_id of f"do-something-{time()}"

For example, reschedule: #5954 (comment).

These patterns exist to ensure that None stimulus_id's aren't added to the transition_log. Note that valid stimulus_id's are only enforced if SchedulerState._validate is True: see #5954 (comment). I decided that scheduler stability was more important than an invalid transition_log, but I'm interested in hearing other viewpoints and given that stimulus_id=None is a kwarg in many member functions.

~~A stack-based approach to generating stimulus_id's might be a more robust option for a future PR~~. Edit: Although this may be impractical given the async paradigm.

fjetter

A few minor comments. The biggest thing is that I would like us try my suggestion about keeping the stimulus_id high level and not pass it down to every transition method. That would be really nice, I think. A similar thing could be done on worker side, I believe but I suggest to not change anything about the worker signatures in this PR.

distributed/scheduler.py

fjetter · 2022-03-24T13:40:25Z

distributed/scheduler.py

-        a: tuple = parent._transition(key, finish, *args, **kwargs)
+        a: tuple = parent._transition(
+            key, finish, *args, stimulus_id=stimulus_id, **kwargs
+        )
        recommendations, client_msgs, worker_msgs = a
        self.send_all(client_msgs, worker_msgs)


I'm wondering if a cleaner approach to this would be to not add stimulus_id to every transition_X_Y method but instead deal with the required mutations here.

the only thing a transition_X_Y method can (should) do with the stimulus_id is to attach it to a worker or client message. However, why don't we attach this to every worker message here and save ourselves these dirty signature?

e.g.

def transition_memory_forgotten(self, key): ... # This is the only place we're actually using the stim ID. _propagate_forgotten only adds it to the worker_msgs. we can add this on a higher level and don't need to pass it down into every method. _propagate_forgotten( self, ts, recommendations, worker_msgs ) return recommendations, client_msgs, worker_msgs

I haven't verified if this works but it would be much less invasive.
Adding the stim ID could be performed as part of send_all where we're iterating over the messages anyhow.

I had a look at the 15 transitions in SchedulerState, of which 6 take stimulus_id's, while 9 do not. I think the two numbers are close enough that either approach might be ugly and my vote would be for the ugliness of extra stimulus kwargs in the 15 transition functions.

I was thinking about this a bit more and two other approaches occurred to me. Here's the flavour of the first:

import inspect class TransitionFunction: def __init__(self, fn): assert callable(fn) self.sig = inspect.signature(fn) def stimulus_in_sig(self): pass # implement def __callable__(self, *args, **kw): if not self.stimulus_in_sig(): kw.pop("stimulus_id", None) return self.fn(*args, **kw) return self.fn(*args, **kw) class SchedulerState: self._transitions_table = { ("released", "waiting"): TransitionFunction(self.transition_released_waiting), ("waiting", "released"): TransitionFunction(self.transition_waiting_released), ... ("released", "erred"): TransitionFunction(self.transition_released_erred), }

The second idea is not yet fully formed but it might be possible to automatically generate and inject stimulus_id's into the distributed.core.Server class handlers. Then, it might be possible to store the stimulus_id's in ContextVars that can be passed through async/sync call frames and inspected at the point where we need the stimulus_id's. This could be combined with Tensorflow's ideas around variable scoping (e.g. see https://www.tensorflow.org/api_docs/python/tf/name_scope)

It also might be possible to track the call frames inspect.currentframe() back to distributed.core.Server handlers and automatically derive stimulus_id's.

I think the nice thing about this approach is it discard's the need to generate and pass stimulus_id's around the code base -- one could simply retrieve an appropriately generated stimulus. On the other hand, the logic might be too complicated and magical.

Pinging @graingert in case there's some problem with this approach and I spend to much time down this rabbithole.

Stated in a simpler way, I'm thinking of an ExitStack-like construct which

Would be automatically initialised with supplied or generated stimulus_id's at the Server handlers

Supports overriding of existing stimulus_id's throughout Server sub-classes.

Is safe for use with asyncio (I think contextvars gives us this).

Stated in a simpler way, I'm thinking of an ExitStack-like construct which

Looks like AsyncExitStack is a possibility here.

distributed/scheduler.py

distributed/worker.py

This reverts commit 925c2bd.

sjperkins · 2022-04-08T15:08:25Z

Closed in favour of #6095

sjperkins added 4 commits March 16, 2022 17:21

Initial commit

2957f38

Defer to parent stimulus in transition_processing_memory

e2f54a3

Change exception_blame type to TaskState | None

5307c0e

Change exception_blame back to TaskState and make stimulus_id's 'str …

c229a65

…| None'

sjperkins force-pushed the scheduler-task-transition-tracing branch from 30dba93 to c229a65 Compare March 17, 2022 11:11

sjperkins added 9 commits March 17, 2022 15:45

Remove stimulus_id type annotations

d010b2d

Add stimulus_id's in various places

6cfc1ef

Merge branch 'main' into scheduler-task-transition-tracing

2ced7c1

Add stimulus_id's in state mutating handlers

cb15e74

Reintroduce log handlers

21302bd

Merge branch 'main' into scheduler-task-transition-tracing

2cf4a8a

Checkpoint running test suite

1670906

Remove cruft

8c577f8

Add stimulus_id to scheduler transition_log

dabfefe

fjetter reviewed Mar 22, 2022

View reviewed changes

sjperkins added 5 commits March 22, 2022 13:55

Support worker + client stimuli, stimuli in http

be43e63

assert_worker_story -> assert_story and test scheduler story

45bf7de

Remove stimulus_id's from code paths that don't change TaskState

80ffddf

Merge branch 'main' into scheduler-task-transition-tracing

a246197

Further stimulus_id changes in worker.py

1a72c88

sjperkins commented Mar 22, 2022

View reviewed changes

distributed/scheduler.py Show resolved Hide resolved

sjperkins commented Mar 22, 2022

View reviewed changes

distributed/scheduler.py Show resolved Hide resolved

sjperkins self-assigned this Mar 22, 2022

Add missing stimulus_id to free-keys worker msg

7c30cfa

sjperkins mentioned this pull request Mar 24, 2022

Client story #5987

Closed

3 tasks

fjetter reviewed Mar 24, 2022

View reviewed changes

sjperkins added 2 commits March 25, 2022 16:01

Review changes

b0fd952

Merge branch 'main' into scheduler-task-transition-tracing

90039fb

sjperkins added 3 commits March 25, 2022 16:41

Fix stray assert_worker_story -> assert_story

3affd07

Used new_stimulus_id in Worker.execute

925c2bd

Revert "Used new_stimulus_id in Worker.execute"

6e58f91

This reverts commit 925c2bd.

sjperkins requested a review from fjetter March 25, 2022 16:30

sjperkins added 2 commits March 31, 2022 14:53

Merge branch 'main' into scheduler-task-transition-tracing

6d6a0ee

Different stimulus_id's success/failure in Worker.execute

7bf0117

sjperkins mentioned this pull request Apr 4, 2022

Support Stimulus ID's in Scheduler with ContextVars #6046

Closed

3 tasks

sjperkins closed this Apr 8, 2022

sjperkins deleted the scheduler-task-transition-tracing branch April 8, 2022 15:20

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Scheduler task transition tracing #5954

Scheduler task transition tracing #5954

sjperkins commented Mar 17, 2022 •

edited

Loading

github-actions bot commented Mar 17, 2022 •

edited

Loading

sjperkins commented Mar 22, 2022 •

edited

Loading

fjetter left a comment

fjetter Mar 24, 2022

sjperkins Mar 25, 2022

sjperkins Mar 25, 2022

sjperkins Mar 28, 2022

sjperkins commented Apr 8, 2022

Scheduler task transition tracing #5954

Scheduler task transition tracing #5954

Conversation

sjperkins commented Mar 17, 2022 • edited Loading

github-actions bot commented Mar 17, 2022 • edited Loading

Unit Test Results

sjperkins commented Mar 22, 2022 • edited Loading

fjetter left a comment

Choose a reason for hiding this comment

fjetter Mar 24, 2022

Choose a reason for hiding this comment

sjperkins Mar 25, 2022

Choose a reason for hiding this comment

sjperkins Mar 25, 2022

Choose a reason for hiding this comment

sjperkins Mar 28, 2022

Choose a reason for hiding this comment

sjperkins commented Apr 8, 2022

sjperkins commented Mar 17, 2022 •

edited

Loading

github-actions bot commented Mar 17, 2022 •

edited

Loading

sjperkins commented Mar 22, 2022 •

edited

Loading