Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Scheduler task transition tracing #5954

Closed

Conversation

sjperkins
Copy link
Member

@sjperkins sjperkins commented Mar 17, 2022

@sjperkins sjperkins force-pushed the scheduler-task-transition-tracing branch from 30dba93 to c229a65 Compare March 17, 2022 11:11
@github-actions
Copy link
Contributor

github-actions bot commented Mar 17, 2022

Unit Test Results

       18 files  ±0         18 suites  ±0   10h 15m 31s ⏱️ + 20m 12s
  2 703 tests ±0    2 616 ✔️  - 5       83 💤 +2  4 +3 
24 163 runs  ±0  22 869 ✔️  - 5  1 290 💤 +3  4 +2 

For more details on these failures, see this check.

Results for commit 7bf0117. ± Comparison against base commit 2ff681c.

♻️ This comment has been updated with latest results.

distributed/client.py Outdated Show resolved Hide resolved
distributed/utils_comm.py Show resolved Hide resolved
distributed/utils_comm.py Outdated Show resolved Hide resolved
distributed/worker.py Outdated Show resolved Hide resolved
distributed/worker.py Outdated Show resolved Hide resolved
distributed/scheduler.py Outdated Show resolved Hide resolved
distributed/scheduler.py Outdated Show resolved Hide resolved
distributed/scheduler.py Outdated Show resolved Hide resolved
distributed/scheduler.py Outdated Show resolved Hide resolved
distributed/scheduler.py Outdated Show resolved Hide resolved
@sjperkins
Copy link
Member Author

sjperkins commented Mar 22, 2022

Thanks for the comments @fjetter. I think this is ready for further review.

One observation I'd like to make is that there are a number of members that follow the following pattern:

def do_something(..., stimulus_id=None):
  stimulus_id = stimulus_id of f"do-something-{time()}"

For example, reschedule: #5954 (comment).

These patterns exist to ensure that None stimulus_id's aren't added to the transition_log. Note that valid stimulus_id's are only enforced if SchedulerState._validate is True: see #5954 (comment). I decided that scheduler stability was more important than an invalid transition_log, but I'm interested in hearing other viewpoints and given that stimulus_id=None is a kwarg in many member functions.

A stack-based approach to generating stimulus_id's might be a more robust option for a future PR. Edit: Although this may be impractical given the async paradigm.

@sjperkins sjperkins self-assigned this Mar 22, 2022
@sjperkins sjperkins mentioned this pull request Mar 24, 2022
3 tasks
Copy link
Member

@fjetter fjetter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few minor comments. The biggest thing is that I would like us try my suggestion about keeping the stimulus_id high level and not pass it down to every transition method. That would be really nice, I think. A similar thing could be done on worker side, I believe but I suggest to not change anything about the worker signatures in this PR.

distributed/scheduler.py Outdated Show resolved Hide resolved
distributed/scheduler.py Show resolved Hide resolved
distributed/scheduler.py Show resolved Hide resolved
a: tuple = parent._transition(key, finish, *args, **kwargs)
a: tuple = parent._transition(
key, finish, *args, stimulus_id=stimulus_id, **kwargs
)
recommendations, client_msgs, worker_msgs = a
self.send_all(client_msgs, worker_msgs)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if a cleaner approach to this would be to not add stimulus_id to every transition_X_Y method but instead deal with the required mutations here.

the only thing a transition_X_Y method can (should) do with the stimulus_id is to attach it to a worker or client message. However, why don't we attach this to every worker message here and save ourselves these dirty signature?

e.g.

def transition_memory_forgotten(self, key):
    ...
    # This is the only place we're actually using the stim ID. _propagate_forgotten only adds it to the worker_msgs. we can add this on a higher level and don't need to pass it down into every method.
    _propagate_forgotten(
        self, ts, recommendations, worker_msgs
    )
    return recommendations, client_msgs, worker_msgs

I haven't verified if this works but it would be much less invasive.
Adding the stim ID could be performed as part of send_all where we're iterating over the messages anyhow.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had a look at the 15 transitions in SchedulerState, of which 6 take stimulus_id's, while 9 do not. I think the two numbers are close enough that either approach might be ugly and my vote would be for the ugliness of extra stimulus kwargs in the 15 transition functions.

I was thinking about this a bit more and two other approaches occurred to me. Here's the flavour of the first:

import inspect

class TransitionFunction:
    def __init__(self, fn):
        assert callable(fn)
        self.sig = inspect.signature(fn)

    def stimulus_in_sig(self):
        pass  # implement

    def __callable__(self, *args, **kw):
        if not self.stimulus_in_sig():
            kw.pop("stimulus_id", None)
            return self.fn(*args, **kw)
       return self.fn(*args, **kw)

class SchedulerState:
    self._transitions_table = {
        ("released", "waiting"): TransitionFunction(self.transition_released_waiting),
        ("waiting", "released"): TransitionFunction(self.transition_waiting_released),
        ...
        ("released", "erred"): TransitionFunction(self.transition_released_erred),
    }    

The second idea is not yet fully formed but it might be possible to automatically generate and inject stimulus_id's into the distributed.core.Server class handlers. Then, it might be possible to store the stimulus_id's in ContextVars that can be passed through async/sync call frames and inspected at the point where we need the stimulus_id's. This could be combined with Tensorflow's ideas around variable scoping (e.g. see https://www.tensorflow.org/api_docs/python/tf/name_scope)

It also might be possible to track the call frames inspect.currentframe() back to distributed.core.Server handlers and automatically derive stimulus_id's.

I think the nice thing about this approach is it discard's the need to generate and pass stimulus_id's around the code base -- one could simply retrieve an appropriately generated stimulus. On the other hand, the logic might be too complicated and magical.

Pinging @graingert in case there's some problem with this approach and I spend to much time down this rabbithole.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stated in a simpler way, I'm thinking of an ExitStack-like construct which

  1. Would be automatically initialised with supplied or generated stimulus_id's at the Server handlers
  2. Supports overriding of existing stimulus_id's throughout Server sub-classes.
  3. Is safe for use with asyncio (I think contextvars gives us this).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stated in a simpler way, I'm thinking of an ExitStack-like construct which

Looks like AsyncExitStack is a possibility here.

distributed/scheduler.py Show resolved Hide resolved
distributed/scheduler.py Outdated Show resolved Hide resolved
distributed/worker.py Outdated Show resolved Hide resolved
distributed/worker.py Show resolved Hide resolved
distributed/worker.py Outdated Show resolved Hide resolved
distributed/worker.py Show resolved Hide resolved
@sjperkins sjperkins requested a review from fjetter March 25, 2022 16:30
@sjperkins
Copy link
Member Author

Closed in favour of #6095

@sjperkins sjperkins closed this Apr 8, 2022
@sjperkins sjperkins deleted the scheduler-task-transition-tracing branch April 8, 2022 15:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Transition tracing for scheduler task transitions
2 participants