
Scheduler TaskState objects should be unique, not hashed by key #7510

Open
gjoseph92 opened this issue Jan 28, 2023 · 5 comments · May be fixed by #7528

Comments

@gjoseph92
Collaborator

Currently, two TaskState objects with the same key will hash and compare as equal. However, there should only ever be one TaskState object per logical task. Therefore, if two TaskState objects exist with the same key, they must refer to logically different tasks—such as the same key being re-submitted—and should neither compare equal nor hash the same, because they refer to different things.

Using the key as the hash causes errors like #7504.

This is a similar theme to #6593, where @crusaderky already fixed the equivalent problem on the worker side.

Like there, I think we should simply remove __hash__ and __eq__ from TaskState on the scheduler. Then we'll automatically get what we want, where only TaskStates with the same id are equal:

User-defined classes have __eq__() and __hash__() methods by default; with them, all objects compare unequal (except with themselves) and x.__hash__() returns an appropriate value such that x == y implies both that x is y and hash(x) == hash(y).
https://docs.python.org/3/reference/datamodel.html#object.__hash__
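To illustrate the difference, here is a minimal sketch (using hypothetical simplified classes, not the real TaskState) contrasting the current key-based hashing with Python's default identity-based hashing that we would get by removing the custom methods:

```python
class KeyHashedTask:
    # Current behavior: hash and equality derived from the key,
    # so two distinct objects with the same key collapse together.
    def __init__(self, key):
        self.key = key

    def __hash__(self):
        return hash(self.key)

    def __eq__(self, other):
        return self.key == other.key


class IdentityTask:
    # Proposed behavior: no custom __hash__/__eq__, so Python's
    # defaults apply and each instance is equal only to itself.
    def __init__(self, key):
        self.key = key


a, b = KeyHashedTask("x"), KeyHashedTask("x")
assert a == b
assert len({a, b}) == 1  # distinct objects silently merge in a set

c, d = IdentityTask("x"), IdentityTask("x")
assert c != d
assert len({c, d}) == 2  # a re-submitted key stays a distinct object
assert c == c and hash(c) == hash(c)  # still usable in sets and dicts
```

The second class shows the invariant the Python docs describe: x == y implies x is y, which is exactly what we want when each logical task has exactly one TaskState.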

See prior discussion in #6593 (comment), #6585 (comment).

Note that I could see an argument for instead including the recently-added run_id #7463 in the hash, in order to disambiguate between reruns of the same key. That would probably also fix things, for now, but I don't see the advantage of it. Since there should never be multiple TaskStates per task in the first place, what's the need for a custom __hash__ or __eq__ method? The default identity-based method is the simplest and most correct.

@hendrikmakait
Member

hendrikmakait commented Jan 30, 2023

Note that I could see an argument for instead including the recently-added run_id #7463 in the hash, in order to disambiguate between reruns of the same key. That would probably also fix things, for now, but I don't see the advantage of it. Since there should never be multiple TaskStates per task in the first place, what's the need for a custom __hash__ or __eq__ method? The default identity-based method is the simplest and most correct.

I would strongly advise against using run_id in this context. It is meant to distinguish between several physical task executions on the workers of the same logical task, not to distinguish between logical tasks themselves. Distinguishing between logical tasks should be handled on its own.

I generally second having unique logical tasks, but I am not certain if simply removing __eq__ and __hash__ will get us all the way. Wouldn't this render data stored on the cluster with persist() inaccessible by subsequent compute() calls?

@gjoseph92
Collaborator Author

Ah, thanks for clarifying. Makes sense, I hadn't looked that carefully at what run_id actually meant.

Wouldn't this render data stored on the cluster with persist() inaccessible by subsequent compute() calls?

No, because in the subsequent compute call, we look up existing TaskState objects for each key, and if they exist, we reuse them:

ts = self.tasks.get(k)
if ts is None:
    ts = self.new_task(k, tasks.get(k), "released", computation=computation)

@fjetter
Member

fjetter commented Jan 31, 2023

I'm not entirely convinced that simply swapping out the hash function is sufficient. Most mappings that include TaskState objects still use the key which is not unique. To address this problem we need to replace the internal usage of the key with a truly unique ID.

@gjoseph92
Collaborator Author

Most mappings that include TaskState objects still use the key which is not unique

What mappings hold TaskStates by key besides scheduler.tasks? To me, scheduler.tasks is the source of truth for whether a TaskState is stale or not, so there's no issue there. If there are other Mapping[str, TaskState]s somewhere, then yes, they probably should check that scheduler.tasks[ts.key] is ts, not just ts.key in scheduler.tasks. (Same sort of change as in #6585, but with tasks instead of WorkerState.)
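The staleness check described above can be sketched as follows (a hypothetical simplified example; `tasks` stands in for scheduler.tasks and TaskState is reduced to its key):

```python
class TaskState:
    # Simplified stand-in: no custom __hash__/__eq__, so identity
    # semantics apply, as proposed in this issue.
    def __init__(self, key):
        self.key = key


tasks = {}  # stands in for scheduler.tasks, the source of truth

old = TaskState("x")
tasks["x"] = old

fresh = TaskState("x")  # key re-submitted: new object, same key
tasks["x"] = fresh

# A key-based membership check cannot tell stale from current:
assert old.key in tasks          # passes even though `old` is stale

# The identity check can:
assert tasks[old.key] is not old     # `old` is a stale TaskState
assert tasks[fresh.key] is fresh     # `fresh` is the current one
```

This is the `scheduler.tasks[ts.key] is ts` pattern: holding a TaskState elsewhere is only valid while the source-of-truth mapping still points at that exact object.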

I'm not entirely convinced that simply swapping out the hash function is sufficient.

I don't claim that just swapping the hash function is sufficient to resolve all problems with re-submitted tasks and stale TaskState objects with the same key. I'm just saying it's more correct, and it would resolve problems that can occur when storing TaskStates in set-like objects.

It would fix #7504, and seems like the correct behavior—is there a reason not to do it?

@fjetter
Member

fjetter commented Feb 8, 2023

What mappings hold TaskStates by key besides scheduler.tasks?

My argument is more general. I don't think distributed should use keys internally for most of what it is using them for right now. That's a bigger topic, of course.

It would fix #7504, and seems like the correct behavior—is there a reason not to do it?

Not pushing back on this. If you can produce a test for it, I'm happy.
