
Scheduler TaskState objects should be unique, not hashed by key #7510

Open
gjoseph92 opened this issue Jan 28, 2023 · 5 comments · May be fixed by #7528

Comments

@gjoseph92
Collaborator

Currently, two TaskState objects with the same key will hash and compare as equal. However, there should only ever be one TaskState object per logical task. Therefore, if two TaskState objects exist with the same key, they must refer to logically different tasks—such as the same key being re-submitted—and should neither compare equal nor hash the same, because they refer to different things.

Using the key as the hash causes errors like #7504.

This is a similar theme to #6593, where @crusaderky already fixed the equivalent problem on the worker side.

Like there, I think we should simply remove __hash__ and __eq__ from TaskState on the scheduler. Then we'll automatically get what we want, where only TaskStates with the same id are equal:

User-defined classes have __eq__() and __hash__() methods by default; with them, all objects compare unequal (except with themselves) and x.__hash__() returns an appropriate value such that x == y implies both that x is y and hash(x) == hash(y).
https://docs.python.org/3/reference/datamodel.html#object.__hash__
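To illustrate the difference, here is a minimal sketch (using hypothetical simplified classes, not the real TaskState) contrasting the current key-based hashing with Python's default identity-based hashing that we would get by removing the custom methods:

```python
class KeyHashedTask:
    # Current behavior: hash and equality derived from the key,
    # so two distinct objects with the same key collapse together.
    def __init__(self, key):
        self.key = key

    def __hash__(self):
        return hash(self.key)

    def __eq__(self, other):
        return self.key == other.key


class IdentityTask:
    # Proposed behavior: no custom __hash__/__eq__, so Python's
    # defaults apply and each instance is equal only to itself.
    def __init__(self, key):
        self.key = key


a, b = KeyHashedTask("x"), KeyHashedTask("x")
assert a == b
assert len({a, b}) == 1  # distinct objects silently merge in a set

c, d = IdentityTask("x"), IdentityTask("x")
assert c != d
assert len({c, d}) == 2  # a re-submitted key stays a distinct object
assert c == c and hash(c) == hash(c)  # still usable in sets and dicts
```

The second class shows the invariant the Python docs describe: x == y implies x is y, which is exactly what we want when each logical task has exactly one TaskState.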

See prior discussion in #6593 (comment), #6585 (comment).

Note that I could see an argument for instead including the recently-added run_id #7463 in the hash, in order to disambiguate between reruns of the same key. That would probably also fix things, for now, but I don't see the advantage of it. Since there should never be multiple TaskStates per task in the first place, what's the need for a custom __hash__ or __eq__ method? The default identity-based method is the simplest and most correct.

@hendrikmakait
Member

hendrikmakait commented Jan 30, 2023

Note that I could see an argument for instead including the recently-added run_id #7463 in the hash, in order to disambiguate between reruns of the same key. That would probably also fix things, for now, but I don't see the advantage of it. Since there should never be multiple TaskStates per task in the first place, what's the need for a custom __hash__ or __eq__ method? The default identity-based method is the simplest and most correct.

I would strongly advise against using run_id in this context. It is meant to distinguish between several physical task executions on the workers of the same logical task, not to distinguish between logical tasks themselves. Distinguishing between logical tasks should be handled on its own.

I generally second having unique logical tasks, but I am not certain if simply removing __eq__ and __hash__ will get us all the way. Wouldn't this render data stored on the cluster with persist() inaccessible by subsequent compute() calls?

@gjoseph92
Collaborator Author

Ah, thanks for clarifying. Makes sense, I hadn't looked that carefully at what run_id actually meant.

Wouldn't this render data stored on the cluster with persist() inaccessible by subsequent compute() calls?

No, because in the subsequent compute call, we look up existing TaskState objects for each key, and if they exist, we reuse them:

ts = self.tasks.get(k)
if ts is None:
    ts = self.new_task(k, tasks.get(k), "released", computation=computation)

@fjetter
Member

fjetter commented Jan 31, 2023

I'm not entirely convinced that simply swapping out the hash function is sufficient. Most mappings that include TaskState objects still use the key which is not unique. To address this problem we need to replace the internal usage of the key with a truly unique ID.

@gjoseph92
Collaborator Author

Most mappings that include TaskState objects still use the key which is not unique

What mappings hold TaskStates by key besides scheduler.tasks? To me, scheduler.tasks is the source of truth for whether a TaskState is stale or not, so there's no issue there. If there are other Mapping[str, TaskState]s somewhere, then yes, they probably should check that scheduler.tasks[ts.key] is ts, not just ts.key in scheduler.tasks. (Same sort of change as in #6585, but with tasks instead of WorkerState.)
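The staleness check described above can be sketched as follows (a hypothetical simplified example; `tasks` stands in for scheduler.tasks and TaskState is reduced to its key):

```python
class TaskState:
    # Simplified stand-in: no custom __hash__/__eq__, so identity
    # semantics apply, as proposed in this issue.
    def __init__(self, key):
        self.key = key


tasks = {}  # stands in for scheduler.tasks, the source of truth

old = TaskState("x")
tasks["x"] = old

fresh = TaskState("x")  # key re-submitted: new object, same key
tasks["x"] = fresh

# A key-based membership check cannot tell stale from current:
assert old.key in tasks          # passes even though `old` is stale

# The identity check can:
assert tasks[old.key] is not old     # `old` is a stale TaskState
assert tasks[fresh.key] is fresh     # `fresh` is the current one
```

This is the `scheduler.tasks[ts.key] is ts` pattern: holding a TaskState elsewhere is only valid while the source-of-truth mapping still points at that exact object.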

I'm not entirely convinced that simply swapping out the hash function is sufficient.

I don't claim that just swapping the hash function is sufficient to resolve all problems with re-submitted tasks and stale TaskState objects with the same key. I'm just saying it's more correct, and it would resolve problems that can occur when storing TaskStates in set-like objects.

It would fix #7504, and seems like the correct behavior—is there a reason not to do it?

@fjetter
Member

fjetter commented Feb 8, 2023

What mappings hold TaskStates by key besides scheduler.tasks?

My argument is more general. I don't think distributed should use keys internally for most of what it is using them for right now. That's a bigger topic, of course.

It would fix #7504, and seems like the correct behavior—is there a reason not to do it?

Not pushing back on this. If you can produce a test for it, I'm happy.
