-
-
Notifications
You must be signed in to change notification settings - Fork 723
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Resolve work stealing deadlock caused by race in move_task_confirm #5379
Conversation
New tests seem to be a bit brittle and there is a cythonization error. Will need to look into it |
0af8aba
to
762a0c2
Compare
Most of the test failures are unrelated
However, there appears to be a genuine failure in |
747fb59
to
6f9fe81
Compare
"released", | ||
None, | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think these constant names generate confusion with Worker.status.
Could you change them e.g. to _TASKSTATE_STATE_CONFIRM etc.?
On a related note, this IMHO highlights that TaskState.state direly needs to become an Enum - it's just too easy to miss a state here, now or in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you change them e.g. to _TASKSTATE_STATE_CONFIRM etc.?
sure
this IMHO highlights that TaskState.state direly needs to become an Enum
agreed. I would prefer doing this in another PR if you don't mind
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I missed this outstanding request. @fjetter would you mind pushing up a separate PR with the updated name @crusaderky suggested?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agreed. I would prefer doing this in another PR
Yes obviously
1d082fa
to
1e19654
Compare
1e19654
to
bb3899b
Compare
Test failures appear to be unrelated. There is again the cythonization error #5406 which I will address in a follow up PR |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @fjetter for fixing and @crusaderky for reviewing
This refactors the move_task_confirm and resolves a few minor bugs
self.scheduler.remove_worker
after an unccesseful event called the function improperly (argument mismatch) and would raise either way. Removing the worker is not necessary since the batchedcomm takes care of this and trigger all relevant key reschedulingsThe most important bug is a deadlock caused by a concurrent race condition of stealing requests. I do not prohibit concurrent steals for the same key but rather make the system robust to concurrency if it should happen. The concurrency control I'm applying is an ETag like transaction confirmation using a stealing stimulus ID which is further passed through the transitions on worker side such that it can be traced. Every received response will now be logged other than before were certain code branches were ignored making debugging harder.
pre-commit run --all-files