Task stuck in "processing" on closed worker #6263
One other thing: there are a bunch of these KeyErrors in the scheduler logs, not for the specific worker I chose above, but for some of the other 🧟s.
@gjoseph92 do you have time this week to take a look at this? Somehow the scheduler has assigned a task to run on a closed WorkerState.
Thanks @bnaul, I'll take a look this week.
@bnaul to confirm, what version of `distributed` is this?
I think that Brett has been on main for a little while (or main whenever this was).
That's what I thought too, and why I wanted to know. There have been a lot of changes recently to the areas that this touches, so getting a specific commit would be helpful.
Dumping a ton of notes here from investigation. There are two major bugs here; I'll eventually open tickets for each.
1. After the scheduler tells a worker to close, it may reconnect to the scheduler. This has a bit of overlap with #5480, and sort of #6341. I think that #6329 might alleviate one side, but there's a more general race condition that's probably been around for a long time. (See the toy sketch after these notes for how the interleaving can play out.)
Discussion: finding yet another bug in worker reconnection might feel like another vote for #6350. The real bug here is how [...]. It shouldn't be that hard to fix. (It may be harder to ensure it doesn't get broken again in the future, though; the design is still brittle.)
2. If a worker reconnects, a task can be stolen to the [...]
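To make the race in (1) above concrete, here is a minimal toy sketch of the interleaving, written with plain asyncio; the `ToyScheduler` class and its methods are invented for illustration and are not distributed's real internals:

```python
import asyncio

# Toy stand-in for the scheduler's worker bookkeeping. Names and structure
# are made up for illustration only.
class ToyScheduler:
    def __init__(self):
        self.workers = {}

    def add_worker(self, address):
        self.workers[address] = {"status": "running", "tasks": set()}

    async def remove_worker(self, address):
        # The scheduler decides the worker should close and forgets its state...
        self.workers.pop(address, None)

    async def handle_heartbeat(self, address):
        # ...but a heartbeat that was already in flight arrives afterwards.
        # If unknown workers get re-registered here, the "closed" worker comes
        # back as a fresh zombie entry that tasks can then be assigned to.
        if address not in self.workers:
            self.add_worker(address)
        return "ok"

async def main():
    scheduler = ToyScheduler()
    scheduler.add_worker("10.126.160.29:33011")

    # The heartbeat is sent just before the worker processes the close request.
    late_heartbeat = asyncio.create_task(
        scheduler.handle_heartbeat("10.126.160.29:33011")
    )
    await scheduler.remove_worker("10.126.160.29:33011")  # scheduler closes the worker
    await late_heartbeat                                   # late heartbeat lands afterwards

    print(scheduler.workers)  # the closed worker is back as a zombie entry

asyncio.run(main())
```

The point is only the ordering: the removal completes first, then the already-in-flight heartbeat re-registers the address, leaving a zombie worker behind.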
Thanks @gjoseph92, super interesting! And yes, I believe I had just bumped to 2022.5.0, but if not it was main from probably the day of release.
Hey @bnaul, could you try things out with the latest release (2022.5.1)? The buggy codepath of #6356 / #6392 is still around, but with worker reconnect removed (#6361), I think it's much, much less likely to be triggered. I'd be surprised if you still see this issue. Though I'd like to keep it open until #6356 is actually fixed.
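For anyone trying to confirm which versions their cluster is actually running, a small sketch of the standard checks (`Client()` starts a throwaway local cluster here; on a real deployment you would pass the scheduler address instead):

```python
import dask
import distributed
from distributed import Client

print("dask:", dask.__version__)
print("distributed:", distributed.__version__)

# Compare client, scheduler, and worker versions; check=True raises
# if they don't match.
client = Client()
print(client.get_versions(check=True))
```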
I'm still seeing an issue like this with the May, April, and March releases of Dask and Distributed. About 300 tasks are processed across my 32-core machine, and I can see in htop that the cores are firing at 100%. Then core use drops to near 0 and tasks are stuck in "processing". I'm testing by reinstalling both dask and distributed from pip, and I wasn't encountering this error back in April. Is there something that would cause this error to occur across dask versions if it was run with the version with the bug? EDIT: bringing [...]
@rbavery generically, what you're describing sounds like a deadlock, but there's not enough information here to be sure that you're seeing the deadlock referenced in this particular issue. (There are a number of different bugs which can cause deadlocks; the overall symptom of "tasks stuck in processing" is the same.) Could you open a new issue for this, and include any logs from the scheduler and workers you can get? Also, could you make sure you're using the latest version (2022.5.1 or 2022.5.2)? We fixed one of the more common sources of deadlocks in the release 2 days ago. (Though if you're running a local cluster, I would not expect this one to affect you.)
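As a sketch of what gathering that information can look like with the stock `Client` helpers (the filename is just an example, and exact output formats may vary by version):

```python
from distributed import Client

client = Client()  # or Client("tcp://<scheduler>:8786") for a deployed cluster

# Recent log lines the scheduler and each worker have kept in memory.
scheduler_logs = client.get_scheduler_logs()
worker_logs = client.get_worker_logs()   # dict keyed by worker address

# Full snapshot of scheduler/worker state, very useful when reporting deadlocks;
# with the default format this writes stuck-cluster.msgpack.gz.
client.dump_cluster_state("stuck-cluster")

print(scheduler_logs)
```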
Anything interesting in the logs? If you're running a local cluster you might want `client = Client(silence_logs=False)`.
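Spelled out as a runnable example (the worker counts are arbitrary; `silence_logs` only applies when `Client()` creates its own local cluster):

```python
from distributed import Client

# silence_logs=False keeps the log output of the implicitly created
# LocalCluster's scheduler and workers visible in the console.
client = Client(silence_logs=False, n_workers=2, threads_per_worker=1)

futures = client.map(lambda x: x + 1, range(10))
print(client.gather(futures))
```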
Thanks for the tips @gjoseph92 and @mrocklin. I tested 2022.5.2 but still ran into this. I'll open a new issue.
@gjoseph92 IIUC we currently assume this should be closed by #6356? |
By whatever fixes #6356, yes |
Similar at a high level to #6198 but a slightly different manifestation: the dashboard shows 9 remaining tasks (one is a parent task that spawned the other 8 by calling `dd.read_parquet`; see the illustrative sketch at the end of this post), but the Info page shows only the one parent task processing.
In the case of #6198 the worker showed up in the scheduler Info page (but would 404 when you tried to click through to its info); here the scheduler knows the workers are gone, but there are still tasks assigned to them anyway:
Zooming in on the first closed worker `10.126.160.29:33011`, relevant scheduler logs: [...]
And worker logs: [...]
Also finally got a successful cluster dump 🎉
cc @gjoseph92 @fjetter @crusaderky @mrocklin
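To illustrate the workload shape mentioned above (one parent task that spawns the `dd.read_parquet` subtasks), here is a hypothetical reconstruction, not the reporter's actual code: the dataset, path, and `load_table` helper are invented, and it assumes `pyarrow` or `fastparquet` is installed for parquet I/O.

```python
import pandas as pd
import dask.dataframe as dd
from distributed import Client, worker_client

def load_table(path):
    # Runs on a worker; reading the parquet dataset from inside a task is the
    # "one parent task spawned the other 8" shape described in the report.
    with worker_client() as client:
        ddf = dd.read_parquet(path)
        return client.compute(ddf).result()

if __name__ == "__main__":
    # Tiny stand-in dataset so the sketch is self-contained.
    dd.from_pandas(pd.DataFrame({"x": range(80)}), npartitions=8).to_parquet("toy_parquet")

    client = Client()  # the real report used a much larger, remote cluster
    df = client.submit(load_table, "toy_parquet").result()
    print(len(df))
```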