2 machines, timed out after X seconds #6768
Comments
@windowshopr
It is? It's the latest stable version that gets installed when I run
How do I get a different version? What command should I run?
Cancel that, I see
GOOD NEWS!!!! 😄 haha the update helped. It was a bit of a process, but I'll reproduce all my steps here for others in a similar boat in the future. I was running Python 3.7.9, hence the install not grabbing the latest version of
NOW, I followed the same steps as before, only this time for testing, I did NOT set a
STEPS TO REPRODUCE
So effectively, Machine 1 connects to itself (the scheduler), and Machine 2 connects to machine 1's scheduler. When inspecting http://localhost:8787/status, I see both workers connected and ready to go.
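The dashboard check described above can also be done from Python. The following is a minimal sketch, not part of the original report: the helper names `summarize_workers` and `connected_workers` are hypothetical, and it assumes the `distributed` package is installed and the scheduler is reachable at the address passed in. `Client.scheduler_info()` returns a dictionary whose `"workers"` entry is keyed by worker address.

```python
def summarize_workers(scheduler_info):
    """Map each worker address to its thread count.

    `scheduler_info` is the dictionary returned by Client.scheduler_info().
    Missing keys are tolerated so the helper also works on partial data.
    """
    workers = scheduler_info.get("workers", {})
    return {addr: info.get("nthreads") for addr, info in workers.items()}


def connected_workers(scheduler_address):
    """Connect to a running scheduler and list its workers (hypothetical helper)."""
    # Imported lazily so summarize_workers can be used without distributed installed.
    from distributed import Client

    with Client(scheduler_address) as client:
        return summarize_workers(client.scheduler_info())
```

With the setup in this thread, `connected_workers("tcp://172.16.1.113:8786")` would be expected to list two worker addresses, one per machine, if both workers registered successfully.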
When I run the sample code in VS Code, both machines spin up like normal (albeit much faster this time, not much waiting 👍), and I see outputs in the worker PowerShell windows from sklearn, so the first TPOT generation is running on both machines, which is good. And after a few minutes of the scheduler coordinating some results, I see progress in the
WOOHOO!!! So now I'm running a LAN cluster for my TPOT run! Thanks @gjoseph92 for the direction! Closing issue now.
I opened issue #6731 there by accident, as that's a slightly separate issue, so I was advised to open it here. I took some time to make it easy to read and reproduce with a minimal reproducible example.
I'm having this issue currently. I can provide reproducible code and some details about my setup. This is an extension of my question asked on SO here; I figured out that question, but am now stuck on this Timed out issue.
PROBLEM DESCRIPTION
I'm trying to utilize a distributed cluster across 2 machines for a TPOT machine learning run. I do not think this is an issue with TPOT but rather with distributed. I'm able to set up my scheduler and have both machines connect to it, and the TPOT run starts, but after 2 minutes I get a timed out error on one of the workers, even though both machines are still processing something.
ENVIRONMENT
NETWORK MAP
Machine 1 (Main) - Local IP Address: 172.16.1.113
Machine 2 (Secondary) - Local IP Address: 172.16.1.82
On both machines, I have opened inbound/outbound rules for ports 8786, and 8789-8795 for communication purposes/to try and mitigate any dumb Windows firewall issues.
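To rule out the firewall, the ports above can be probed from the other machine. This is a minimal sketch using only the Python standard library; the helper names `port_is_open` and `unreachable_ports` are hypothetical and not part of distributed. Note that a port only accepts a connection when a process is listening on it, so a `False` result means either that the port is blocked or that nothing is running there yet.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False


def unreachable_ports(host, ports, timeout=2.0):
    """Return the ports on `host` that did not accept a TCP connection."""
    return [p for p in ports if not port_is_open(host, p, timeout)]
```

For example, with the scheduler and the Machine 1 worker already running, calling `unreachable_ports("172.16.1.113", [8786, 8789])` from Machine 2 should return an empty list if the firewall rules are working as intended.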
STEPS TO REPRODUCE
dask-scheduler
to start the scheduler.
dask-worker tcp://172.16.1.113:8786 --nprocs 1 --nthreads 4 --worker-port=8789
dask-worker tcp://172.16.1.113:8786 --nprocs 1 --nthreads 4 --worker-port=8789
So effectively, Machine 1 connects to itself (the scheduler), and Machine 2 connects to machine 1's scheduler. When inspecting http://localhost:8787/status, I see both workers connected and ready to go.
n_samples=5000):
After a few minutes, both machines get spun up and resources are being used. Here is from Machine 1 (main):
Now, after 2 minutes (as that is what I set in the distributed.yaml config file), on Machine 1's worker PowerShell window (NOT the scheduler window), I get the following traceback:
The timed out message comes back every 2 minutes from then on. Both workers continue working for a while (several minutes, doing the TPOT machine learning stuff), then they spin down, and when I check the TPOT output, it looks like nothing has happened that whole time...
So it just hangs, the scheduler is still going, both workers are still showing up in the dashboard, but the worker PowerShell window on Machine 1 just keeps repeating the timed out message and the TPOT run doesn't progress.
ACTIONS ALREADY TAKEN
60s timeout in the config file, which I increased to 120s and I COULD increase more, however I'm not 100% sure this is the issue.
8789. Have also tried setting 1 worker's port to 8789 and the other to 8790 to mitigate same port issues.
I hope this has been detailed enough for reproducibility. I know I'm running TPOT here, but I think it's a distributed issue re: connections timing out in the cluster.
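For readers who want to try the same timeout change from Python instead of editing distributed.yaml, here is a minimal sketch. The post does not say which config key was changed, so the two keys used here (`distributed.comm.timeouts.connect` and `distributed.comm.timeouts.tcp`) are an assumption, and the helper names are hypothetical. It also assumes `dask` is installed.

```python
def timeout_overrides(connect="120s", tcp="120s"):
    """Build a dict of dotted config keys for the comm timeouts.

    The choice of keys is an assumption; the original post only says
    a timeout in the config file was raised from 60s to 120s.
    """
    return {
        "distributed.comm.timeouts.connect": connect,
        "distributed.comm.timeouts.tcp": tcp,
    }


def apply_timeouts(overrides):
    """Apply the overrides to the current process and return the resulting values."""
    # Imported lazily so timeout_overrides works without dask installed.
    import dask

    dask.config.set(overrides)
    return dask.config.get("distributed.comm.timeouts")
```

`dask.config.set` only affects the process it runs in, so a change made in the client script does not reach the scheduler or worker processes started from PowerShell. Those read their own configuration, for example from distributed.yaml or from environment variables such as `DASK_DISTRIBUTED__COMM__TIMEOUTS__CONNECT`.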
As you can see, I'm sort of new to using distributed with TPOT and running it from the command line/PowerShell, however the only guide on it shows only how to run a local cluster (on one machine) not multiple machines, nor how to help with connections timing out like this. I also referenced this page for these issues, however I've already changed the config file and am running all upgraded dependencies.
Would love some input on this!!! I want to crank up both machines to run my TPOT! Thanks!!!!