[SPARK-4325] Use pssh. #80
Conversation
break;
fi
done
echo "SSH-ing to all cluster nodes to approve keys..."
This is less a performance optimization and more just code trimming. Now that we check for SSH availability across the cluster in spark_ec2.py, there is no need for the retry logic here.
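For context, a hedged reconstruction of the kind of retry logic being trimmed here; the variable names and retry count are illustrative, not the script's exact code:

for i in `seq 1 5`; do
  # Retry a few times, waiting for sshd on the node to come up.
  if ssh $SSH_OPTS "$node" true; then
    break;
  fi
  sleep 2
done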
Just curious: does spark_ec2.py now fail if one slave doesn't come up? Sometimes when I launch a large number of machines (>100), I run into cases where 1 or 2 machines just don't come up.
Good question. I know @danosipov reported a similar issue.
With the SSH wait logic in spark_ec2.py, we will just wait indefinitely if a slave never comes up. That's because we wait until SSH becomes available across all the nodes in the cluster before proceeding with setup.sh.
Perhaps in the future we want to do something more intelligent, but for now I presume this is OK.
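For illustration, the wait behavior described here amounts to something like the following sketch; the node list variable and timeouts are assumptions, and the real check lives in spark_ec2.py:

for node in $ALL_NODES; do
  # Block until sshd answers on this node; loops forever if it never comes up.
  while ! ssh $SSH_OPTS -o ConnectTimeout=3 "root@$node" true 2> /dev/null; do
    sleep 5
  done
done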
sleep 0.3
done
ssh $SSH_OPTS localhost echo -n &
ssh $SSH_OPTS `hostname` echo -n &
I'm not sure what this line is for. Presumably, if we've SSHed to $MASTERS and localhost, we don't need hostname, no?
On EC2 there are two hostnames, one internal (of the form ip-127-1-1-1.ec2.internal) and one external (of the form ec2-54-227-51-123.compute-1.amazonaws.com). We typically pass in the latter in $MASTERS, and hostname usually returns the former.
Even though we try to only use the external hostname in all our configs, it is better practice to approve keys for both hostnames.
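As a rough sketch, approving keys for both names would look something like this; the external hostname is the example from the comment above, and SSH_OPTS is assumed to be defined elsewhere in the script:

# External name, as typically passed in $MASTERS.
ssh $SSH_OPTS ec2-54-227-51-123.compute-1.amazonaws.com echo -n &
# Internal name, as typically returned by hostname.
ssh $SSH_OPTS `hostname` echo -n &
wait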
OK, I'll add it to the pssh version of the call.
Done.
This looks good to me. But I still need to launch a cluster using this change just to verify things. Will merge this after doing that.
I did notice one issue. We run
Oh, derp. That will cost us 10 seconds. But wait, why do we delete
There is a comment above it that says we do it to get rid of old hosts which accumulate on start/stops. One solution is to just move that out from setup-slave.sh to setup.sh at the top (but then we'll need to do it on the slaves?)
Ah, I'm currently on a mobile device; didn't check the source. OK, I'll revisit this later this week and see what to do. Maybe the old way will do fine for now.
@shivaram How about simply approving SSH keys twice? Once at the beginning and once after
Hmm, this should be fine. Do you know how much latency this adds? (I'm wondering if it's a bad idea for large clusters like >100 machines, etc.)
I'll test it out with a 50-node cluster and report back.
I'm using a new AWS account and have lower instance limits than expected, so I can't spin up a 50- or 100-node cluster. Just submitted a request for a higher limit. Note that I intend to revert d9333af as soon as I get to this testing.
@shivaram, the added latency is as follows:
I haven't tested with larger clusters, but it seems reasonable to extrapolate that for a cluster with 500 slaves, it would add at least 17 seconds of latency. Is this acceptable, or shall we go back to running
Thanks for the measurements. One thing I am wondering is if we can remove the
Hey @shivaram, I actually tried removing both instances where we pre-approve SSH keys and launching worked fine, thanks to these options deactivating strict host checking. Is there any need for the pre-approval step? I think we can just take it out entirely, no?
Hmm - you are right. The SSH_OPTS should take care of approving the keys the first time we ssh (will the first time be using rsync?). I think the original reason for adding this was that if you tried to use, say, start-dfs.sh or start-all.sh in Hadoop/Spark, they expect SSH to work without any prompts. If it works without the need for pre-approval, then it should be fine to remove it.
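For reference, the options in question are along these lines; this is a sketch of a typical spark-ec2 definition, not necessarily the exact one in setup.sh:

# StrictHostKeyChecking=no makes the first ssh (or rsync over ssh) accept and
# record the host key automatically instead of prompting for confirmation.
SSH_OPTS="-o StrictHostKeyChecking=no -o ConnectTimeout=5"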
Force-pushed from 0a9a082 to 3a7f4ba.
This reverts commit d9333af.
Force-pushed from 3a7f4ba to 658d88c.
Replace some bash-isms with pssh to neatly parallelize cluster operations.
Also, decrease questionably high sleep times.
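As an illustrative before/after of the kind of change this makes; the host file path, user, and command are placeholders rather than the PR's exact code:

# Before: one backgrounded ssh per node, then wait for all of them.
for node in $SLAVES; do
  ssh $SSH_OPTS "root@$node" echo -n &
done
wait

# After: fan the same command out across all nodes with a single pssh call.
echo "$SLAVES" | tr ' ' '\n' > /tmp/all_nodes
pssh -h /tmp/all_nodes -l root -O StrictHostKeyChecking=no -i "echo -n"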