Include Python 3 in pre-baked AMI #74
Comments
I'm not really familiar with what needs to be done to make Spark use Python 3. cc @nchammas, who might know more.
Getting Spark to use Python 3 is generally a simple matter of setting Unfortunately, at this point I think it's on you to use something like We've discussed updating the spark-ec2 AMIs for some time now, but it's a lot of manual work, and previous efforts to automate the process fizzled out. It's one of the reasons why I made Flintrock. Flintrock doesn't depend on any custom AMIs, and it offers a So looking at the big picture, I think it would be beneficial to all if spark-ec2 did some combination of the following:
Thanks @nchammas - The third option of adding a The question of AMIs: I have spent some time thinking about it, and it looks like there are two issues we will hit in the long run: (a) the manual labor involved in creating new AMIs and (b) the cost overhead of storing those AMIs, especially as we have one for each region, HVM, PVM, etc. While automation can solve part of the problem, it would still be good to have zero effort if possible, so your second point of decoupling spark-ec2 from custom AMIs sounds better for that. I guess the main concern then is how long it takes to install all the necessary tools. @nchammas Have you found this not to be a significant overhead in Flintrock?
Flintrock clusters have fewer out-of-the-box tools compared to spark-ec2. We don't install Ganglia, for example. So the launch burden is lower, and it frees Flintrock to use the generic Amazon Linux AMIs. Currently, Flintrock defaults to installing Java 8 (if not already detected on the hosts) and Spark. You can flip a switch and have Flintrock also install HDFS. That's pretty much it. The overhead of installing these 3 things at launch time isn't large, especially if you configure HDFS to download from S3 and not the Apache mirrors, which can be very slow. (Spark by default downloads from S3.) Flintrock can generally launch 100+ node clusters in under 5 minutes.
I see. The thing I was looking at is the old Packer scripts that are part of the issue you linked above [1,2]. Those certainly have a large number of install steps, and I guess we'll need to benchmark them to figure out how long they take. I wonder if we can drop built-in support for a bunch of them and make them optional via the run-command (i.e. we have a python-setup.sh in the spark-ec2 tree, and somebody can add that to their command line if they want to use Python, etc.) [1] https://github.com/nchammas/spark-ec2/blob/packer/image-build/tools-setup.sh#L1
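To make the opt-in idea above a bit more concrete, here is a minimal Python sketch of how a launcher might turn a list of requested optional tools into the setup scripts to run. Everything in it is an assumption for illustration: only python-setup.sh is named in this thread, the `ganglia-setup.sh` entry and the `select_setup_scripts` helper are hypothetical, and none of this is existing spark-ec2 code.

```python
# Hypothetical registry of optional setup scripts. Only "python-setup.sh"
# is mentioned in the thread; the other entry is a made-up placeholder.
OPTIONAL_SETUP_SCRIPTS = {
    "python": "python-setup.sh",
    "ganglia": "ganglia-setup.sh",  # hypothetical name
}


def select_setup_scripts(requested, available=None):
    """Return the setup scripts for the requested optional tools.

    Order follows the request, duplicates are dropped, and unknown
    tool names raise ValueError instead of being silently ignored.
    """
    if available is None:
        available = OPTIONAL_SETUP_SCRIPTS
    unknown = [name for name in requested if name not in available]
    if unknown:
        raise ValueError("unknown optional tools: " + ", ".join(unknown))
    scripts = []
    for name in requested:
        script = available[name]
        if script not in scripts:
            scripts.append(script)
    return scripts


if __name__ == "__main__":
    # Nothing optional requested: nothing extra is installed at launch.
    print(select_setup_scripts([]))
    # A user who wants Python opts in explicitly.
    print(select_setup_scripts(["python"]))
```

With a scheme like this, a default launch pays no install cost for tools nobody asked for, which is the trade-off being weighed against pre-baking everything into the AMI.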
That sounds like a reasonable approach to me.
This would be very, very nice in 2016/2017. Or at least provide some instructions in the README on how to do so.
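As a rough idea of what such instructions might cover on the Spark side, the sketch below builds an environment with `PYSPARK_PYTHON` set, which is the environment variable Spark's documentation describes for choosing the Python executable PySpark uses. The `pyspark_env` helper is a hypothetical name, and the sketch assumes a `python3` interpreter is already installed on every node; installing it on the cluster is the part this thread is about and is not shown here.

```python
import os
import shutil


def pyspark_env(python_executable="python3", base_env=None):
    """Return a copy of the environment with PYSPARK_PYTHON set.

    base_env defaults to the current process environment. The result can
    be passed as env= to subprocess when launching pyspark or spark-submit.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["PYSPARK_PYTHON"] = python_executable
    return env


if __name__ == "__main__":
    env = pyspark_env()
    print("PYSPARK_PYTHON =", env["PYSPARK_PYTHON"])
    # Check that the interpreter exists on this machine; on a cluster the
    # same interpreter has to be present on every node.
    found = shutil.which(env["PYSPARK_PYTHON"])
    print("found at:", found if found else "not on PATH")
```

Setting the variable only tells Spark which interpreter to use; it does not install one, so it complements rather than replaces an AMI update or a launch-time install step.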