AWS Elastic MapReduce (EMR) Cluster Setup
Elastic MapReduce (EMR) is a large-scale distributed computing product of Amazon Web Services (AWS), built on open-source technologies such as Apache Hadoop and Apache Spark.
If unaided, you'll find launching a cluster of servers with your favorite software installed involves a significant amount of suffering. Fortunately, your devoted Chicago Booth Analytics co-chairs have gone through hellfire to get this done for you, and you can now benefit from the fruits of our labor.
Follow the below instructions to launch your AWS EMR cluster.
- Git-clone Chicago Booth Analytics's Software GitHub repo down to a local folder on your machine;
- Install the Anaconda Python v2.7 distribution;
- Install the AWS Command-Line Interface;
- Install the dos2unix line-ending-editing utility.
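Before moving on, it can help to confirm that the tools above are actually reachable from your terminal. The following is a minimal sketch and not part of the Software repo: the helper name `missing_tools` is hypothetical, and it assumes the four installs expose the commands `git`, `python`, `aws` and `dos2unix` on your PATH.

```shell
#!/bin/sh
# Hypothetical helper (not part of the Software repo): print every
# command name given as an argument that cannot be found on the PATH.
missing_tools() {
    for tool in "$@"; do
        # `command -v` exits non-zero when the command is not found
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
        fi
    done
}

# Assumed command names for the four prerequisites listed above
missing=$(missing_tools git python aws dos2unix)
if [ -n "$missing" ]; then
    echo "Not found on PATH:" $missing
else
    echo "All prerequisite tools found."
fi
```

On Windows, run this from the Git Bash terminal so that the check sees the same PATH the later steps will use.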
Perform the following steps one time only:
- Register an AWS account: if you have shopped with Amazon before (if not yet, have you been away from Earth?), you can likely use the same account credentials to register with AWS;
- Create AWS Access Keys: go here [https://console.aws.amazon.com/iam/home#security_credential] to create a new Access Key, which is a pair of "Access Key ID" and "Secret Access Key" strings; you can download a text file containing the pair for safekeeping;
- Configure your AWS command-line access with your access keys: after you have installed the AWS Command-Line Interface mentioned above, open a new command-line terminal and run the command aws configure; among other things, you'll be asked to specify the following:
  - Access Key ID: use the relevant field from the Access Key you've set up
  - Secret Access Key: use the relevant field from the Access Key you've set up
  - Region: enter us-west-1
  - Default Output Format: choose json
- Set up an S3 storage bucket: go to the AWS Simple Storage Service (AWS S3) console and create a new storage "bucket" in the Northern California region;
  - after your S3 bucket is created, enter it and create a folder named zzzLogs
- Set up an EC2 key pair named "keypair": go to the Northern California region for the Elastic Compute Cloud (EC2) service and create a new security key pair named "keypair"; after the key pair is created, you'll be asked by AWS to download a file keypair.pem. Download it and keep it in a safe folder of your choice. Then copy it into the folder <path to your cloned Software folder>/AWS/EMR on your machine;
- Create "Default Access Roles" for AWS EMR:
  - Open a shell-script command-line terminal
    - Mac: the default terminal;
    - Windows: please use the Git Bash terminal that ships with Git, not the Windows terminal;
  - Run command: aws emr create-default-roles
- Create an Inbound Rule to allow access to the AWS EMR master server through the command line:
  - Visit the Northern California region for the Elastic Compute Cloud (EC2) service > Security Groups
  - Find and click on the security group ElasticMapReduce-master
  - In the Inbound tab underneath, click on Edit
  - Add the following inbound security rule, or make sure one such rule already exists:
    - Type: SSH
    - Protocol: TCP
    - Port Range: 22
    - Source: Anywhere (0.0.0.0/0)
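Several of the console steps above also have command-line equivalents. The sketch below is one possible way to script them, under stated assumptions: the bucket name is a placeholder, the `run` and `one_time_setup` helpers are hypothetical, and the `--group-name` form of the security-group command assumes the ElasticMapReduce-master group lives in your default VPC. Nothing is executed by default; each command is only printed so you can review it first. The key pair is left to the console so that the downloaded keypair.pem is handled by hand.

```shell
#!/bin/sh
# Sketch only: CLI equivalents of some of the one-time console steps.
# Dry run by default: commands are printed, not executed.
# Set DRY_RUN=0 to actually run them against your AWS account.
DRY_RUN="${DRY_RUN:-1}"
REGION="us-west-1"                        # Northern California
BUCKET="${BUCKET:-my-s3-bucket-in-cali}"  # placeholder; pick your own unique name

# Print the command in dry-run mode, otherwise execute it
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

one_time_setup() {
    # Create the bucket and an empty "zzzLogs/" folder object inside it
    run aws s3 mb "s3://$BUCKET" --region "$REGION"
    run aws s3api put-object --bucket "$BUCKET" --key "zzzLogs/"
    # Create the default EMR access roles (same command as in the list above)
    run aws emr create-default-roles
    # Allow inbound SSH (TCP port 22) from anywhere to the master security group
    run aws ec2 authorize-security-group-ingress \
        --group-name ElasticMapReduce-master \
        --protocol tcp --port 22 --cidr 0.0.0.0/0 --region "$REGION"
}

one_time_setup
```

If the security-group rule already exists, the last command is expected to report a duplicate-rule error, which matches the "make sure one such rule already exists" case above.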
You are now ready to bid for (i.e., request to rent) an AWS EMR cluster.
- Open a shell-script command-line terminal
  - Mac: the default terminal;
  - Windows: please use the Git Bash terminal that ships with Git, not the Windows terminal;
- Navigate to the folder <path to your cloned Software folder>/AWS/EMR on your machine;
- Run the following command:
  sh create.sh -b <Your-S3-Bucket-Name> -m <Master-Server-Type> -p <Hourly-Bid-Price-For-Master-Server> -n <Number-Of-Worker-Servers> -t <Worker-Server-Type> -q <Hourly-Bid-Price-For-Each-Worker-Server> -r "<any remarks>"
  - Example 1: to launch a typical cluster of 1 Master + 2 Workers at price $0.050/server/hour, we can run:
    sh create.sh -b <my-s3-bucket-in-cali> -m m3.xlarge -p 0.050 -n 2 -t m3.xlarge -q 0.050 -r "a normal cluster with M3.xLarge servers"
  - Example 2: to launch a specialized cluster of 1 Master of type G2.2xLarge (one with an expensive Graphics Processing Unit) at price $0.100/hour + 6 normal Workers at price $0.050/server/hour, we can run:
    sh create.sh -b <my-s3-bucket-in-cali> -m g2.2xlarge -p 0.100 -n 6 -t m3.xlarge -q 0.050 -r "a large cluster with GPU-powered master server"
- you can then go to the Northern California AWS EMR management console to check the status of your cluster;
- note that you have to bid sufficiently high prices to get AWS to rent you the servers. If the management console shows your cluster status as "Provisioning" for more than 12 minutes, it is likely that your bid prices are too low. Terminate that cluster request and use the command line to request a new cluster at higher prices;
  - based on our experience, in the Northern California AWS region, we can get M3.xLarge servers for $0.031 – $0.052 / server / hour, and G2.2xLarge servers for $0.091 – $0.130 / server / hour.
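When choosing bid prices, it can be useful to see what a cluster request adds up to per hour. The helper below is a hypothetical sketch (it is not part of create.sh): it simply computes master bid + number of workers × worker bid from the same values you would pass to -p, -n and -q. It assumes the price actually charged does not exceed your bid, and it does not include any other AWS charges.

```shell
#!/bin/sh
# Hypothetical helper: upper bound on the hourly bid total of a cluster,
# i.e. master bid + (number of workers x worker bid).
# Arguments: <master-bid> <number-of-workers> <worker-bid>
max_hourly_bid() {
    awk -v p="$1" -v n="$2" -v q="$3" 'BEGIN { printf "%.3f\n", p + n * q }'
}

max_hourly_bid 0.050 2 0.050   # Example 1: prints 0.150
max_hourly_bid 0.100 6 0.050   # Example 2: prints 0.400
```

So the two example clusters above would cost at most about $0.15 and $0.40 per hour in bids, respectively.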
When the AWS EMR cluster has passed the "Provisioning" and "Bootstrapping" states and entered the "Running" or the subsequent "Waiting" state (it usually takes at least 30 minutes to enter either of these 2 states), you can connect to the cluster and perform Python programming through a Jupyter notebook interface.
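Instead of refreshing the console, the cluster state can also be polled from the command line. The following is a minimal sketch under stated assumptions: the helper names `classify_state` and `wait_for_cluster` are hypothetical, the cluster ID is whatever the EMR console shows for your cluster (of the form j-...), and the upper-case state names are the ones the EMR API reports. As far as we can tell, the console's "Provisioning" phase corresponds to the API state STARTING.

```shell
#!/bin/sh
# Map an EMR API cluster state to one of: ready, pending, ended
classify_state() {
    case "$1" in
        RUNNING|WAITING)        echo ready ;;    # safe to connect
        STARTING|BOOTSTRAPPING) echo pending ;;  # still coming up
        *)                      echo ended ;;    # terminating, terminated, or unknown
    esac
}

# Poll once a minute until the cluster is ready (returns 0) or has ended (returns 1).
# $1 is the cluster ID shown in the EMR console.
wait_for_cluster() {
    while true; do
        state=$(aws emr describe-cluster --cluster-id "$1" --region us-west-1 \
            --query 'Cluster.Status.State' --output text) || return 1
        echo "Cluster $1 is $state"
        case "$(classify_state "$state")" in
            ready) return 0 ;;
            ended) return 1 ;;
        esac
        sleep 60
    done
}

# Usage (needs AWS credentials, so it is left commented out here):
# wait_for_cluster j-XXXXXXXXXXXXX
classify_state BOOTSTRAPPING   # prints pending
```

If the loop reports STARTING for much longer than the 12 minutes mentioned above, that is the cue to terminate the request and bid higher.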
In the AWS EMR console, copy the "Master public DNS" (e.g., "ec2-54-183-23-201.us-west-1.compute.amazonaws.com"), and:
- run the following command in the same <path to your cloned Software folder>/AWS/EMR folder through either the Mac command-line terminal, or the Git Bash command-line terminal on Windows: sh connect -d <Master-Public-DNS>
- when asked "Are you sure you want to continue connecting (yes/no)?", accept by entering "yes"
- open a web browser and go to address: localhost:8133; this will allow you to program through a Jupyter notebook environment on the AWS EMR cluster;
- when you are done (and have downloaded your script files & notebooks to your local computer for keeping!), go to the AWS EMR management console to terminate the cluster.
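Termination can also be done from the command line, which makes it harder to forget a running cluster. This is a sketch rather than part of the Software repo: the helper names are hypothetical, the cluster ID below is a placeholder, and the commands are only printed unless you set DRY_RUN=0.

```shell
#!/bin/sh
# Sketch: find and terminate a cluster from the command line.
# Dry run by default: commands are printed, not executed.
DRY_RUN="${DRY_RUN:-1}"
REGION="us-west-1"   # Northern California

# Print the command in dry-run mode, otherwise execute it
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# List clusters that have not been terminated yet
list_active_clusters() {
    run aws emr list-clusters --active --region "$REGION"
}

# Terminate one cluster; $1 is its ID as shown in the EMR console (j-...)
terminate_cluster() {
    run aws emr terminate-clusters --cluster-ids "$1" --region "$REGION"
}

list_active_clusters
terminate_cluster "j-EXAMPLEID"   # placeholder ID, not a real cluster
```

Running the listing command again afterwards is a quick way to confirm that nothing is still being billed.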