
AWS Elastic MapReduce (EMR) Cluster Setup


Elastic MapReduce (EMR) is a large-scale distributed computing product of Amazon Web Services (AWS), built on open-source technologies such as Apache Hadoop and Apache Spark.

Launching a cluster of servers with your favorite software pre-installed involves a significant amount of suffering if you attempt it unaided. Fortunately, your devoted Chicago Booth Analytics co-chairs have gone through hellfire to get this done for you, and you can now benefit from the fruits of our labor.

Follow the instructions below to launch your AWS EMR cluster.

Install Supporting Software

  1. Install Git and related version-control software;

  2. Git-clone the Chicago Booth Analytics Software GitHub repo to a local folder on your machine;

  3. Install the Anaconda Python 2.7 distribution;

  4. Install the AWS Command-Line Interface;

  5. Install the dos2unix line-ending conversion utility.
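Once everything is installed, you can sanity-check the tools from a terminal. This is a small sketch assuming default installs; exact version numbers will vary, these commands just confirm each tool is on your PATH:

    git --version        # Git
    conda --version      # ships with the Anaconda distribution
    python --version     # should report Python 2.7.x
    aws --version        # AWS Command-Line Interface
    dos2unix --version   # line-ending conversion utility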

Register AWS Account & Set Up Supporting Infrastructure

Perform the following steps one time only:

  1. Register an AWS account: if you have shopped with Amazon before (if not yet, have you been away from Earth?), you can most likely use the same account credentials to register with AWS;

  2. Create AWS Access Keys: go to https://console.aws.amazon.com/iam/home#security_credential to create a new Access Key, which is a pair of "Access Key ID" and "Secret Access Key" strings; you can download a text file containing this pair for safekeeping;

  3. Configure your AWS command-line access with your access keys: after you have installed the AWS Command-Line Interface mentioned above, open a new command-line terminal and run the command aws configure; among other things, you'll be asked to specify the following (a sample session is sketched after this list):

    • Access Key ID: use the relevant field from the Access Key you've set up
    • Secret Access Key: use the relevant field from the Access Key you've set up
    • Region: enter us-west-1
    • Default Output Format: choose json
  4. Set up an S3 storage bucket: go to the AWS Simple Storage Service (AWS S3) console and create a new storage "bucket" in the Northern California region (a command-line alternative is sketched after this list);

    • after your S3 bucket is created, enter it and create a folder named zzzLogs
  5. Set up an EC2 key pair named "keypair": go to the Northern California region for the Elastic Compute Cloud (EC2) service and create a new security key pair named "keypair". After the key pair is created, AWS will ask you to download a file keypair.pem; download it and keep it in a safe folder of your choice, then copy it into the folder <path to your cloned Software folder>/AWS/EMR on your machine (a command-line alternative is sketched after this list);

  6. Create "Default Access Roles" for AWS EMR:

    • Open a shell-script command-line terminal
      • Mac: the default terminal;
      • Windows: please use the Git Bash terminal that ships with Git – don't use the Windows terminal;
    • Run command: aws emr create-default-roles
  7. Create an Inbound Rule to allow accessing the AWS EMR master server through the command line: in the EC2 console for the Northern California region, add an inbound rule allowing SSH (TCP port 22) to the security group used by the EMR master server (a command-line sketch follows this list).
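The command-line sketches below correspond to steps 3, 4, 5 and 7 above. They are illustrative only: the bucket name (my-s3-bucket-in-cali), the key values, and the security-group name are placeholders or assumptions, so substitute your own values.

A typical aws configure session (step 3) looks roughly like this:

    $ aws configure
    AWS Access Key ID [None]: AKIAXXXXXXXXXXXXXXXX
    AWS Secret Access Key [None]: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
    Default region name [None]: us-west-1
    Default output format [None]: json

The bucket and the zzzLogs folder of step 4 can also be created with the AWS CLI instead of the S3 console (us-west-1 is the Northern California region):

    aws s3 mb s3://my-s3-bucket-in-cali --region us-west-1
    # S3 has no real folders; an empty object whose key ends in "/" shows up as one
    aws s3api put-object --bucket my-s3-bucket-in-cali --key zzzLogs/

Similarly, step 5's key pair can be created from the command line rather than the EC2 console:

    aws ec2 create-key-pair --key-name keypair --region us-west-1 \
        --query 'KeyMaterial' --output text > keypair.pem
    chmod 400 keypair.pem   # restrict permissions so ssh will accept the key file

For step 7, the inbound rule goes on the security group attached to the master server. The group name ElasticMapReduce-master below is the usual EMR default but is an assumption here – check the "Security groups for Master" field in your EMR console (the group only exists after EMR has launched a cluster in the region at least once). Opening port 22 to 0.0.0.0/0 allows SSH from anywhere; restrict the CIDR to your own IP address if you prefer:

    aws ec2 authorize-security-group-ingress --group-name ElasticMapReduce-master \
        --protocol tcp --port 22 --cidr 0.0.0.0/0 --region us-west-1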

Bid for your AWS EMR Cluster

You are now ready to bid for (i.e., request to rent) an AWS EMR cluster.

  • Open a shell-script command-line terminal

    • Mac: the default terminal;
    • Windows: please use the Git Bash terminal that ships with Git – don't use the Windows terminal;
  • Navigate to the folder <path to your cloned Software folder>/AWS/EMR on your machine;

  • Run the following command: sh create.sh -b <Your-S3-Bucket-Name> -m <Master-Server-Type> -p <Hourly-Bid-Price-For-Master-Server> -n <Number-Of-Worker-Servers> -t <Worker-Server-Type> -q <Hourly-Bid-Price-For-Each-Worker-Server> -r "<any remarks>"

    • Example 1: to launch a typical cluster of 1 Master + 2 Workers at price $0.050/server/hour, we can run: sh create.sh -b <my-s3-bucket-in-cali> -m m3.xlarge -p 0.050 -n 2 -t m3.xlarge -q 0.050 -r "a normal cluster with M3.xLarge servers"
    • Example 2: to launch a specialized cluster of 1 Master of type G2.2xLarge (one with an expensive Graphics Processing Unit) at price $0.100/hour + 6 normal Workers at price $0.050/server/hour, we can run: sh create.sh -b <my-s3-bucket-in-cali> -m g2.2xlarge -p 0.100 -n 6 -t m3.xlarge -q 0.050 -r "a large cluster with GPU-powered master server"
  • You can then go to the Northern California AWS EMR management console to check the status of your cluster (a command-line status check is sketched after this list);

  • Note that you have to bid sufficiently high prices for AWS to rent you the servers. If the management console shows your cluster status as "Provisioning" for more than 12 minutes, your bid prices are likely too low. Terminate that cluster request and use the command line to request a new cluster at higher prices (the sketch after this list shows how to check current spot prices);

    • based on our experience, in the Northern California AWS region, we can get M3.xLarge servers for $0.031 – $0.052 / server / hour, and G2.2xLarge servers for $0.091 – $0.130 / server / hour.
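If you want to gauge a sensible bid before launching, the AWS CLI can report recent spot prices, and it can also list your clusters and their current states. This is a small sketch; the instance type shown is just the one used in the examples above:

    # recent spot prices for M3.xLarge Linux servers in us-west-1
    aws ec2 describe-spot-price-history --instance-types m3.xlarge \
        --product-descriptions "Linux/UNIX" --max-items 5 --region us-west-1

    # list clusters that are currently starting, bootstrapping, running or waiting
    aws emr list-clusters --active --region us-west-1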

Connect to, and use, the AWS EMR cluster

When the AWS EMR cluster has passed the "Provisioning" and "Bootstrapping" states and entered the "Running" or the subsequent "Waiting" state – it usually takes at least 30 minutes to reach either of these two states – you can connect to the cluster and do your Python programming through a Jupyter notebook interface.
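You can also poll the cluster state from the command line; the j-... cluster ID below is a placeholder – yours is shown in the EMR console and by aws emr list-clusters:

    aws emr describe-cluster --cluster-id j-XXXXXXXXXXXXX \
        --query 'Cluster.Status.State' --region us-west-1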

In the AWS EMR console, copy the "Master public DNS" (e.g., "ec2-54-183-23-201.us-west-1.compute.amazonaws.com"), and:

  • run the following command in the same <path to your cloned Software folder>/AWS/EMR folder through either the Mac command-line terminal, or the Git Bash command-line terminal on Windows: sh connect -d <Master-Public-DNS>
  • when asked "Are you sure you want to continue connecting (yes/no)?", accept by entering "yes"
  • open a web browser and go to the address localhost:8133; this will allow you to program through a Jupyter notebook environment on the AWS EMR cluster (the sketch after this list shows roughly what the connection sets up);
  • when you are done – and have downloaded your script files & notebooks to your local computer for safekeeping! – go to the AWS EMR management console to terminate the cluster.
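We have not reproduced the contents of the connect script here, but the reason localhost:8133 works is almost certainly an SSH tunnel from your machine to the master server. The sketch below shows the generic shape of such a tunnel; the hadoop user and the remote notebook port (8888) are assumptions, and the actual script may differ:

    # forward local port 8133 to the (assumed) Jupyter port on the master server
    ssh -i keypair.pem -N -L 8133:localhost:8888 hadoop@<Master-Public-DNS>

Terminating from the console works as described above; if you prefer the command line, the equivalent is as follows (the j-... cluster ID is a placeholder – get yours from aws emr list-clusters):

    aws emr terminate-clusters --cluster-ids j-XXXXXXXXXXXXX --region us-west-1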