Running Intel PAT on Dataproc

jamesemery edited this page Aug 30, 2016 · 1 revision

Intel PAT is a cluster profiling tool that generates profiling information about the executors running a particular Hadoop job. Information about the tool, as well as a download link, can be found here: https://github.com/intel-hadoop/PAT/tree/master/PAT

To install PAT on a cluster, first make sure that the checkbox for "Allow API access to all Google Cloud services in the same project" is checked when you start up your cluster, as leaving it unchecked can cause issues. Next, you will need to install PAT onto the cluster, which requires you to first SSH into the master node. You can find the exact command in the Dataproc cluster interface: click on your cluster, go to the "VM Instances" tab, click the button next to SSH beside the master machine, and click "view gcloud command". The command has the following form:

$ gcloud compute --project "my-project-name" ssh --zone "my-project-zone" "my-cluster-m"

Before installing PAT you will need to install its dependencies on EACH machine in the cluster by running the following commands as root (for example, prefixed with sudo) while connected by SSH (NOTE: you can SSH into each machine in the cluster with the same command you used to get into the master node, changing the last argument to the name of your worker node):

$ apt-get install perf-tools-unstable

$ apt-get install gawk

$ apt-get install sysstat

Additionally, PAT requires that the kernel setting perf_event_paranoid is set to 0 on each machine in the cluster. To do this, first become root with sudo su, then run the following command on each machine:

$ echo 0 > /proc/sys/kernel/perf_event_paranoid
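Repeating the per-node steps above by hand gets tedious on larger clusters. The following is a minimal sketch that prints one gcloud invocation per node so you can review the commands before running them. The project, zone, and node names are placeholders, and the worker names assume the usual "cluster-w-N" pattern; check the "VM Instances" tab for the real names. It also uses sudo and tee in place of an interactive sudo su session, which is a substitution for convenience rather than the procedure described above.

```shell
#!/usr/bin/env bash
# Sketch: print the per-node preparation commands for review.
# PROJECT, ZONE, and NODES are placeholder values, not real resources.
PROJECT="my-project-name"
ZONE="my-project-zone"
NODES=("my-cluster-m" "my-cluster-w-0" "my-cluster-w-1")

# Steps every node needs: the three packages plus the perf setting.
# "echo 0 | sudo tee FILE" performs the write as root; a plain
# "sudo echo 0 > FILE" would not, because the redirect runs unprivileged.
REMOTE_CMD='sudo apt-get install -y perf-tools-unstable gawk sysstat && echo 0 | sudo tee /proc/sys/kernel/perf_event_paranoid'

# Print one "gcloud compute ssh ... --command ..." line per node.
print_prepare_commands() {
  local node
  for node in "${NODES[@]}"; do
    printf 'gcloud compute ssh "%s" --project "%s" --zone "%s" --command "%s"\n' \
      "$node" "$PROJECT" "$ZONE" "$REMOTE_CMD"
  done
}

print_prepare_commands
```

Once the printed lines look right for your cluster, they can be pasted into a terminal one at a time.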

Once all of these steps have been taken, log onto the master node once again. Run sudo su to become root, then use the following gcloud command to create an SSH key, which should automatically allow the master to connect to each of the worker nodes (NOTE: do this as whichever identity you intend to run PAT with):

$ gcloud compute ssh

Do not enter a passphrase for the SSH key, as a passphrase-protected key will not work with PAT. Once this is done, navigate to a suitable directory and clone the PAT repository linked above onto the master machine. In your PAT directory you should find a file named "./config.template"; rename it to "./config". In this file, set "ALL_NODES" to list every node in the cluster (including the master node) together with its SSH port, in the "machine:22" format. Next, set the "SSH_KEY:" line to the absolute path of your gcloud SSH key, which by default is located at ~/.ssh/google_compute_engine.
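Typing the node list by hand is error-prone, so here is a small sketch that produces the values for the two config lines described above. The node names are placeholders, and joining the entries with single spaces is an assumption; follow whatever separator config.template itself uses.

```shell
#!/usr/bin/env bash
# Sketch: build config values in the formats described above.
# Node names are placeholders; the space separator is an assumption.

# Print every node as "machine:port", joined by single spaces.
format_nodes() {
  local port="$1"; shift
  local out="" node
  for node in "$@"; do
    out+="${out:+ }${node}:${port}"
  done
  printf '%s\n' "$out"
}

# Print the default gcloud key location as an absolute path.
ssh_key_path() {
  printf '%s\n' "$HOME/.ssh/google_compute_engine"
}

format_nodes 22 my-cluster-m my-cluster-w-0 my-cluster-w-1
ssh_key_path
```

The first printed line is the value to place after "ALL_NODES" and the second is the value for "SSH_KEY:". Note that $HOME depends on which user runs the script, so run it as the identity that created the key.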

Once this has been done, run the command $ ./pat install. If everything worked, this should print a message that it is installing PAT on each machine. To start running jobs, edit the "CMD_PATH" line in the config file to point either to a script that launches the job or to a command to run (e.g. a gatk-launch command). After this is done you should be able to run jobs under PAT by running $ ./pat run. I recommend running a spark-submit job using yarn-client as master.
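As an illustration of the kind of launcher script "CMD_PATH" could point to, here is a minimal sketch that wraps a spark-submit job with yarn-client as master. The class name and jar path are hypothetical placeholders. By default the script only prints the command; set RUN=1 to execute it. Newer Spark releases spell the same mode as "--master yarn --deploy-mode client".

```shell
#!/usr/bin/env bash
# Sketch of a launcher script for CMD_PATH.
# APP_CLASS and APP_JAR are hypothetical placeholders for your own job.
APP_CLASS="com.example.MyJob"
APP_JAR="/path/to/my-job.jar"

# Print the spark-submit command line that would be run.
build_submit_command() {
  printf 'spark-submit --master yarn-client --class %s %s\n' "$APP_CLASS" "$APP_JAR"
}

# Dry run by default; set RUN=1 in the environment to submit the job.
if [ "${RUN:-0}" = "1" ]; then
  spark-submit --master yarn-client --class "$APP_CLASS" "$APP_JAR"
else
  build_submit_command
fi
```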

If everything worked, PAT will write a folder into the ./results directory, with subdirectories corresponding to each worker node. To read these results, transfer this folder onto a Windows machine with Microsoft Office and open the spreadsheet in the "instruments" directory. The spreadsheet contains a series of macros that can be activated by pressing Ctrl-q, which should construct the spreadsheet from the corresponding data.
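Since the results folder contains one subdirectory per node, it can be easier to move as a single file. The sketch below bundles a results directory into a compressed archive with tar. The archive name is arbitrary, and packaging the folder this way is a convenience rather than part of PAT; you will need a tool on the Windows side that can unpack .tar.gz files.

```shell
#!/usr/bin/env bash
# Sketch: bundle a PAT results directory into one archive for transfer.
# The archive name below is an arbitrary choice, not something PAT expects.

# package_results RESULTS_DIR ARCHIVE
# Stores paths relative to the parent of RESULTS_DIR, so the archive
# unpacks into a single top-level folder.
package_results() {
  local results_dir="$1" archive="$2"
  tar -czf "$archive" -C "$(dirname "$results_dir")" "$(basename "$results_dir")"
}

# Only package when run from a directory that has results.
if [ -d ./results ]; then
  package_results ./results pat-results.tar.gz
fi
```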