Training neural networks on the ISIC 2017 skin cancer dataset using convolutional neural network (CNN)-based transfer learning in PyTorch, with hyperparameter optimization via the HpBandSter library.
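The core transfer-learning pattern is sketched below. This is a minimal, illustrative example rather than the contents of the actual training script; the ResNet-50 backbone and three-class head are assumptions made for the sketch, and the HpBandSter search is not shown.

```python
# Minimal transfer-learning sketch (illustrative only; not the repo's actual script).
import torch
import torch.nn as nn
import torchvision.models as models

NUM_CLASSES = 3  # assumed 3-way formulation of the ISIC 2017 labels

# Start from an ImageNet-pretrained backbone.
model = models.resnet50(pretrained=True)

# Optionally freeze the convolutional layers and train only the new head.
for param in model.parameters():
    param.requires_grad = False

# Replace the final fully connected layer with a fresh classifier head.
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)

optimizer = torch.optim.SGD(model.fc.parameters(), lr=1e-3, momentum=0.9)
criterion = nn.CrossEntropyLoss()
```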
https://aws.amazon.com/blogs/machine-learning/scalable-multi-node-training-with-tensorflow/ is a great resource for multi-node, multi-GPU training.
These files have been tested:
- On AWS EC2 p2.xlarge, p2.8xlarge, and p2.16xlarge instances
- Using the following AMI: ami-0a8511d6f2ba0cfc9, generated by me and available on request. The AMI contains all of the (toy) data, scripts, configuration, conda environments, etc. needed to run both multi-node, multi-GPU and single-node, multi-GPU training.
- Using the conda environment "derm-ai" that comes preloaded on the AMI. Note that DALI 0.6.1 (not DALI 0.7.0) is used, because DALI 0.7.0 led to intolerably slow training for some reason. Also note that my Horovod training scripts are not yet optimized.
- Start AWS p2.xlarge (or p2.8xlarge or p2.16xlarge) EC2 instances from the AMI mentioned above. Make sure that port 8888 is accessible in your security group (Jupyter notebooks use port 8888 by default).
- You need to create a security group that will be replicated across all of the nodes. The key points are that SSH must be accessible from any IP and that all TCP ports must be open to the entire security group, so the nodes can reach one another.
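If you prefer to script this step rather than use the AWS console, a boto3 sketch of the same rules is below; the region, group name, and use of the default VPC are assumptions made for the example.

```python
# Hypothetical boto3 sketch of the security group rules described above.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # assumed region

group_id = ec2.create_security_group(
    GroupName="derm-ai-cluster",  # assumed name; created in the default VPC
    Description="Horovod cluster: SSH from anywhere, all TCP within the group",
)["GroupId"]

ec2.authorize_security_group_ingress(
    GroupId=group_id,
    IpPermissions=[
        # SSH reachable from any IP
        {"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
         "IpRanges": [{"CidrIp": "0.0.0.0/0"}]},
        # All TCP ports open to members of this same security group
        {"IpProtocol": "tcp", "FromPort": 0, "ToPort": 65535,
         "UserIdGroupPairs": [{"GroupId": group_id}]},
    ],
)
# You may also want to open port 8888 for Jupyter, per the note above.
```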
- Next, on the lead node, run the following to open the hosts file for editing:
cd ~/src/derm-ai
vim hosts
- Add the following line to hosts:
localhost slots=<# GPUs>
where <# GPUs> is the number of GPUs on the lead node. Then, for each additional instance you want to use, add a line to hosts of the form:
<Amazon private IP> slots=<# GPUs>
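For example, a three-node cluster of p2.8xlarge instances (8 GPUs each) might use a hosts file like the following; the private IPs here are placeholders:
localhost slots=8
172.31.10.11 slots=8
172.31.10.12 slots=8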
- Run the following on your local machine to copy over the private key file you use to SSH into the lead instance:
scp -i /path/to/yourkey.pem /path/to/yourkey.pem ubuntu@<lead node public IP or DNS>:~/.ssh
- Then, on the lead instance, run the following to put the private key in the appropriate format and copy it to all other instances. Note that before doing this, you may need to remove the existing private key (~/.ssh/id_rsa) from the instance, as it's the key I used in my toy work.
mv ~/.ssh/yourkey.pem ~/.ssh/id_rsa
chmod 400 ~/.ssh/id_rsa
# runclust reads each line of the hosts file (via file descriptor 10), strips the " slots=N" suffix,
# copies the SSH key to that host, and locks down its permissions there.
function runclust(){ while read -u 10 host; do host=${host%% slots*}; scp -i ~/.ssh/id_rsa ~/.ssh/id_rsa ubuntu@$host:~/.ssh && ssh $host chmod 400 ~/.ssh/id_rsa; done 10<$1; };
runclust hosts
You will see an "scp: permission denied" error on any host where the key is already present (such as localhost in this case); this is expected.
- Then, run the following. This redefines runclust to run an arbitrary command on every host, and uses it to disable SSH host-key prompts across the cluster:
# runclust now takes a hosts file and a command string, and runs that command on each host.
function runclust(){ while read -u 10 host; do host=${host%% slots*}; ssh -o "StrictHostKeyChecking no" $host "$2"; done 10<$1; };
runclust hosts "echo \"StrictHostKeyChecking no\" >> ~/.ssh/config"
- Assuming (1) every node has the image data, Python training scripts, private keys, conda environments, etc., (2) all of the above code has been run, (3) the lead node has an accurate 'hosts' file in ~/src/derm-ai, and (4) the nodes have just been started up from a stopped state, run the following on the lead node to get everything fired up:
conda activate derm-ai
cd ~/src/derm-ai
./train.sh <# GPUs to use>
- Alternatively, if you'd rather not use the shell script, run the following on the lead node to launch the training Python script directly:
mpirun -np <total # GPUs> -hostfile ~/src/derm-ai/hosts -mca plm_rsh_no_tree_spawn 1 \
-bind-to socket -map-by slot \
-x HOROVOD_HIERARCHICAL_ALLREDUCE=1 -x HOROVOD_FUSION_THRESHOLD=16777216 \
-x NCCL_MIN_NRINGS=4 -x LD_LIBRARY_PATH -x PATH -mca pml ob1 -mca btl ^openib \
-x NCCL_SOCKET_IFNAME=ens3 -mca btl_tcp_if_exclude lo,docker0 \
-x TF_CPP_MIN_LOG_LEVEL=0 \
python -W ignore ~/src/derm-ai/DermAI_train_horovod.py
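For reference, a script launched this way typically follows the standard Horovod PyTorch pattern sketched below. This is a generic skeleton, not the actual contents of DermAI_train_horovod.py; the backbone, class count, and hyperparameters are placeholders.

```python
# Generic Horovod + PyTorch skeleton (illustrative only; not the repo's actual script).
import torch
import torch.nn as nn
import torchvision.models as models
import horovod.torch as hvd

hvd.init()                               # one process per GPU, launched by mpirun
torch.cuda.set_device(hvd.local_rank())  # pin each process to its own GPU

model = models.resnet50(pretrained=True).cuda()
model.fc = nn.Linear(model.fc.in_features, 3).cuda()  # assumed 3-class head

# Scale the learning rate by the number of workers, as is conventional for Horovod.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3 * hvd.size(), momentum=0.9)

# Wrap the optimizer so gradients are averaged across all workers each step.
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())

# Make sure every worker starts from identical weights and optimizer state.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

# train_dataset would hold the ISIC 2017 images; a DistributedSampler gives each
# worker its own shard, e.g.:
# sampler = torch.utils.data.distributed.DistributedSampler(
#     train_dataset, num_replicas=hvd.size(), rank=hvd.rank())
# loader = torch.utils.data.DataLoader(train_dataset, batch_size=64, sampler=sampler)
```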