Training of convolutional neural networks on the ISIC 2017 skin cancer challenge dataset, with integrated hyperparameter optimization and results visualization.

YPBConnectPlatform/ypb_machine_learning

Training and hyperparameter optimization for neural networks on the ISIC 2017 dataset

This repository trains neural networks on the ISIC 2017 skin cancer dataset using convolutional neural network (CNN)-based transfer learning in PyTorch, with hyperparameter optimization handled by the HpBandSter package.

https://aws.amazon.com/blogs/machine-learning/scalable-multi-node-training-with-tensorflow/ is a great resource for multi-node, multi-GPU training.

Running environment

These files have been tested:

  • On Amazon AWS p2.xlarge, p2.8xlarge, and p2.16xlarge instances.
  • Using the AMI ami-0a8511d6f2ba0cfc9, generated by me and available on request. This AMI contains all (toy) data, scripts, configuration, and conda environments needed to run both multi-node, multi-GPU training and single-node, multi-GPU training.
  • Using the conda environment "derm-ai" that comes preloaded on the AMI. Two key points: DALI 0.6.1 is used rather than DALI 0.7.0, because DALI 0.7.0 led to intolerably slow training for reasons I have not identified, and my Horovod training scripts are not yet optimized.

"Cold" Install -- applies to all nodes.

  • Start AWS p2.xlarge (or p2.8xlarge or p2.16xlarge) EC2 instances using the AMI mentioned above. Make sure that port 8888 is accessible in your security group (Jupyter notebooks use port 8888 by default).
  • Create a security group and apply it to all of the nodes. The key points are that SSH must be accessible from any IP and that all TCP ports must be open to every member of the security group.

"Cold" Install -- applies to leader node only.

  • Next, run the following:
cd ~/src/derm-ai
vim hosts
  • Add the line localhost slots=<# GPUs> to hosts, where <# GPUs> is the number of GPUs on the lead node. Then, for each additional instance you want to use, add a line of the form <Amazon private IP> slots=<# GPUs>

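  • Run the following on your local machine to copy the private key file you use to SSH into the lead instance over to that instance. The user name ubuntu matches the one used in the commands below; replace <lead instance address> with the address of your lead instance.
scp -i /path/to/yourkey.pem /path/to/yourkey.pem ubuntu@<lead instance address>:~/.ssh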

  • Then, on the lead instance, run the following to rename the private key to the default name SSH expects and copy it to all other instances. Before doing this, you may need to remove the private key already present on the instance (~/.ssh/id_rsa), as it's the key I used in my toy work.

mv ~/.ssh/yourkey.pem ~/.ssh/id_rsa
chmod 400 ~/.ssh/id_rsa
function runclust(){ while read -u 10 host; do host=${host%% slots*}; scp -i ~/.ssh/id_rsa ~/.ssh/id_rsa ubuntu@$host:~/.ssh && ssh $host chmod 400 ~/.ssh/id_rsa; done 10<$1; };
runclust hosts

You will see an "scp: permission denied" error on any host where the key is already present (like localhost, in this case), because the existing file is read-only. This is expected.

  • Then, run the following:
function runclust(){ while read -u 10 host; do host=${host%% slots*}; ssh -o "StrictHostKeyChecking no" "$host" "$2"; done 10<$1; };
runclust hosts "echo \"StrictHostKeyChecking no\" >> ~/.ssh/config"
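
As a complement to the manual `vim hosts` step above, here is a minimal Python sketch that produces and reads text in the `<host> slots=<# GPUs>` format described in this section. The helper names `format_hosts` and `parse_hosts` and the example private IPs are hypothetical and are not part of this repository; the parser assumes exactly one `slots=` entry per line.

```python
def format_hosts(lead_gpus, workers):
    """Return hosts-file text: localhost first, then one line per worker.

    `workers` is a list of (private_ip, gpu_count) pairs.
    """
    lines = [f"localhost slots={lead_gpus}"]
    lines += [f"{ip} slots={gpus}" for ip, gpus in workers]
    return "\n".join(lines) + "\n"


def parse_hosts(text):
    """Parse hosts-file text into a list of (host, slots) pairs."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        host, _, slot_part = line.partition(" ")
        key, _, value = slot_part.strip().partition("=")
        if key != "slots" or not value.isdigit():
            raise ValueError(f"malformed hosts line: {line!r}")
        entries.append((host, int(value)))
    return entries


if __name__ == "__main__":
    # Hypothetical cluster: a lead node plus two workers, one GPU each.
    text = format_hosts(1, [("10.0.0.11", 1), ("10.0.0.12", 1)])
    print(text, end="")
    print("total GPUs:", sum(slots for _, slots in parse_hosts(text)))
```

Reading the file back this way gives a quick check that every line is well formed before the key-distribution commands loop over it.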

Warm install.

  • Assuming that (1) every node has the image data, Python training scripts, private keys, conda environments, etc., (2) all of the above commands have been run, (3) the 'hosts' file in ~/src/derm-ai on the lead node is accurate, and (4) the nodes have just been started from a stopped state, run the following on the lead node to start training:
conda activate derm-ai
cd ~/src/derm-ai
./train.sh <# GPUs to use>
  • If you'd rather not use shell scripting, run the following on the lead node to launch the training Python script directly:
mpirun -np <total # GPUs> -hostfile ~/src/derm-ai/hosts -mca plm_rsh_no_tree_spawn 1 \
	-bind-to socket -map-by slot \
	-x HOROVOD_HIERARCHICAL_ALLREDUCE=1 -x HOROVOD_FUSION_THRESHOLD=16777216 \
	-x NCCL_MIN_NRINGS=4 -x LD_LIBRARY_PATH -x PATH -mca pml ob1 -mca btl ^openib \
	-x NCCL_SOCKET_IFNAME=ens3 -mca btl_tcp_if_exclude lo,docker0 \
	-x TF_CPP_MIN_LOG_LEVEL=0 \
	python -W ignore ~/src/derm-ai/DermAI_train_horovod.py
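
The `<total # GPUs>` value passed to `-np` can be at most the sum of the `slots` values in the hosts file. The following is a minimal sketch, assuming the hosts format described earlier, of how that sum could be computed and the same command assembled as an argument list in Python. The helpers `total_slots` and `build_mpirun_command` and the example hosts content are hypothetical and not part of this repository; the flags are copied from the command above, and nothing is executed.

```python
import os
import shlex


def total_slots(hosts_text):
    """Sum the slots=<n> values over all non-empty lines of a hosts file."""
    total = 0
    for line in hosts_text.splitlines():
        line = line.strip()
        if not line:
            continue
        _, _, value = line.partition("slots=")
        total += int(value)
    return total


def build_mpirun_command(num_procs, hostfile, script):
    """Assemble the mpirun call as an argument list (not executed here)."""
    return [
        "mpirun", "-np", str(num_procs), "-hostfile", hostfile,
        "-mca", "plm_rsh_no_tree_spawn", "1",
        "-bind-to", "socket", "-map-by", "slot",
        "-x", "HOROVOD_HIERARCHICAL_ALLREDUCE=1",
        "-x", "HOROVOD_FUSION_THRESHOLD=16777216",
        "-x", "NCCL_MIN_NRINGS=4", "-x", "LD_LIBRARY_PATH", "-x", "PATH",
        "-mca", "pml", "ob1", "-mca", "btl", "^openib",
        "-x", "NCCL_SOCKET_IFNAME=ens3",
        "-mca", "btl_tcp_if_exclude", "lo,docker0",
        "-x", "TF_CPP_MIN_LOG_LEVEL=0",
        "python", "-W", "ignore", script,
    ]


if __name__ == "__main__":
    # Example hosts content; replace with the text of ~/src/derm-ai/hosts.
    hosts_text = "localhost slots=8\n10.0.0.11 slots=8\n"
    cmd = build_mpirun_command(
        total_slots(hosts_text),
        os.path.expanduser("~/src/derm-ai/hosts"),
        os.path.expanduser("~/src/derm-ai/DermAI_train_horovod.py"),
    )
    # Print a shell-quoted version of the command for inspection.
    print(" ".join(shlex.quote(part) for part in cmd))
```

Keeping the command as a list makes it straightforward to pass to `subprocess.run` later without worrying about shell quoting of arguments such as `^openib`.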
