You can install all required libraries using the following command:
conda env create -f environment.yml
Data is available at the 100 Bird Species Dataset.
The scripts train_singlenode.sh and train_singlenode.py demonstrate running multiple training tasks in parallel with distinct parameters. Each task operates independently on a separate GPU. Execute the example using the following command:
sbatch train_singlenode.sh
Details for each srun experiment can be found at Run 1 and Run 2. The expected training time for 10 epochs is approximately 30 minutes on a single Nvidia V100s GPU.
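As a reference, here is a minimal sketch of how each parallel task could pick its own hyperparameters and GPU. The use of SLURM_PROCID and the example learning rates are illustrative assumptions, not code taken from train_singlenode.py:

```python
import os
import torch

# Each srun task receives its own SLURM_PROCID (0, 1, ...).
task_id = int(os.environ.get("SLURM_PROCID", "0"))

# Illustrative per-task hyperparameters; the real script may define these differently.
learning_rates = [1e-3, 1e-4]
lr = learning_rates[task_id % len(learning_rates)]

# Pick a GPU for this task. Depending on how the sbatch script binds GPUs
# (e.g. --gpus-per-task), each task may instead only see "cuda:0".
device = torch.device(f"cuda:{task_id % max(torch.cuda.device_count(), 1)}")

print(f"Task {task_id}: training with lr={lr} on {device}")
# ... build the model, move it to `device`, and run a normal training loop ...
```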
Data Parallelization on a Single Node using PyTorch DataParallel (constrained)
The scripts train_singlenode_dp.sh and train_singlenode_dp.py illustrate data parallelization using the DataParallel approach. PyTorch's DataParallel replicates the model across the available GPUs of a single machine and splits each input batch into chunks that are processed in parallel, one chunk per GPU. Gradients are computed independently on each GPU and then accumulated on the primary GPU, which performs the parameter update and re-replicates the model for the next iteration. Run the example using the following command:
sbatch train_singlenode_dp.sh
Details are available at Run. The expected training time for 10 epochs is approximately 20 minutes on 2 Nvidia V100s GPUs.
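Below is a rough sketch of the DataParallel pattern described above; the ResNet-18 model and the hyperparameters are placeholders and not the contents of train_singlenode_dp.py:

```python
import torch
import torch.nn as nn
from torchvision import models

# Placeholder model; the actual script trains a classifier on the bird-species dataset.
model = models.resnet18(num_classes=100)

# DataParallel replicates the model across all visible GPUs and splits each
# input batch among them during the forward pass.
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)
model = model.cuda()

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# Training step: the batch is scattered across the GPUs, per-GPU outputs are
# gathered on the primary GPU, and the loss/backward pass runs from there.
# for images, labels in train_loader:
#     optimizer.zero_grad()
#     loss = criterion(model(images.cuda()), labels.cuda())
#     loss.backward()
#     optimizer.step()
```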
Data Parallelization on Multiple Nodes using PyTorch DistributedDataParallel
The scripts train_multinode_ddp.sh and train_multinode_ddp.py illustrate data parallelization using the DistributedDataParallel approach. DistributedDataParallel enables efficient parallelization across multiple nodes: each process holds a model replica and works on its own shard of the data, while collective communication operations keep gradients and model parameters synchronized. Because it runs one process per GPU and overlaps gradient communication with the backward pass, it incurs less overhead than DataParallel and scales better to larger models and datasets. Execute the following command to run the example on multiple nodes:
sbatch train_multinode_ddp.sh
Refer to the experiment's details at Run. The expected training time is approximately 10 minutes on 4 Nvidia V100s GPUs.
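For orientation, here is a minimal sketch of the DistributedDataParallel setup described above. The environment-variable-based initialization and the placeholder model are assumptions and do not reproduce train_multinode_ddp.py exactly:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torchvision import models

# Assumes the launcher (the sbatch script via srun, or torchrun) exports
# MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE, and LOCAL_RANK; this mirrors a
# typical SLURM/torchrun setup, not necessarily the exact one in train_multinode_ddp.sh.
dist.init_process_group(backend="nccl")

local_rank = int(os.environ.get("LOCAL_RANK", "0"))
torch.cuda.set_device(local_rank)

# One model replica per process, each pinned to its own GPU.
model = models.resnet18(num_classes=100).cuda(local_rank)
model = DDP(model, device_ids=[local_rank])

# Each process should read a distinct shard of the data, typically via
# torch.utils.data.distributed.DistributedSampler in the DataLoader.
# During loss.backward(), DDP overlaps the gradient all-reduce with the
# backward computation, which keeps communication overhead low.

dist.destroy_process_group()
```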