Update the AWS_OFI_NCCL version and add in the MPI HWLOC install (#2651)
* Update the AWS_OFI_NCCL version and add in the MPI HWLOC install

* Move the HWLOC down to the appropriate stage

* Move the HWLOC to the apt-get install

* Remove extra debug arg

---------

Co-authored-by: Mihir Patel <[email protected]>
willgleich and mvpatel2000 authored Oct 19, 2023
1 parent d534e0a commit 174e238
Showing 3 changed files with 16 additions and 6 deletions.
14 changes: 12 additions & 2 deletions docker/Dockerfile
@@ -38,7 +38,7 @@ ARG MOFED_VERSION=5.5-1.0.3.2

# Version of EFA Drivers to install (for AWS Elastic Fabric Adapter support)
# Leave blank for no EFA Drivers
-ARG AWS_OFI_NCCL_VERSION=v1.5.0-aws
+ARG AWS_OFI_NCCL_VERSION=v1.7.3-aws

# Upgrade certifi to resolve CVE-2022-23491
ARG CERTIFI_VERSION='>=2022.12.7'
@@ -210,7 +210,7 @@ RUN if [ -z "$PYTORCH_NIGHTLY_URL" ] ; then \
torch==${PYTORCH_VERSION}.${PYTORCH_NIGHTLY_VERSION} \
torchvision==${TORCHVISION_VERSION}.${PYTORCH_NIGHTLY_VERSION} ; \
fi
RUN

#####################################
# Install EFA and AWS-OFI-NCCL plugin
#####################################
@@ -222,6 +222,16 @@ ENV LD_LIBRARY_PATH=/usr/local/cuda/extras/CUPTI/lib64:/opt/amazon/openmpi/lib:/
ENV PATH=/opt/amazon/openmpi/bin/:/opt/amazon/efa/bin:$PATH
ENV FI_EFA_USE_DEVICE_RDMA=1

+RUN if [ -n "$AWS_OFI_NCCL_VERSION" ] ; then \
+apt-get update && \
+apt-get install -y --no-install-recommends \
+hwloc \
+libhwloc-dev && \
+apt-get autoclean && \
+apt-get clean && \
+rm -rf /var/lib/apt/lists/* ; \
+fi

RUN if [ -n "$AWS_OFI_NCCL_VERSION" ] ; then \
cd /tmp && \
curl -OsS https://efa-installer.amazonaws.com/aws-efa-installer-${EFA_INSTALLER_VERSION}.tar.gz && \
6 changes: 3 additions & 3 deletions docker/build_matrix.yaml
@@ -13,7 +13,7 @@
- mosaicml/pytorch:latest
TARGET: pytorch_stage
TORCHVISION_VERSION: 0.16.0
-- AWS_OFI_NCCL_VERSION: v1.5.0-aws
+- AWS_OFI_NCCL_VERSION: v1.7.3-aws
BASE_IMAGE: nvidia/cuda:12.1.0-cudnn8-devel-ubuntu20.04
CUDA_VERSION: 12.1.0
IMAGE_NAME: torch-2-1-0-cu121-aws
@@ -54,7 +54,7 @@
- mosaicml/pytorch:2.0.1_cu118-python3.10-ubuntu20.04
TARGET: pytorch_stage
TORCHVISION_VERSION: 0.15.2
-- AWS_OFI_NCCL_VERSION: v1.5.0-aws
+- AWS_OFI_NCCL_VERSION: v1.7.3-aws
BASE_IMAGE: nvidia/cuda:11.8.0-cudnn8-devel-ubuntu20.04
CUDA_VERSION: 11.8.0
IMAGE_NAME: torch-2-0-1-cu118-aws
@@ -93,7 +93,7 @@
- mosaicml/pytorch:1.13.1_cu117-python3.10-ubuntu20.04
TARGET: pytorch_stage
TORCHVISION_VERSION: 0.14.1
-- AWS_OFI_NCCL_VERSION: v1.5.0-aws
+- AWS_OFI_NCCL_VERSION: v1.7.3-aws
BASE_IMAGE: nvidia/cuda:11.7.1-cudnn8-devel-ubuntu20.04
CUDA_VERSION: 11.7.1
IMAGE_NAME: torch-1-13-1-cu117-aws
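
The version string changed here also appears in the other two files touched by this commit (the `ARG` default in docker/Dockerfile and the assignment in docker/generate_build_matrix.py). A minimal sketch of a check that scans text for pinned versions is shown below. It is an illustration only: `find_versions` and the regular expression are hypothetical helpers, not part of the repository, and they only recognize the three assignment forms visible in this diff.

```python
import re

# Hypothetical helper, not part of the repository.
# Matches the assignment forms seen in this diff:
#   ARG AWS_OFI_NCCL_VERSION=v1.7.3-aws
#   - AWS_OFI_NCCL_VERSION: v1.7.3-aws
#   entry['AWS_OFI_NCCL_VERSION'] = 'v1.7.3-aws'
# Empty assignments ('') and shell tests like [ -n "$AWS_OFI_NCCL_VERSION" ]
# are deliberately not matched.
_VERSION_RE = re.compile(
    r"AWS_OFI_NCCL_VERSION'?\]?\s*[:=]\s*'?(v[0-9][\w.\-]*)"
)


def find_versions(text: str) -> set:
    """Return the set of distinct AWS_OFI_NCCL_VERSION values pinned in text."""
    return set(_VERSION_RE.findall(text))
```

Running `find_versions` over the concatenated contents of the three files and checking that the result holds exactly one value would flag a stale pin left behind by a partial bump.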
2 changes: 1 addition & 1 deletion docker/generate_build_matrix.py
@@ -180,7 +180,7 @@ def _main():
if interconnect != 'EFA':
entry['AWS_OFI_NCCL_VERSION'] = ''
else:
-entry['AWS_OFI_NCCL_VERSION'] = 'v1.5.0-aws'
+entry['AWS_OFI_NCCL_VERSION'] = 'v1.7.3-aws'

pytorch_entries.append(entry)
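
For orientation, the branch changed in this hunk can be sketched in isolation. The function names `ofi_nccl_version_for` and `build_entry` and the module-level constant are hypothetical. Only the `interconnect != 'EFA'` test, the `entry` key, and the version string are taken from the diff; in the actual file the logic sits inline in `_main()`.

```python
# Hypothetical standalone sketch of the selection logic in this hunk.
AWS_OFI_NCCL_VERSION = 'v1.7.3-aws'  # value introduced by this commit


def ofi_nccl_version_for(interconnect: str) -> str:
    """Return the plugin version for EFA builds, or '' for all others."""
    if interconnect != 'EFA':
        return ''
    return AWS_OFI_NCCL_VERSION


def build_entry(interconnect: str) -> dict:
    """Build a matrix entry holding only the key touched by this hunk."""
    entry = {}
    entry['AWS_OFI_NCCL_VERSION'] = ofi_nccl_version_for(interconnect)
    return entry
```

The empty string matters because the Dockerfile guards its EFA steps, including the new `hwloc` and `libhwloc-dev` install, with `[ -n "$AWS_OFI_NCCL_VERSION" ]`, so an entry with an empty value skips those steps.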
