Pad-8: Upgrade to ROCM 5.6 and add multi-node support. #230

Merged: 7 commits, Jan 4, 2024

@will-HPE (Contributor) commented Nov 2, 2023

Description

Note: this replaces a previous PAD-8 PR.
feat: add multi-node support to the ROCM image.
build: add new targets to the Makefile.
build: make significant changes to Dockerfile-default-rocm.
build: add new support scripts (build_aws_rocm.sh and ompi_rocm.sh).
build: make large changes to scrape_libs.sh.

testing on slurm:
an example command:
srun --mpi=pmi2 -n8 --ntasks-per-node=4 -p bard singularity run --pwd /var/tmp --no-mount=tmp --writable-tmpfs --bind /scratch/testuser/libfabric/:/det_libfabric --bind /:/det_host --rocm w/rocm-5.6-pytorch-2.0-tf-2.10-rocm-0.26.1.sif python /home/users/testuser/horovod/examples/pytorch/pytorch_synthetic_benchmark.py --batch-size=96 --fp16-allreduce --model=resnet50
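
The one-line command above is easier to review when split into its parts. The following is a minimal dry-run sketch, not part of the PR: it only prints the command rather than launching it, and the names `IMAGE`, `SCRIPT`, and `build_test_cmd` are illustrative assumptions. All flags and paths are copied from the command above.

```shell
#!/bin/sh
# Dry-run sketch: prints the srun/singularity test command from the PR
# description without executing it. Names below are illustrative.

IMAGE="w/rocm-5.6-pytorch-2.0-tf-2.10-rocm-0.26.1.sif"
SCRIPT="/home/users/testuser/horovod/examples/pytorch/pytorch_synthetic_benchmark.py"

build_test_cmd() {
    # -n8 with --ntasks-per-node=4 spreads the 8 tasks over at least two nodes,
    # which is what exercises the multi-node path.
    set -- \
        srun --mpi=pmi2 -n8 --ntasks-per-node=4 -p bard \
        singularity run \
        --pwd /var/tmp --no-mount=tmp --writable-tmpfs \
        --bind /scratch/testuser/libfabric/:/det_libfabric \
        --bind /:/det_host \
        --rocm "$IMAGE" \
        python "$SCRIPT" \
        --batch-size=96 --fp16-allreduce --model=resnet50
    # "$*" joins the arguments with single spaces.
    printf '%s\n' "$*"
}

build_test_cmd
```

Printing first makes it easy to confirm the bind mounts and the image path before submitting the job to the partition.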

further testing:
https://github.com/determined-ai/determined-ee/pull/1067

Checklist

  • Bump VERSION to make sure the pushed images are tagged with the right version.
  • Licenses should be included for new code which was copied and/or modified from any external code.
  • Test the images by running the test bumpenvs procedure in the determined repo. See README.

@will-HPE will-HPE self-assigned this Nov 2, 2023
@cla-bot cla-bot bot added the cla-signed label Nov 2, 2023
@will-HPE will-HPE changed the title from "Pad 8" to "Pad-8: Upgrade to ROCM 5.6 and add multi-node support." Nov 3, 2023
RUN if [ "$HOROVOD_PIP" != "0" ]; then pip install "${HOROVOD_PIP}" ; fi

RUN rm -r /tmp/*
RUN pip uninstall -y tb-nightly tensorboardX
Contributor

I don't think these are installed anywhere.

Contributor Author

@will-HPE will-HPE Nov 14, 2023

At a minimum tb-nightly is uninstalled.
Step 81/102 : RUN pip uninstall -y tb-nightly tensorboardX
Successfully uninstalled tb-nightly-2.14.0a20230628
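
Whether a package is actually present in the image can also be checked directly with pip before uninstalling. A minimal sketch, assuming `pip` is on the PATH inside the image; the helper name `uninstall_if_present` and the `PIP` override variable are illustrative and not taken from the PR.

```shell
#!/bin/sh
# Sketch: uninstall only the packages that pip reports as installed,
# and say so explicitly for the ones that are absent.
# PIP can be overridden (e.g. PIP=pip3); this variable is an assumption.
PIP="${PIP:-pip}"

uninstall_if_present() {
    for pkg in "$@"; do
        # `pip show` exits non-zero when the package is not installed.
        if "$PIP" show "$pkg" >/dev/null 2>&1; then
            echo "removing $pkg"
            "$PIP" uninstall -y "$pkg"
        else
            echo "not installed: $pkg"
        fi
    done
}
```

Calling `uninstall_if_present tb-nightly tensorboardX` during the build would leave one line per package in the build log, showing which of the two was present.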

cla-bot bot commented Jan 4, 2024

Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have the users @will-HPE on file. In order for us to review and merge your code, please start the CLA process at https://determined.ai/cla.

After we approve your CLA, we will update the contributors list (private) and comment @cla-bot[bot] check to rerun the check.

@cla-bot cla-bot bot removed the cla-signed label Jan 4, 2024
@will-HPE will-HPE merged commit 7287f56 into main Jan 4, 2024
0 of 2 checks passed
@will-HPE will-HPE deleted the PAD-8 branch January 30, 2024 18:45
will-HPE added a commit that referenced this pull request Jan 30, 2024
* Add docker_scripts; updated Makefile and Dockerfile with support for multinode execution.

* Add docker_scripts.

* Large change in scrape_libs.sh to fix issues when multiple libfabric.so libs are present; also fixed missing Python libs.

* Created clean branch with all the changes needed for ROCM 5.6 multi-node execution.

* Bumped version; updated CircleCI config targets.

* Removed a few extraneous comments.