Pad-8: Upgrade to ROCM 5.6 and add multi-node support. #230
RUN if [ "$HOROVOD_PIP" != "0" ]; then pip install "${HOROVOD_PIP}" ; fi

RUN rm -r /tmp/*
RUN pip uninstall -y tb-nightly tensorboardX
I don't think these are installed anywhere.
At a minimum tb-nightly is uninstalled.
Step 81/102 : RUN pip uninstall -y tb-nightly tensorboardX
Successfully uninstalled tb-nightly-2.14.0a20230628
Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have the users @will-HPE on file. In order for us to review and merge your code, please start the CLA process at https://determined.ai/cla. After we approve your CLA, we will update the contributors list (private) and comment
* Add docker_scripts; updated Makefile and Dockerfile with support for multinode execution.
* Add docker_scripts.
* Large change in scrape_libs.sh to fix issues when multiple libfabric.so libs are present; also fixed missing python libs.
* Created clean branch with all the changes needed for ROCM 5.6 multi-node execution.
* Bumped version; updated CircleCI config targets.
* Removed a few extraneous comments.
Description
note: this replaces a previous PAD-8 PR
feat: add multi-node support to the ROCM image.
build: added new targets to Makefile
build: significant changes to Dockerfile-default-rocm
build: added new support scripts (build_aws_rocm.sh and ompi_rocm.sh)
build: large changes to scrape_libs.sh
testing on slurm:
an example command:
srun --mpi=pmi2 -n8 --ntasks-per-node=4 -p bard singularity run --pwd /var/tmp --no-mount=tmp --writable-tmpfs --bind /scratch/testuser/libfabric/:/det_libfabric --bind /:/det_host --rocm w/rocm-5.6-pytorch-2.0-tf-2.10-rocm-0.26.1.sif python /home/users/testuser/horovod/examples/pytorch/pytorch_synthetic_benchmark.py --batch-size=96 --fp16-allreduce --model=resnet50
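For reviewers checking that the example really exercises more than one node: a minimal sketch of the task arithmetic implied by `-n8 --ntasks-per-node=4`. The variable names are hypothetical and only illustrate the calculation; the script does not invoke `srun` or `singularity`.

```shell
#!/bin/sh
# Hypothetical helper variables; the values mirror the example command above.
NTASKS=8            # from: srun -n8
TASKS_PER_NODE=4    # from: --ntasks-per-node=4

# Minimum number of nodes needed to place all tasks (ceiling division).
NODES=$(( (NTASKS + TASKS_PER_NODE - 1) / TASKS_PER_NODE ))

echo "tasks=${NTASKS} tasks_per_node=${TASKS_PER_NODE} min_nodes=${NODES}"
```

With these values the job needs at least two nodes, so the benchmark's allreduce traffic should cross the node boundary rather than stay on a single host.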
further testing:
https://github.com/determined-ai/determined-ee/pull/1067
Checklist
bumpenvs procedure in the determined repo. See README.