PyTorch test that uses torchvision #130
Conversation
…mount of numa nodes in a node
…ectly from the reframe.core.systems.ProcessorInfo object. See https://reframe-hpc.readthedocs.io/en/stable/regression_test_api.html#reframe.core.systems.ProcessorInfo
…cher-agnostic way, by simply passing them to the Python script as arguments and having that script set them in the environment. Print a clear error if SLURM or OpenMPI's mpirun is not used - we still rely on these to get the local rank, as there is no other way
…uired environment variables have been set.
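For context, the launcher-agnostic approach described in these commits boils down to reading the local rank from whatever the launcher put in the environment. Below is a minimal, hypothetical sketch (not the actual script in this PR), relying only on Slurm's and Open MPI's variables and erroring out otherwise:

# Hedged sketch: SLURM_LOCALID (srun) and OMPI_COMM_WORLD_LOCAL_RANK
# (Open MPI mpirun) are the only sources we rely on for the local rank;
# anything else triggers an explicit error, as described above.
import os
import sys

def get_local_rank():
    if 'SLURM_LOCALID' in os.environ:
        return int(os.environ['SLURM_LOCALID'])
    if 'OMPI_COMM_WORLD_LOCAL_RANK' in os.environ:
        return int(os.environ['OMPI_COMM_WORLD_LOCAL_RANK'])
    sys.exit("ERROR: local rank not found; launch with srun or Open MPI's mpirun")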
Test is still failing on multiple nodes:
Not a very explicit error... The generated jobscript is e.g.
Possibly related:
Note that replacing the
With
I get
So yeah, it does seem to be about mapping ranks to devices...
Yep, that's where the problem is. With mpirun:
with srun:
Our first slot will be on node 0 (cores 0-17), our second slot on node 1 (cores 0-17), the third slot on node 0 (cores 18-35), etc. MPI then assigns ranks by slot, so the first rank lands on node 0 (cores 0-17), the second rank on node 1 (cores 0-17), the third rank on node 0 (cores 18-35), etc. This is different from
which assumes that consecutive ranks are mapped to the same node until that node is full, i.e.
It's probably better / cleaner to do something like what CSCS did at https://github.com/eth-cscs/cscs-reframe-tests/blob/9c076fe2c1904b8e41c56184dba58de657d8c3a7/checks/apps/pytorch/src/pt_distr_env.py, i.e. set up the distributed environment in a separate script. They also set the
I'll first do a fix on my current code, since this:
runs successfully. Using
After that is done, I'll refactor to split off the preparation from the actual test.
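To make the mapping difference concrete, here is a small illustrative sketch (made-up layout: 2 nodes, 2 slots per node) showing why guessing the local rank as rank % ranks_per_node breaks under by-slot mapping:

# Illustration only, with hypothetical numbers: by-slot mapping places
# consecutive ranks on alternating nodes.
by_slot_node = [0, 1, 0, 1]   # node hosting global ranks 0..3
ranks_per_node = 2
for rank, node in enumerate(by_slot_node):
    guessed = rank % ranks_per_node           # assumes block (node-filling) mapping
    actual = by_slot_node[:rank].count(node)  # true local rank on that node
    print(f"rank {rank}: node {node}, guessed local rank {guessed}, actual {actual}")
# Ranks 1 and 2 end up with the wrong local rank, and thus would select the
# wrong GPU/NUMA domain, if the block-mapping assumption is applied to a
# by-slot launch.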
Merging #137 and synchronizing this branch with
Ok,
It's been going for 11 minutes now, the others all finished in ~2 minutes. I'll retest this one later, it might just be an issue on one of the nodes I got allocated - and not an issue with this test at all.
Retest succeeded:
Full run, including CPU tests, was successful:
(note that the other 12 tests are correctly skipped since they are invalid combinations. E.g.
@casparvl can you please merge main, so it's easier to compare and see what changed?
eessi/testsuite/hooks.py
Outdated
@@ -57,7 +57,7 @@ def _assign_default_num_gpus_per_node(test: rfm.RegressionTest):

 def assign_tasks_per_compute_unit(test: rfm.RegressionTest, compute_unit: str, num_per: int = 1):
     """
-    Assign one task per compute unit (COMPUTE_UNIT[CPU], COMPUTE_UNIT[CPU_SOCKET] or COMPUTE_UNIT[GPU]).
+    Assign one task per compute unit.
Assign one task per compute unit.
Assign one task per compute unit. More than 1 task per compute unit can be assigned with num_per for compute units that support it.
Please also update line 83 to:
if num_per != 1 and compute_unit not in [COMPUTE_UNIT[NODE]]:
# Hybrid code, so launch 1 rank per socket.
# Probably, launching 1 task per NUMA domain is even better, but the current hook doesn't support it
I guess these comments no longer apply, as we are now launching 1 rank per NUMA node?
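As an aside, reading the NUMA topology inside a ReFrame hook could look roughly like the sketch below; the attribute names are taken from the ProcessorInfo documentation linked in the commit message above and should be treated as assumptions:

# Hedged sketch: derive a one-task-per-NUMA-domain layout from
# reframe.core.systems.ProcessorInfo (attribute names assumed from the docs).
def one_task_per_numa_node(test):
    proc = test.current_partition.processor     # ProcessorInfo object
    test.num_tasks_per_node = proc.num_numa_nodes
    test.num_cpus_per_task = proc.num_cores_per_numa_node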
hooks.set_compact_process_binding(self)

@run_after('setup')
def set_ddp_env_vars(self): |
def set_ddp_env_vars(self):
def set_ddp_options(self):
# Set environment variables for PyTorch DDP
if self.parallel_strategy == 'ddp':
    # Set additional options required by DDP
    self.executable_opts += ["--master-port $(python python_get_free_socket.py)"]
I was a bit confused by the name of this script. Can we call it get_free_port.py instead?
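Such a helper can be tiny; here is a sketch of what get_free_port.py could look like, assuming the usual bind-to-port-0 trick:

# get_free_port.py -- hedged sketch: ask the kernel for a free ephemeral port
# by binding to port 0, then print it so it can be passed as --master-port.
import socket

with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    s.bind(('', 0))
    print(s.getsockname()[1])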
# the compute threads. It was fixed, but seems to be broken again in Horovod 0.28.1
# The easiest workaround is to reduce the number of compute threads by 2
if self.compute_device == DEVICE_TYPES[CPU] and self.parallel_strategy == 'horovod':
    self.env_vars['OMP_NUM_THREADS'] = max(self.num_cpus_per_task - 2, 2)  # Never go below 2 compute threads
What if you have only 1 or 2 cores? Is it still better to have 2 compute threads in that case?
I'll take it out. We don't have parallel_strategy=horovod in this test anyway; that was a remnant of our (internal) test that I ported.
If we ever deploy Horovod (with PyTorch support), we could bring this back. But then it'd be a Horovod test, not a PyTorch test :)
I ran a test with 1 GPU but it failed. Apparently
Darn, I realize I didn't look into this yet... I still need to fix it, but don't have time now :\
…onditionally, as this would be evaluated on the login node. That environment is irrelevant to the batch job.
single-GPU test:
2-GPU test:
Looking good!
Work in progress...