TrainerDDPMixin.resolve_root_node_address fails if the host name contains a dash
#1943
Labels
help wanted
Open to be worked on
🐛 Bug
When running under SLURM, if the nodes' host names contain a dash (-), MASTER_ADDR omits the number part of the master host name (which of course makes everything else fail). If MASTER_ADDR is not given, pl infers it from the node list using TrainerDDPMixin.resolve_root_node_address.
To Reproduce
Steps to reproduce the behavior:
jean-zay-ia810)
The job should fail with something like this
In my case, looking into os.environ gives 'SLURM_NODELIST': 'jean-zay-ia[810,817-819]' and 'MASTER_ADDR': 'jean-zay-ia'. The root cause is that
>>> resolve_root_node_address("jean-zay-ia[810,817-819]")
'jean-zay-ia'
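For illustration, here is a minimal sketch of how the first host could be extracted without being confused by dashes in the host name prefix. It is not the library's implementation: the function name resolve_root_node_address_sketch is hypothetical, and it assumes the node list uses the compressed "prefix[ranges]" form shown above, looking for range dashes only inside the brackets.

```python
import re


def resolve_root_node_address_sketch(nodelist: str) -> str:
    """Return the first host named in a compressed SLURM-style node list.

    Hypothetical helper for illustration only; assumes the
    "prefix[n1,n2-n3]" form seen in SLURM_NODELIST.
    """
    # Capture the prefix before the first bracket and the bracket contents.
    match = re.match(r"^([^\[,]+)\[([^\]]+)\]", nodelist)
    if match is None:
        # No bracket on the first entry: it is already a plain host name.
        return nodelist.split(",")[0]
    prefix, ranges = match.groups()
    # Take the first entry inside the brackets, then the start of its range.
    # Dashes in the prefix are never inspected, so they cannot interfere.
    first = ranges.split(",")[0].split("-")[0]
    return prefix + first


print(resolve_root_node_address_sketch("jean-zay-ia[810,817-819]"))  # jean-zay-ia810
```

With this approach, a host name without brackets (with or without dashes) is returned unchanged.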
Fixing it should not be hard. Should I submit a PR?