Hello!
I am using wandb-osh on a multi-GPU cluster with SLURM, together with pytorch-lightning. It correctly syncs the initial sanity validation phase, but then at the end of the first epoch it crashes:
Any clue what this might be about? Thanks in advance!
Hmm, not sure... Are there multiple processes/nodes? Perhaps the sanity check still runs on a single node/process and works, but then there are multiple forks on different nodes and wandb.run is not set everywhere?
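One way to check that hypothesis would be a small probe callback (just an illustrative sketch, the class name is made up) that prints on every rank whether wandb.run is initialized after the sanity check and after each training epoch:

```python
# Illustrative probe (not from the issue): report on every rank whether
# wandb.run is set after the sanity check and after each training epoch.
import wandb
from pytorch_lightning import Callback


class WandbRunProbe(Callback):
    def _report(self, trainer, where: str) -> None:
        # wandb.run is None on any rank where wandb.init() was never called
        print(f"[rank {trainer.global_rank}] {where}: wandb.run={wandb.run}")

    def on_sanity_check_end(self, trainer, pl_module):
        self._report(trainer, "sanity check end")

    def on_train_epoch_end(self, trainer, pl_module):
        self._report(trainer, "train epoch end")
```

If wandb.run comes back as None on the non-zero ranks, guarding the sync trigger so it only fires when trainer.is_global_zero is True would be the obvious workaround.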
I should be using 1 node with 4 GPUs; I am attaching my SLURM configuration in case it helps.
```bash
#SBATCH -A my_acc             # account, shared by all the project members
#SBATCH --time 24:00:00       # max time limit after which the job is killed, format: HH:MM:SS
#SBATCH -N 1                  # 1 node, should match the number of nodes in pl.Trainer
#SBATCH --ntasks-per-node=4   # 4 tasks out of 32, should match the number of devices in pl.Trainer
#SBATCH --gres=gpu:4          # 4 GPUs per node out of 4
#SBATCH --mem=100000          # memory per node out of 494000MB (481GB)

conda activate myenv
wandb offline
HYDRA_FULL_ERROR=1 HF_HUB_OFFLINE=1 srun python src/scripts/train.py
```
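For reference, a Trainer call matching that allocation (1 node, 4 tasks/GPUs) would look roughly like the sketch below. The project name is a placeholder, and the callback import follows wandb-osh's documented Lightning integration; adjust if your version differs.

```python
# Sketch of a Trainer configuration matching the SLURM allocation above
# (1 node, 4 tasks, 4 GPUs). Names marked as placeholders are assumptions.
import pytorch_lightning as pl
from pytorch_lightning.loggers import WandbLogger
from wandb_osh.lightning_hooks import TriggerWandbSyncLightningCallback

logger = WandbLogger(project="my-project", offline=True)  # placeholder project name
trainer = pl.Trainer(
    num_nodes=1,        # matches #SBATCH -N 1
    devices=4,          # matches --ntasks-per-node=4 and --gres=gpu:4
    accelerator="gpu",
    strategy="ddp",
    logger=logger,
    callbacks=[TriggerWandbSyncLightningCallback()],
)
```

With `wandb offline` already set in the job script, the `offline=True` argument is redundant, but it makes the intent explicit.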