'NoneType' object has no attribute 'dir' #115

Open
crisostomi opened this issue Aug 30, 2024 · 2 comments

Comments

@crisostomi

Hello!

I am using wandb-osh on a multi-GPU SLURM cluster with PyTorch Lightning.
The initial sanity validation phase syncs correctly, but the run then crashes at the end of the first epoch:

Error executing job with overrides: []
Traceback (most recent call last):
  File "/leonardo_work/IscrC_MGNTC/dcrisost/radiomark/src/scripts/train_watermarker.py", line 46, in <module>
    main()
  File "/leonardo/home/userexternal/dcrisost/miniconda3/envs/radiomark/lib/python3.10/site-packages/hydra/main.py", line 94, in decorated_main
    _run_hydra(
  File "/leonardo/home/userexternal/dcrisost/miniconda3/envs/radiomark/lib/python3.10/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
    _run_app(
  File "/leonardo/home/userexternal/dcrisost/miniconda3/envs/radiomark/lib/python3.10/site-packages/hydra/_internal/utils.py", line 457, in _run_app
    run_and_report(
  File "/leonardo/home/userexternal/dcrisost/miniconda3/envs/radiomark/lib/python3.10/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
    raise ex
  File "/leonardo/home/userexternal/dcrisost/miniconda3/envs/radiomark/lib/python3.10/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
    return func()
  File "/leonardo/home/userexternal/dcrisost/miniconda3/envs/radiomark/lib/python3.10/site-packages/hydra/_internal/utils.py", line 458, in <lambda>
    lambda: hydra.run(
  File "/leonardo/home/userexternal/dcrisost/miniconda3/envs/radiomark/lib/python3.10/site-packages/hydra/_internal/hydra.py", line 132, in run
    _ = ret.return_value
  File "/leonardo/home/userexternal/dcrisost/miniconda3/envs/radiomark/lib/python3.10/site-packages/hydra/core/utils.py", line 260, in return_value
    raise self._return_value
  File "/leonardo/home/userexternal/dcrisost/miniconda3/envs/radiomark/lib/python3.10/site-packages/hydra/core/utils.py", line 186, in run_job
    ret.return_value = task_function(task_cfg)
  File "/leonardo_work/IscrC_MGNTC/dcrisost/radiomark/src/scripts/train_watermarker.py", line 22, in main
    run(cfg)
  File "/leonardo_work/IscrC_MGNTC/dcrisost/radiomark/src/scripts/train_watermarker.py", line 40, in run
    trainer.fit(model, datamodule)
  File "/leonardo/home/userexternal/dcrisost/miniconda3/envs/radiomark/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 543, in fit
    call._call_and_handle_interrupt(
  File "/leonardo/home/userexternal/dcrisost/miniconda3/envs/radiomark/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 43, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/leonardo/home/userexternal/dcrisost/miniconda3/envs/radiomark/lib/python3.10/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 105, in launch
    return function(*args, **kwargs)
  File "/leonardo/home/userexternal/dcrisost/miniconda3/envs/radiomark/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 579, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/leonardo/home/userexternal/dcrisost/miniconda3/envs/radiomark/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 986, in _run
    results = self._run_stage()
  File "/leonardo/home/userexternal/dcrisost/miniconda3/envs/radiomark/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 1030, in _run_stage
    self.fit_loop.run()
  File "/leonardo/home/userexternal/dcrisost/miniconda3/envs/radiomark/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py", line 205, in run
    self.advance()
  File "/leonardo/home/userexternal/dcrisost/miniconda3/envs/radiomark/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py", line 363, in advance
    self.epoch_loop.run(self._data_fetcher)
  File "/leonardo/home/userexternal/dcrisost/miniconda3/envs/radiomark/lib/python3.10/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 141, in run
    self.on_advance_end(data_fetcher)
  File "/leonardo/home/userexternal/dcrisost/miniconda3/envs/radiomark/lib/python3.10/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 295, in on_advance_end
    self.val_loop.run()
  File "/leonardo/home/userexternal/dcrisost/miniconda3/envs/radiomark/lib/python3.10/site-packages/lightning/pytorch/loops/utilities.py", line 182, in _decorator
    return loop_run(self, *args, **kwargs)
  File "/leonardo/home/userexternal/dcrisost/miniconda3/envs/radiomark/lib/python3.10/site-packages/lightning/pytorch/loops/evaluation_loop.py", line 142, in run
    return self.on_run_end()
  File "/leonardo/home/userexternal/dcrisost/miniconda3/envs/radiomark/lib/python3.10/site-packages/lightning/pytorch/loops/evaluation_loop.py", line 254, in on_run_end
    self._on_evaluation_epoch_end()
  File "/leonardo/home/userexternal/dcrisost/miniconda3/envs/radiomark/lib/python3.10/site-packages/lightning/pytorch/loops/evaluation_loop.py", line 333, in _on_evaluation_epoch_end
    call._call_callback_hooks(trainer, hook_name)
  File "/leonardo/home/userexternal/dcrisost/miniconda3/envs/radiomark/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 210, in _call_callback_hooks
    fn(trainer, trainer.lightning_module, *args, **kwargs)
  File "/leonardo/home/userexternal/dcrisost/miniconda3/envs/radiomark/lib/python3.10/site-packages/wandb_osh/lightning_hooks.py", line 52, in on_validation_epoch_end
    self._hook()
  File "/leonardo/home/userexternal/dcrisost/miniconda3/envs/radiomark/lib/python3.10/site-packages/wandb_osh/hooks.py", line 42, in __call__
    logdir = Path(wandb.run.dir).parent.resolve()
AttributeError: 'NoneType' object has no attribute 'dir'

Any clue what this might be about?

Thanks in advance!

@klieret
Owner

klieret commented Aug 30, 2024

Hmm, not sure... Are there multiple processes/nodes? Perhaps the sanity check still runs on a single node/process and works, but afterwards there are multiple forks on different nodes and wandb.run is not set in all of them?
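If that hypothesis holds, the failing expression from the traceback, `Path(wandb.run.dir).parent.resolve()`, would raise in any process where `wandb.run` is `None`. Below is a minimal sketch of a guard around that expression. The helper name `resolve_run_logdir` is hypothetical and not part of wandb-osh, and the run object is passed in as a parameter so the snippet can be tried without wandb installed.

```python
from pathlib import Path
from types import SimpleNamespace
from typing import Optional


def resolve_run_logdir(run) -> Optional[Path]:
    """Return the parent of `run.dir`, or None when no run is active.

    `run` stands in for `wandb.run`; this helper is a hypothetical
    illustration, not the wandb-osh implementation.
    """
    if run is None:
        # No active run in this process (for example a non-zero rank,
        # if only rank 0 initialised wandb), so there is nothing to sync.
        return None
    return Path(run.dir).parent.resolve()


# Stand-ins for `wandb.run`: None, and an object exposing a `dir` attribute.
print(resolve_run_logdir(None))
fake_run = SimpleNamespace(dir="/tmp/wandb/offline-run-0/files")
print(resolve_run_logdir(fake_run))
```

A caller would skip the sync trigger whenever the helper returns `None` instead of raising `AttributeError`. Whether skipping is the right behaviour for wandb-osh is an assumption here, not something confirmed in this thread.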

@crisostomi
Author

crisostomi commented Aug 30, 2024

I should be using 1 node with 4 GPUs; I am attaching my SLURM configuration in case it helps.

#SBATCH -A my_acc                           # account, shared by all the project members
#SBATCH --time 24:00:00                     # max time limit after which the job is killed, format: HH:MM:SS
#SBATCH -N 1                                # 1 node, should match the number of nodes in pl.Trainer
#SBATCH --ntasks-per-node=4                 # 4 tasks out of 32, should match the number of devices in pl.Trainer
#SBATCH --gres=gpu:4                        # 4 gpus per node out of 4
#SBATCH --mem=100000                        # memory per node out of 494000MB (481GB)

conda activate myenv

wandb offline

HYDRA_FULL_ERROR=1 HF_HUB_OFFLINE=1 srun python src/scripts/train.py
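
With `srun` and `--ntasks-per-node=4`, SLURM starts four copies of the script on the single node, and each copy receives its own `SLURM_PROCID`. So even on one node there are several Python processes, which fits the multi-process hypothesis above. A minimal sketch for checking which rank a process is, assuming the rank is exposed through `RANK` (set by launchers such as torchrun) or `SLURM_PROCID` (set by `srun`); the function name `global_rank_from_env` is hypothetical:

```python
import os
from typing import Mapping, Optional


def global_rank_from_env(env: Mapping[str, str] = os.environ) -> Optional[int]:
    """Best-effort global rank lookup from common launcher variables.

    Checks RANK first, then SLURM_PROCID, and returns None when
    neither variable is set.
    """
    for key in ("RANK", "SLURM_PROCID"):
        value = env.get(key)
        if value is not None:
            return int(value)
    return None


# Under the job script above, each of the four tasks would report a different value.
print(f"rank from environment: {global_rank_from_env()}")
```

Logging this value next to `wandb.run is None` from inside the training script would show whether the crash only happens in processes other than rank 0.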
