Syncing doesn't work with wandb #30

Closed
barthelemymp opened this issue Jan 17, 2023 · 26 comments · Fixed by #31 or #34
Labels
bug Something isn't working

Comments

@barthelemymp

barthelemymp commented Jan 17, 2023

Hello,

First, thank you for creating this tool!
Unfortunately, I cannot get it to work.
I get this error each time I use trigger_sync:

^[[36mDEBUG: Wrote command file /linkhome/rech/genjqx01/urz96ze/.wandb_osh_command_dir/files.command^[[0m^M
^[[33mWARNING: Syncing not active or too slow: Command /linkhome/rech/genjqx01/urz96ze/.wandb_osh_command_dir/files.command file still exists^[[0m^M
^[[36mDEBUG: Wrote command file /linkhome/rech/genjqx01/urz96ze/.wandb_osh_command_dir/files.command^[[0m^M
^[[33mWARNING: Syncing not active or too slow: Command /linkhome/rech/genjqx01/urz96ze/.wandb_osh_command_dir/files.command file still exists^[[0m^M
^[[36mDEBUG: Wrote command file /linkhome/rech/genjqx01/urz96ze/.wandb_osh_command_dir/files.command^[[0m^M
^[[33mWARNING: Syncing not active or too slow: Command /linkhome/rech/genjqx01/urz96ze/.wandb_osh_command_dir/files.command file still exists^[[0m^M
^[[36mDEBUG: Wrote command file /linkhome/rech/genjqx01/urz96ze/.wandb_osh_command_dir/files.command^[[0m^M
^[[33mWARNING: Syncing not active or too slow: Command /linkhome/rech/genjqx01/urz96ze/.wandb_osh_command_dir/files.command file still exists^[[0m^M
^[[36mDEBUG: Wrote command file /linkhome/rech/genjqx01/urz96ze/.wandb_osh_command_dir/files.command^[[0m^M
^[[33mWARNING: Syncing not active or too slow: Command /linkhome/rech/genjqx01/urz96ze/.wandb_osh_command_dir/files.command file still exists^[[0m^M
^[[36mDEBUG: Wrote command file /linkhome/rech/genjqx01/urz96ze/.wandb_osh_command_dir/files.command^[[0m^M
^[[33mWARNING: Syncing not active or too slow: Command /linkhome/rech/genjqx01/urz96ze/.wandb_osh_command_dir/files.command file still exists^[[0m^M
^[[36mDEBUG: Wrote command file /linkhome/rech/genjqx01/urz96ze/.wandb_osh_command_dir/files.command^[[0m^M
^[[33mWARNING: Syncing not active or too slow: Command /linkhome/rech/genjqx01/urz96ze/.wandb_osh_command_dir/files.command file still exists^[[0m^M
^[[36mDEBUG: Wrote command file /linkhome/rech/genjqx01/urz96ze/.wandb_osh_command_dir/files.command^[[0m^M
^[[33mWARNING: Syncing not active or too slow: Command /linkhome/rech/genjqx01/urz96ze/.wandb_osh_command_dir/files.command file still exists^[[0m^M
^[[36mDEBUG: Wrote command file /linkhome/rech/genjqx01/urz96ze/.wandb_osh_command_dir/files.command^[[0m^M
^[[33mWARNING: Syncing not active or too slow: Command /linkhome/rech/genjqx01/urz96ze/.wandb_osh_command_dir/files.command file still exists^[[0m^M
^[[36mDEBUG: Wrote command file /linkhome/rech/genjqx01/urz96ze/.wandb_osh_command_dir/files.command^[[0m^M
^[[33mWARNING: Syncing not active or too slow: Command /linkhome/rech/genjqx01/urz96ze/.wandb_osh_command_dir/files.command file still exists^[[0m^M
^[[36mDEBUG: Wrote command file /linkhome/rech/genjqx01/urz96ze/.wandb_osh_command_dir/files.command^[[0m^M
^[[33mWARNING: Syncing not active or too slow: Command /linkhome/rech/genjqx01/urz96ze/.wandb_osh_command_dir/files.command file still exists^[[0m^M
^[[36mDEBUG: Wrote command file /linkhome/rech/genjqx01/urz96ze/.wandb_osh_command_dir/files.command^[[0m^M
^[[33mWARNING: Syncing not active or too slow: Command /linkhome/rech/genjqx01/urz96ze/.wandb_osh_command_dir/files.command file still exists^[[0m^M
^[[36mDEBUG: Wrote command file /linkhome/rech/genjqx01/urz96ze/.wandb_osh_command_dir/files.command^[[0m^M
^[[33mWARNING: Syncing not active or too slow: Command /linkhome/rech/genjqx01/urz96ze/.wandb_osh_command_dir/files.command file still exists^[[0m^M
^[[36mDEBUG: Wrote command file /linkhome/rech/genjqx01/urz96ze/.wandb_osh_command_dir/files.command^[[0m^M

I am not sure where it comes from... Any idea?

best

b

@klieret
Owner

klieret commented Jan 17, 2023

Just double checking: You do run the wandb-osh script on your head node as well, right?

The hook (whose output you see above) creates a file in ~/.wandb_osh_command_dir after every epoch that tells wandb-osh what to sync. Every time wandb-osh syncs, it removes the file again. If, however, the sync hasn't happened yet (so the file still exists) when the next epoch completes, you see this warning.
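
To make the handshake described above concrete, here is a minimal sketch in plain Python. It is not the wandb-osh source: the function names `write_command_file` and `process_command_files`, the file name `demo.command`, and the decision to overwrite a pending request are assumptions made for illustration. Only the idea (the hook writes a command file, the watcher removes it once handled) comes from the explanation above.

```python
import tempfile
from pathlib import Path


def write_command_file(command_dir: Path, run_dir: Path, name: str = "demo.command") -> bool:
    """Hook side (hypothetical): ask the watcher to sync run_dir.

    Returns False if the previous request is still waiting, which is the
    situation behind the "Syncing not active or too slow" warning.
    """
    command_dir.mkdir(parents=True, exist_ok=True)
    command_file = command_dir / name
    still_pending = command_file.exists()
    command_file.write_text(str(run_dir))
    return not still_pending


def process_command_files(command_dir: Path) -> list:
    """Watcher side (hypothetical): collect requested paths, then delete the files."""
    requested = []
    for command_file in sorted(command_dir.glob("*.command")):
        requested.append(command_file.read_text().strip())
        command_file.unlink()  # removing the file marks the request as handled
    return requested


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        command_dir = Path(tmp) / "command_dir"
        run_dir = Path("/some/run")
        print(write_command_file(command_dir, run_dir))  # first epoch
        print(write_command_file(command_dir, run_dir))  # watcher has not run yet
        print(process_command_files(command_dir))        # watcher catches up
        print(write_command_file(command_dir, run_dir))  # next epoch, no warning
```

In this sketch the second call returns False because no watcher removed the file in between, which mirrors what happens when wandb-osh is not running on the head node.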

@klieret added the question (Further information is requested) label Jan 17, 2023
@klieret changed the title from 'not syncing' to '"Syncing not active or too slow" warnings' Jan 17, 2023
@barthelemymp
Author

barthelemymp commented Jan 18, 2023

Thank you for your reply.
I am a bit confused about how to use the wandb-osh command.
Should I add the command to the shell script where I launch the Python script?
Thank you.

@klieret
Owner

klieret commented Jan 18, 2023

Could you describe your setup? Since you're using this package, I assume you are running your ML on a batch system where the compute nodes don't have internet.
In this case, submit your jobs, including the hook in the code as shown on the readme, and then, on the same server where you submitted your jobs, start wandb-osh in parallel.

@barthelemymp
Author

Yes, I have a head node from which I launch jobs on the compute nodes with sbatch. And yes, the head node has internet access and the others don't.
"on the same server where you submitted your jobs, start wandb-osh in parallel": you mean the head node?

Tell me if this is right:
(HEAD)$ sbatch myscript.sh
(HEAD)$ tmux new -s wosh
(HEAD)$ wandb-osh

the myscript.sh looks like that:

#!/bin/bash
#SBATCH --job-name=pytorch_mnist     # job name
#SBATCH --ntasks=1                   # number of MPI tasks
#SBATCH --ntasks-per-node=1          # number of MPI tasks per node
#SBATCH --gres=gpu:1                 # number of GPUs per node
#SBATCH --cpus-per-task=10           # number of cores per task
#SBATCH --hint=nomultithread         # we get physical cores not logical
#SBATCH --distribution=block:block   # we pin the tasks on contiguous cores
#SBATCH --time=3:00:00              # maximum execution time (HH:MM:SS)
#SBATCH --output=pytorch_mnist%j.out # output file name
#SBATCH --error=pytorch_mnist%j.err  # error file name

set -x
cd ${SLURM_SUBMIT_DIR}
export WANDB_MODE="offline"
module purge
module load pytorch-gpu/py3/1.11.0

python ./mnist_example.py 

@barthelemymp
Author

If I try what I just proposed, I get this in the stderr of my script:

^[[36mDEBUG: Wrote command file /linkhome/rech/genjqx01/urz96ze/.wandb_osh_command_dir/files.command^[[0m^M
^[[33mWARNING: Syncing not active or too slow: Command /linkhome/rech/genjqx01/urz96ze/.wandb_osh_command_dir/files.command file still exists^[[0m^M
^[[36mDEBUG: Wrote command file /linkhome/rech/genjqx01/urz96ze/.wandb_osh_command_dir/files.command^[[0m^M
^[[36mDEBUG: Wrote command file /linkhome/rech/genjqx01/urz96ze/.wandb_osh_command_dir/files.command^[[0m^M
^[[36mDEBUG: Wrote command file /linkhome/rech/genjqx01/urz96ze/.wandb_osh_command_dir/files.command^[[0m^M
^[[36mDEBUG: Wrote command file /linkhome/rech/genjqx01/urz96ze/.wandb_osh_command_dir/files.command^[[0m^M
^[[36mDEBUG: Wrote command file /linkhome/rech/genjqx01/urz96ze/.wandb_osh_command_dir/files.command^[[0m^M
^[[36mDEBUG: Wrote command file /linkhome/rech/genjqx01/urz96ze/.wandb_osh_command_dir/files.command^[[0m^M
^[[36mDEBUG: Wrote command file /linkhome/rech/genjqx01/urz96ze/.wandb_osh_command_dir/files.command^[[0m^M
^[[36mDEBUG: Wrote command file /linkhome/rech/genjqx01/urz96ze/.wandb_osh_command_dir/files.command^[[0m^M
^[[36mDEBUG: Wrote command file /linkhome/rech/genjqx01/urz96ze/.wandb_osh_command_dir/files.command^[[0m^M
^[[36mDEBUG: Wrote command file /linkhome/rech/genjqx01/urz96ze/.wandb_osh_command_dir/files.command^[[0m^M

and in the tmux session where wandb-osh is running I have:

INFO: Starting to watch /linkhome/rech/genjqx01/urz96ze/.wandb_osh_command_dir
INFO: Syncing /gpfsdswork/projects/rech/mdb/urz96ze/jean-zay-doc/docs/examples/pytorch/mnist/wandb/offline-run-20230118_170057-2dyqzdo6/files...                                                           
wandb: No runs to be synced.
INFO: Syncing /gpfsdswork/projects/rech/mdb/urz96ze/jean-zay-doc/docs/examples/pytorch/mnist/wandb/offline-run-20230118_170057-2dyqzdo6/files...                                                           
wandb: No runs to be synced.
INFO: Syncing /gpfsdswork/projects/rech/mdb/urz96ze/jean-zay-doc/docs/examples/pytorch/mnist/wandb/offline-run-20230118_170057-2dyqzdo6/files...                                                           
wandb: No runs to be synced.
INFO: Syncing /gpfsdswork/projects/rech/mdb/urz96ze/jean-zay-doc/docs/examples/pytorch/mnist/wandb/offline-run-20230118_170057-2dyqzdo6/files...                                                           
wandb: No runs to be synced.
INFO: Syncing /gpfsdswork/projects/rech/mdb/urz96ze/jean-zay-doc/docs/examples/pytorch/mnist/wandb/offline-run-20230118_170057-2dyqzdo6/files...                                                           
wandb: No runs to be synced.
INFO: Syncing /gpfsdswork/projects/rech/mdb/urz96ze/jean-zay-doc/docs/examples/pytorch/mnist/wandb/offline-run-20230118_170057-2dyqzdo6/files...                                                           
wandb: No runs to be synced.
INFO: Syncing /gpfsdswork/projects/rech/mdb/urz96ze/jean-zay-doc/docs/examples/pytorch/mnist/wandb/offline-run-20230118_170057-2dyqzdo6/files...                                                           
wandb: No runs to be synced.
INFO: Syncing /gpfsdswork/projects/rech/mdb/urz96ze/jean-zay-doc/docs/examples/pytorch/mnist/wandb/offline-run-20230118_170057-2dyqzdo6/files...                                                           
wandb: No runs to be synced.
INFO: Syncing /gpfsdswork/projects/rech/mdb/urz96ze/jean-zay-doc/docs/examples/pytorch/mnist/wandb/offline-run-20230118_170057-2dyqzdo6/files...                                                           
wandb: No runs to be synced.
INFO: Syncing /gpfsdswork/projects/rech/mdb/urz96ze/jean-zay-doc/docs/examples/pytorch/mnist/wandb/offline-run-20230118_170057-2dyqzdo6/files...                                                           
wandb: No runs to be synced.
INFO: Syncing /gpfsdswork/projects/rech/mdb/urz96ze/jean-zay-doc/docs/examples/pytorch/mnist/wandb/offline-run-20230118_170057-2dyqzdo6/files...                                                           
wandb: No runs to be synced.
INFO: Syncing /gpfsdswork/projects/rech/mdb/urz96ze/jean-zay-doc/docs/examples/pytorch/mnist/wandb/offline-run-20230118_170057-2dyqzdo6/files...                                                           
wandb: No runs to be synced.
INFO: Syncing /gpfsdswork/projects/rech/mdb/urz96ze/jean-zay-doc/docs/examples/pytorch/mnist/wandb/offline-run-20230118_170057-2dyqzdo6/files...                                                           
wandb: No runs to be synced.

@klieret
Owner

klieret commented Jan 18, 2023

Yes, that's the correct procedure. The first Syncing not active or too slow is to be expected (and doesn't matter at all), because you start wandb-osh after two epochs have already been completed.

The real question is why wandb sync is showing No runs to be synced.

I actually think this is a bug in wandb-osh: It seems to set up wandb/offline-run-20230118_170057-2dyqzdo6/files for syncing, rather than just wandb/offline-run-20230118_170057-2dyqzdo6/
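
As an illustration of the mismatch described above, the following sketch maps a path ending in `/files` back to its run directory. The helper name `run_dir_for_sync` is hypothetical, and this is not presented as the change that went into wandb-osh; it only shows the relationship between the two paths.

```python
from pathlib import Path


def run_dir_for_sync(path: str) -> Path:
    """Return the offline run directory, dropping a trailing 'files' component."""
    p = Path(path)
    return p.parent if p.name == "files" else p


if __name__ == "__main__":
    print(run_dir_for_sync("wandb/offline-run-20230118_170057-2dyqzdo6/files"))
    print(run_dir_for_sync("wandb/offline-run-20230118_170057-2dyqzdo6"))
```

Both calls print the same run directory, since a path that already points at the run directory is returned unchanged.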

I've always tested with ray tune, so that's why I might not have been aware of this.

I will fix this in the next two hours and then let you know. I'd be super happy if you could test again then.

@klieret added the bug (Something isn't working) label and removed the question (Further information is requested) label Jan 18, 2023
@klieret changed the title from '"Syncing not active or too slow" warnings' to 'Syncing doesn't work with wandb' Jan 18, 2023
@barthelemymp
Author

Thanks :) I'll do that.

@barthelemymp
Author

I managed to make it work with
wandb-osh -- --include-offline /gpfswork/rech/mdb/urz96ze/jean-zay-doc/docs/examples/pytorch/mnist/wandb/offline-*
I don't know if it helps.
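
The shell glob in the workaround above expands to every `offline-*` entry in the `wandb` directory. A minimal Python sketch of the same selection, assuming a hypothetical helper `find_offline_runs` and a throwaway directory layout created only for the demo:

```python
import tempfile
from pathlib import Path


def find_offline_runs(wandb_dir: Path) -> list:
    """Return the offline-* directories directly below wandb_dir, sorted by name."""
    return sorted(p for p in wandb_dir.glob("offline-*") if p.is_dir())


if __name__ == "__main__":
    # Build a temporary wandb/ directory so the sketch runs anywhere.
    with tempfile.TemporaryDirectory() as tmp:
        wandb_dir = Path(tmp) / "wandb"
        for name in ("offline-run-a", "offline-run-b", "not-a-run"):
            (wandb_dir / name).mkdir(parents=True)
        for run in find_offline_runs(wandb_dir):
            print(run.name)
```

Unlike the shell glob, this sketch skips plain files whose names happen to start with `offline-`.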

@klieret
Owner

klieret commented Jan 18, 2023

Yes, that fixes the bug with the wrong run directories that were assumed by wandb_osh. I've now fixed that in v1.0.3.

Could you test my fix by updating the package (pip3 install --upgrade wandb_osh) and then simply trying with wandb-osh (no other arguments required)?

@klieret
Owner

klieret commented Jan 18, 2023

@all-contributors please add @barthelemymp for bug

@allcontributors
Contributor

@klieret

I've put up a pull request to add @barthelemymp! 🎉

@barthelemymp
Author

barthelemymp commented Jan 18, 2023

Nope, I still get:

INFO: Syncing /gpfsdswork/projects/rech/mdb/urz96ze/jean-zay-doc/docs/examples/pytorch/mnist/wandb/offline-run-20230118_193002-fnmizw5d/files...
wandb: No runs to be synced.
INFO: Syncing /gpfsdswork/projects/rech/mdb/urz96ze/jean-zay-doc/docs/examples/pytorch/mnist/wandb/offline-run-20230118_193002-fnmizw5d/files...
wandb: No runs to be synced.
INFO: Syncing /gpfsdswork/projects/rech/mdb/urz96ze/jean-zay-doc/docs/examples/pytorch/mnist/wandb/offline-run-20230118_193002-fnmizw5d/files...
wandb: No runs to be synced.

When installing, I had to add the path by hand. Do you have a command to check that the wandb-osh I call is the updated one?

@klieret
Owner

klieret commented Jan 18, 2023

Can you check python3 -m pip freeze | grep wandb-osh for the version?
Because this still looks like it's using the old version...

@klieret
Owner

klieret commented Jan 18, 2023

Alternatively, you can do

import wandb_osh
print(wandb_osh.__version__)
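
A variant of the check above that works even when the package cannot be imported, and that also reports which interpreter is running. This is only a sketch: the helper name `installed_version` is hypothetical, and it relies on the standard-library `importlib.metadata` module (Python 3.8 or newer).

```python
import sys
from importlib import metadata


def installed_version(distribution: str):
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # The interpreter path matters when the batch job uses a different
    # environment (module, conda env, ...) than the login shell.
    print("interpreter:", sys.executable)
    print("wandb-osh:", installed_version("wandb-osh"))
```

Running it once on the head node and once inside the batch script shows whether both environments see the same version.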

@barthelemymp
Author

barthelemymp commented Jan 18, 2023

wandb-osh==1.0.3: I have the right version, and the problem remains.

Tell me if I can do some more tests on my side.

Best Barthelemy

@klieret
Owner

klieret commented Jan 18, 2023

Just double checking: It's also updated in the python you use in the batch scripts, right? (just in case you use some conda env there, etc.). The fix was related to the hook that is included in the python package, not the wandb-osh executable.

Because I cannot believe that it still points to the paths that end in /files with the new version...

You could also do

python -m pip install --upgrade --force-reinstall 'wandb-osh@git+https://github.com/klieret/wandb-offline-sync-hook.git@main'

and then try again, as the newest version now prints out the version number at the beginning

@klieret
Owner

klieret commented Jan 18, 2023

If running your toy analysis is too much work, you can also try this simple snippet here:

#!/usr/bin/env python3

import wandb
import os
from wandb_osh.hooks import TriggerWandbSyncHook

sync_hook = TriggerWandbSyncHook()

os.environ['WANDB_SILENT'] = 'true'
os.environ["WANDB_MODE"] = "offline"
wandb.init()
wandb.log({"loss": 123})
sync_hook()

Run it and it should print something like

INFO: This is wandb-osh v1.0.3 using communication directory /Users/fuchur/.wandb_osh_command_dir
DEBUG: Wrote command file /Users/xxx/.wandb_osh_command_dir/1cf846.command

and if you do cat /Users/xxx/.wandb_osh_command_dir/1cf846.command (use the path from the debug message you just saw), it should show something like

/Users/xxx/Documents/23/git_sync/wandb-osh-tests/wandb/offline-run-20230118_155559-1rgh98sl

(note how it doesn't end in /files)
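
The manual `cat` check above can be repeated for every command file at once. The sketch below assumes, as in the example output above, that each `*.command` file contains just the path to sync; the helper names `command_targets` and `targets_ending_in_files` are hypothetical.

```python
import tempfile
from pathlib import Path


def command_targets(command_dir: Path) -> dict:
    """Map each *.command file name to the path it asks the watcher to sync."""
    return {
        f.name: f.read_text().strip()
        for f in sorted(command_dir.glob("*.command"))
    }


def targets_ending_in_files(command_dir: Path) -> list:
    """Return the command files whose target still ends in a 'files' component."""
    return [
        name
        for name, target in command_targets(command_dir).items()
        if Path(target).name == "files"
    ]


if __name__ == "__main__":
    # Demo with made-up command files in a temporary directory.
    with tempfile.TemporaryDirectory() as tmp:
        command_dir = Path(tmp)
        (command_dir / "old.command").write_text("/w/offline-run-1/files\n")
        (command_dir / "new.command").write_text("/w/offline-run-2\n")
        print(command_targets(command_dir))
        print(targets_ending_in_files(command_dir))
```

To inspect real requests, pass `Path.home() / ".wandb_osh_command_dir"` while no wandb-osh instance is running, since a running watcher removes the files after handling them.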

@barthelemymp
Author

So it is printing the right version:

INFO: wandb-osh v1.0.3, starting to watch /linkhome/rech/genjqx01/urz96ze/.wandb_osh_command_dir
INFO: Syncing /gpfsdswork/projects/rech/mdb/urz96ze/jean-zay-doc/docs/examples/pytorch/mnist/wandb/offline-run-20230118_234022-59ewemor...
wandb: No runs to be synced.
INFO: Syncing /gpfsdswork/projects/rech/mdb/urz96ze/jean-zay-doc/docs/examples/pytorch/mnist/wandb/offline-run-20230118_234022-59ewemor...
wandb: No runs to be synced.

Thank you for your commitment :)

@klieret klieret linked a pull request Jan 18, 2023 that will close this issue
@klieret
Owner

klieret commented Jan 18, 2023

Yes, now it points to the correct paths; that should work.

Are you running any training in parallel? Because if you already synced manually or earlier, maybe there really is nothing to be synced.

Also, can you check in your script's output what wandb tells you to do for syncing: I usually see something like

wandb: Waiting for W&B process to finish... (success).
wandb:
wandb: Run history:
wandb: iterations_since_restore ▁▃▅▆█
wandb:            mean_accuracy ▁▄█▆▇
(...)
wandb:
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /home/kl5675/ray_results/ray-tune-slurm-test/Trainable_f15a6ba8_8_conf_out_channels=9,lr=0.0013,momentum=0.1681_2023-01-18_18-36-34/wandb/offline-run-20230118_183635-f15a6ba8
wandb: Find logs at: ./wandb/offline-run-20230118_183635-f15a6ba8/logs
== Status ==

and the path after You can sync this run should be the same that we see in the output from wandb-osh
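
The comparison described above can be done mechanically on saved log text. The sketch below is an assumption-laden helper, not part of wandb or wandb-osh: the regular expressions are written against the log lines quoted in this thread ("wandb: wandb sync <path>" and "INFO: Syncing <path>...") and would not match paths containing spaces.

```python
import re

# Line printed by wandb at the end of an offline run, as quoted above.
SUGGESTED = re.compile(r"^wandb: wandb sync (?P<path>\S+)\s*$", re.MULTILINE)
# Line printed by the watcher, as quoted earlier in this thread.
WATCHER = re.compile(r"^INFO: Syncing (?P<path>\S+?)\.\.\.\s*$", re.MULTILINE)


def suggested_paths(job_log: str) -> set:
    """Paths that wandb itself suggests passing to 'wandb sync'."""
    return {m.group("path") for m in SUGGESTED.finditer(job_log)}


def watcher_paths(watcher_log: str) -> set:
    """Paths that the watcher reports it is syncing."""
    return {m.group("path") for m in WATCHER.finditer(watcher_log)}


def suggestions_not_synced(job_log: str, watcher_log: str) -> set:
    """Suggested paths that never show up in the watcher output."""
    return suggested_paths(job_log) - watcher_paths(watcher_log)


if __name__ == "__main__":
    job_log = (
        "wandb: You can sync this run to the cloud by running:\n"
        "wandb: wandb sync /x/wandb/offline-run-20230119_013623-fuzjfsll\n"
    )
    watcher_log = (
        "INFO: Syncing /x/wandb/offline-run-20230119_013623-fuzjfsll/files...\n"
        "wandb: No runs to be synced.\n"
    )
    print(suggestions_not_synced(job_log, watcher_log))
```

An empty result means the watcher is using exactly the paths that wandb suggests; the demo input uses a shortened, made-up prefix `/x` and reproduces the `/files` mismatch, so it prints the suggested path.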

@barthelemymp
Author

Here it is :

DEBUG: Wrote command file /linkhome/rech/genjqx01/urz96ze/.wandb_osh_command_dir/422f4a.command
DEBUG: Wrote command file /linkhome/rech/genjqx01/urz96ze/.wandb_osh_command_dir/422f4a.command
DEBUG: Wrote command file /linkhome/rech/genjqx01/urz96ze/.wandb_osh_command_dir/422f4a.command
wandb: Waiting for W&B process to finish... (success).
wandb: 
wandb: Run history:
wandb: avg_a ▁▆▆▇▇▇▇▇█▇████
wandb: avg_e █▃▃▂▁▁▁▁▁▁▁▁▁▁
wandb: 
wandb: Run summary:
wandb: avg_a 9909
wandb: avg_e 0.02842
wandb: 
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /gpfsdswork/projects/rech/mdb/urz96ze/jean-zay-doc/docs/examples/pytorch/mnist/wandb/offline-run-20230119_013623-fuzjfsll
wandb: Find logs at: ./wandb/offline-run-20230119_013623-fuzjfsll/logs

@klieret
Owner

klieret commented Jan 19, 2023

The path shown above has exactly the same structure as the paths shown in the output of wandb-osh itself... I really don't see why this shouldn't work...

If you had wandb-osh running in parallel, it probably also showed exactly the path

/gpfsdswork/projects/rech/mdb/urz96ze/jean-zay-doc/docs/examples/pytorch/mnist/wandb/offline-run-20230119_013623-fuzjfsll

right?
Because that means wandb-osh runs exactly the command that wandb suggests in the log...

@klieret
Owner

klieret commented Jan 19, 2023

Just another guess: Could it be that you have started another instance of wandb-osh in the background? Or something else that already syncs?

In either case, do you see the runs being synced to the wandb web interface? (Though that still wouldn't explain the "No runs to be synced" message.)

@klieret
Owner

klieret commented Jan 19, 2023

OK, I found one more thing: On my laptop, wandb sync requires a path, even when run in the right directory (otherwise it shows exactly the 'No runs to be synced' message), whereas on my cluster it doesn't. It's strange because it's the same version of wandb.

But let me change that in the package real quick.
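
One way to sidestep the difference described above is to always name the run directory explicitly when calling `wandb sync`. The sketch below only builds the command line; the helper name `build_sync_command` is hypothetical, and the thread does not say whether the change in wandb-osh looks like this.

```python
from pathlib import Path


def build_sync_command(run_dir: str) -> list:
    """Return the argv for syncing one offline run, always naming the run directory."""
    # Resolve to an absolute path so the command does not depend on the
    # current working directory of whoever executes it.
    return ["wandb", "sync", str(Path(run_dir).resolve())]


if __name__ == "__main__":
    print(build_sync_command("wandb/offline-run-20230119_013623-fuzjfsll"))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where wandb is installed and logged in.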

@klieret klieret linked a pull request Jan 19, 2023 that will close this issue
@klieret
Owner

klieret commented Jan 19, 2023

OK. Could you try

python3 -m pip install --upgrade wandb-osh

and try one last time? The version should then be 1.0.4

I'm very sorry to use you as a beta tester here ;) But I'm absolutely confident that it will work now :)

@barthelemymp
Author

Clap Clap!!

It works!

@klieret
Owner

klieret commented Jan 19, 2023

Awesome! Thank you so much again :)

@klieret klieret closed this as completed Jan 19, 2023