distributed.scheduler.KilledWorker for a long time series #236
Thanks for the report. Frankly I don't have a good intuition of what's going on here, but (as usual) I suspect @spencerkclark will.
Yes, I think there is a workaround. As part of #172, we added an option of specifying an external distributed client in the main script (i.e. you can start your own cluster with settings to your liking and tell aospy to use it).
Here I'm specifying only 3 processes (this is the maximum number of concurrent calculations that will be done using this worker; I'm decreasing it from 6 and you might need to decrease it more depending on the size of your files).
import os
from distributed import Client

# Connect to the running scheduler via its scheduler file, and pass the
# resulting client to aospy through the main script's execution options.
scheduler_file = os.path.join('/home', os.environ['USER'], 'scheduler.json')

calc_exec_options = dict(
    prompt_verify=True,
    parallelize=True,
    client=Client(scheduler_file=scheduler_file),
    write_to_tar=True,
)
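(As an aside -- a rough alternative sketch with placeholder numbers, not something built into aospy itself: you could also skip the scheduler file and start a small local cluster directly from Python, capping the number of worker processes there.)

# Alternative sketch: start a small local cluster directly from Python and cap
# the number of worker processes to limit how many calculations run at once.
from distributed import Client, LocalCluster

cluster = LocalCluster(n_workers=3, threads_per_worker=1)
client = Client(cluster)  # pass this as the client= entry in calc_exec_options above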
Note: If you want to do more calculations concurrently (even with the large set of files), you could launch more workers on separate analysis nodes using the same command used above (this will give you more memory to play with). In other words, repeat step (2) on a few separate nodes. @chuaxr let me know if that makes sense.
Thanks @spencerahill and @spencerkclark for the replies. @spencerkclark: I didn't have a /home/$USER/scheduler.json, so I copied what was in /home/skc/scheduler.json to /home/xrc. Running step 1 shows the following messages, but nothing else, even after waiting around 10 minutes. Is there something else that needs to be specified in the file?
I'm also trying to figure out why specifying the workers helps in this case. Before, it was trying to load 10 files x 6 variables into memory. Does specifying only 3 processes mean that it will try to load 10 files x 3 variables instead?
@chuaxr sorry that I wasn’t clearer in my instructions. I’m away from the computer right now, so I can’t write a longer message, but two quick things:
And yes your understanding of why I believe this should help is correct. Let me know if you’re able to make more progress now. Thanks!
suspicions confirmed 😄
@spencerkclark Thanks for the clarifications. I opened separate windows for the scheduler, the workers, and for sending the script. The following error message appeared, after which nothing appeared to happen (I killed the job after about 5 minutes). wp25aug is the function I use to do pre-processing in WRF_preprocess.py.
@chuaxr where is the wrf_preprocess module that defines wp25aug located on your system?
@spencerkclark Sorry about that. I had a version of wrf_preprocess in /home/xrc/WRF (which is on my Python path) and another version in the directory (/home/xrc/WRF/aospy_code) from which I normally send my aospy_main scripts (which has the wp25aug function). I'm surprised that it would look there instead of in aospy_code, as it does if I just run the script without using the scheduler. Anyway, copying wrf_preprocess to /home/xrc/WRF resolved that issue. The same KilledWorker error occurred with 3 processes, with many warnings like those below. Presumably this was the memory issue again, so I tried again with 1 process. I still received a bunch of warnings like the following:
The good news is that the script ran without error and the files were produced. The bad news is that the process took around 90 minutes, which is probably longer than running things in separate scripts. Ideally, there would be a way to let aospy know that it only needs one data file for each of the calculations. Or perhaps submitting as a batch job would allow for a larger memory limit?
Sorry, I guess I was a little confused; I thought you wanted to use data from all 10 files? In theory, we've set things up so that loading of data should be lazy (in other words if you only need data from a subset of the files, it will only load data from those files into memory). My impression was that:
Was that wrong?
In this case, if you open another terminal window to a different node on the analysis cluster and launch another worker, it will have the same effect. For example, if you launched workers on two separate analysis nodes, you would have both nodes' memory to work with.
@spencerkclark re our offline conversation: Creating 10 separate base runs means that each calculation only loads the data file it requires instead of all 10. In the process of creating the runs programmatically, it seems that I needed to create variables programmatically as well. I ended up doing the following, where RCE_base_runs is a list of aospy.Run objects:
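(Roughly along these lines -- an illustrative sketch rather than the exact snippet:)

# Illustrative sketch: expose each programmatically created Run as a
# module-level variable so it can be referred to by name in the object library.
for run in RCE_base_runs:
    globals()[run.name] = run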
Is setting variables via globals() a safe thing to do?
@chuaxr could you give a little more context as to why using a list of Run objects led you to set variables via globals()?
Exactly -- this was my hypothesis, because by default xarray eagerly loads the coordinate values from every file in order to compare them when combining the files, and your multi-dimensional coordinates are large.
It turns out this behavior can be modified in xarray version 0.10.0 by using a keyword argument in open_mfdataset.
Perfect 👍
Yes, I believe in that case it would just load the 1D index coordinates (which are much smaller) and infer how the multi-dimensional coordinates could be concatenated based on those.
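For concreteness, a minimal sketch of that keyword in use (the file glob is hypothetical, and coords='minimal' is taken to be the option in question):

import xarray as xr

# Minimal sketch; the glob below is hypothetical.  By default, combining the
# per-segment files can require loading the large multi-dimensional
# coordinates from every file just to compare them; coords='minimal'
# (available since xarray 0.10.0) avoids that, so only the small 1D index
# coordinates are read eagerly and used to line the files up.
ds = xr.open_mfdataset('/path/to/base_run/wrfout_*.nc', coords='minimal')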
@chuaxr FYI we are eventually going to create an But I don't have an estimate of even when we'll start on that in earnest. Hopefully early 2018 but no guarantee 😦
@spencerahill thanks for the note. At the moment, #208 is a larger issue for me -- I'm using a lot of conditional sampling (e.g. vertical velocity corresponding to the 99.99th percentile of precipitation, or precipitation exceeding 500 mm/day). Having to write a separate function for each variable I wish to perform this operation over, not to mention having to hard-code the thresholds, is getting rather unwieldy. If the amount of effort required would be similar, then I would appreciate it if it were focused on #208 rather than on this issue.
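To illustrate the kind of thing I mean (a purely hypothetical sketch -- the function name, variable names, and threshold are made up):

import xarray as xr

# Hypothetical sketch: one parameterized helper instead of a separate
# hard-coded function per variable/threshold combination.
def where_exceeds(var: xr.DataArray, condition_var: xr.DataArray,
                  threshold: float) -> xr.DataArray:
    """Mask var to the points where condition_var exceeds threshold."""
    return var.where(condition_var > threshold)

# e.g. vertical velocity sampled where precipitation exceeds 500 mm/day:
# w_sampled = where_exceeds(w, precip, 500.)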
@chuaxr for sure, this is definitely our top priority moving forward. Like @spencerkclark, I likely won't have time to see it through before the new year, but it will be the first thing we tackle in 2018. Thanks for your patience!
I currently wish to perform averages of model variables (say, condensates) over 100 days' worth of data (~100x100x50 at hourly resolution) in 10-day blocks for a base experiment, as well as 10-day blocks for different experiments. Each 10-day chunk is in its own file.
When attempting to perform an average over 6 variables on the 10 files in the 100-day run using parallelize=True, I obtain the following error:
The calculation completes successfully if I attempt to average the same 6 variables with files from 10 different experiments. The calculation also completes if I simply average 1 variable over the 10 files in the base run. That said, averaging 1 variable over the 10 files in the base run takes longer (about 10 minutes) than averaging the 6 variables with the files from the 10 different experiments (about 2 minutes). Perhaps the files from the same experiment are being combined into one long time-series behind the scenes and therefore leading to a memory error.
Is there a better workaround than to simply submit calculations for each variable separately?