[Bug]: Intermittent hang in "to_netcdf" call from write_netcdf. #250
Things to try next:
> Given a file containing one or more CMIP6 dataset_ids, the script will take the positional parameters "WORK <file_of_dataset_ids>" and, for each given dataset_id, will produce a local subordinate script that is the "minimum script" to conduct end-to-end ([NCO] + e3sm_to_cmip) processing.
>
> NOTE: The parent script (dsm_generate_CMIP6_dsid_list_2.sh) requires that you have […] in your .bashrc so that the various datasm tools that garner the dataset_id-specific parameters (data location, proper mapfile, etc.) can be obtained. The resulting subordinate script, however, contains the fully-expressed command lines (with ALL paths fully qualified) and can be run from anywhere. The results will be placed into various subdirectories under […].
>
> ADDENDUM: If you substitute "TEST" for "WORK" as the first parameter, only 1 year of data will be processed, and the resulting output files will not be moved into the warehouse (they will remain in the tmp/<case_id>/rgr and product directories). Near the top of the script you can set "dryrun=1", in which case the generated subordinate scripts will not be run, only produced and ready to run.
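For concreteness, a hypothetical invocation wrapper (written in Python to match the other examples in this issue): the script name and the "WORK <file_of_dataset_ids>" parameters come from the quote above, while my_dataset_ids.txt is a made-up file name.

```python
import subprocess

# Hypothetical driver for the generator script described above.
# "WORK" runs full end-to-end processing; substituting "TEST" processes
# only 1 year of data and leaves the output out of the warehouse.
subprocess.run(
    ["./dsm_generate_CMIP6_dsid_list_2.sh", "WORK", "my_dataset_ids.txt"],
    check=True,  # raise if the generator script exits nonzero
)
```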
What happened?
An occasional, indefinite hang in "to_netcdf". No errors or exceptions are raised.
What did you expect to happen? Are there any possible answers you came across?
Similar behaviors have been noted in:
pydata/xarray#4710
> “Most of the time, this command works just fine. But in 30% of the cases, this would just... stop and stall. One or more of the workers would simply stop working without coming back or erroring.”
and then:
> If you run this once, it's typically fine. But run it over and over again in a loop, and it'll eventually hang on mfd.to_netcdf. However if I set lock=False then it runs fine every time.
This appears related to a longstanding discussion about whether HDF5 is thread-safe, and correspondingly whether xarray's file locking around HDF5 writes is necessary.
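One way to probe the locking hypothesis (a diagnostic sketch, not from the original report; assumes xarray with dask installed, and uses hypothetical input file names): force dask's single-threaded "synchronous" scheduler so that no two threads ever call into HDF5 concurrently. If the hang disappears under this scheduler, lock contention is the likely culprit rather than the data itself.

```python
import dask
import xarray as xr

# Diagnostic sketch: run the open/write under dask's synchronous scheduler
# so only one thread ever touches the HDF5 library.
with dask.config.set(scheduler="synchronous"):
    mfd = xr.open_mfdataset(["input_0001.nc", "input_0002.nc"])  # hypothetical paths
    mfd.to_netcdf("combined.nc")
    mfd.close()
```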
Minimal Complete Verifiable Example (MVCE)
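No self-contained reproducer was attached to the original report. The following is a minimal sketch of the failure pattern described in the quotes above, with hypothetical file names. Note that the lock keyword was accepted directly by open_mfdataset at the time of the linked issue but is deprecated in recent xarray releases.

```python
import xarray as xr

paths = ["input_0001.nc", "input_0002.nc"]  # hypothetical input files

for i in range(100):
    # With default locking this reportedly hangs in to_netcdf on some
    # iterations; per the quote above, passing lock=False to
    # open_mfdataset avoided the hang every time.
    mfd = xr.open_mfdataset(paths)
    mfd.to_netcdf(f"out_{i:04d}.nc")
    mfd.close()
```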
Relevant log output
Anything else we need to know?
The salient history of local discussion, oldest to newest:
Mar 13, 1:07 PM
Environment
populated config files : /home/bartoletti1/mambaforge/.condarc
conda version : 24.1.2
conda-build version : not installed
python version : 3.10.6.final.0
solver : libmamba (default)
virtual packages : __archspec=1=broadwell
__conda=24.1.2=0
__glibc=2.17=0
__linux=3.10.0=0
__unix=0=0
base environment : /home/bartoletti1/mambaforge (writable)
conda av data dir : /home/bartoletti1/mambaforge/etc/conda
conda av metadata url : None
channel URLs : https://conda.anaconda.org/conda-forge/linux-64
https://conda.anaconda.org/conda-forge/noarch
package cache : /home/bartoletti1/mambaforge/pkgs
/home/bartoletti1/.conda/pkgs
envs directories : /home/bartoletti1/mambaforge/envs
/home/bartoletti1/.conda/envs
platform : linux-64
user-agent : conda/24.1.2 requests/2.31.0 CPython/3.10.6 Linux/3.10.0-1160.108.1.el7.x86_64 rhel/7.9 glibc/2.17 solver/libmamba conda-libmamba-solver/24.1.0 libmambapy/1.5.7
UID:GID : 61843:4061
netrc file : None
offline mode : False