Failure running t-route in ngen worker image #472
Can you make a ...
@hellkite500, sure: Output of ...
Can you try with pyarrow 11? I'm still not sure the underlying issue has been completely addressed upstream.
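(For anyone checking the image, a minimal sketch to confirm what actually got installed; the package list below is just the ones discussed in this thread.)

```python
# Print the installed versions of the packages under suspicion.
from importlib.metadata import version, PackageNotFoundError

for pkg in ("pyarrow", "tables", "netCDF4"):
    try:
        print(f"{pkg}: {version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")
```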
Yeah, I suspect it is either pyarrow or tables. How are you installing tables?
I've tweaked the image to ensure pyarrow 11.0.0 is installed. This was the command to install tables:
I may be installing t-route incorrectly somehow, as I'm getting this error now. I'll continue looking into it.
There is a new package/step needed with recent versions of t-route.
As an aside, I still have trouble installing the new package. I don't think that's contributing to the primary error at this point, but it could be an issue later.
Yeah, it looks like ...
Sorry, was AFK. I just looked at the install script, and it looks like it should be installing that package.
@robertbartel, are you checking out a specific commit or branch?
I may have the issues fixed in the image to get t-route working, though now I am running into some peculiar configuration validation errors:
@yuqiong77 provided the original config I was using for testing. I don't have enough experience with t-route to sanity check things beyond that. Regardless, I am at least going to tweak the configuration and run tests until I get a successful job completion.
Happy New Year! I'm pressed for time to complete some multi-year streamflow simulation runs (either within the ngen image Bobby has helped build or as a post-processing step) for my AMS presentation. My sincerest thanks to you all for looking into the t-route issue.
I'm going to put together at least a draft PR for this to build images for @yuqiong77, but I'm still running into an error. It does appear to be a more t-route-specific problem - perhaps still related to the configuration - and not one with the image.
Doing some limited checking, it looks like this is implying ...
@robertbartel Thanks. I also suspect the config I used (which was based on an example found in the t-route repository a few weeks ago) may have some issues. The example config file looks quite different from the t-route config files I used back in 2022, which did not have a data assimilation section. Looking at the DA section of the current config, I think the only line that may cause an issue is the following:
What if we comment out that line?
There seem to be at least some t-route problems contributing to this; I've opened issue NOAA-OWP/t-route#719 to track them.
I think the problems are in part due to using a t-route v3.0 config with t-route v4.0 execution. If I tweak part of the ...
Then I get past the earlier attribute and validation errors, although now I run into this/these:
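As an aside, a quick way to see which schema a YAML config is written against is to list its top-level sections; a v3-style file and a v4-style file organize these differently. A minimal sketch (the filename is a placeholder):

```python
# List the top-level sections of a t-route YAML config.
import yaml

with open("troute_config.yaml") as f:  # placeholder path
    cfg = yaml.safe_load(f)

for section in cfg:
    print(section)
```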
Hi Bobby, thanks for figuring out the mismatch between the t-route config and execution. I found a v4 example of the config in the repository. Based on that, I modified my config file on UCS6:
I now get the following error:
Any hint?
Indeed, I encountered the same error. There is still some trouble, though. In short, ngen seems to be outputting a bogus line at the end of one of the terminal nexus output files (in particular, ...).
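A simple way to spot the bogus line is to compare each file's final row against the rest; a sketch, assuming the terminal nexus outputs are CSVs matching `tnx-*_output.csv` (adjust the pattern to your run):

```python
# Flag output files whose last line has a different field count
# than the first line (a crude malformed-row check).
import glob

for path in sorted(glob.glob("tnx-*_output.csv")):
    with open(path) as f:
        lines = f.read().splitlines()
    if lines and lines[-1].count(",") != lines[0].count(","):
        print(f"{path}: suspicious final line -> {lines[-1]!r}")
```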
Bobby, which tnx file are you referring to specifically? I opened ...
@yuqiong77, we were having issues with ...
Thanks! I see that now. Although the last line in my file ...
For sure, @yuqiong77! Well, that is odd. I am just jumping back into this thread, so I am not sure if @robertbartel was using a different set of forcing data than you are for your simulations. With the modifications @robertbartel suggested making to the ...
Probably ignore this; I'm just documenting it because it is related. As @robertbartel found out yesterday, the extra line in the ... file leads to this stack trace:

```
2024-01-08 20:00:06,620 INFO [AbstractNetwork.py:125 - assemble_forcings()]: Creating a DataFrame of lateral inflow forcings ...
terminate called after throwing an instance of 'pybind11::error_already_set'
  what():  InvalidIndexError: Reindexing only valid with uniquely valued Index objects

At:
  /usr/local/lib64/python3.9/site-packages/pandas/core/indexes/base.py(3875): get_indexer
  /usr/local/lib64/python3.9/site-packages/pandas/core/reshape/concat.py(676): get_result
  /usr/local/lib64/python3.9/site-packages/pandas/core/reshape/concat.py(393): concat
  /usr/local/lib64/python3.9/site-packages/troute/HYFeaturesNetwork.py(611): build_qlateral_array
  /usr/local/lib64/python3.9/site-packages/troute/AbstractNetwork.py(127): assemble_forcings
  /usr/local/lib/python3.9/site-packages/nwm_routing/__main__.py(121): main_v04
```

In short, the extra line trips up t-route while it is assembling the lateral inflow forcings.
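For reference, the pandas failure above is what you get when the objects being concatenated carry duplicate index labels; a minimal, purely illustrative reproduction (not t-route's actual data):

```python
# pd.concat must align the two indexes; the duplicated label in `a`
# makes that alignment ambiguous and raises InvalidIndexError.
import pandas as pd

a = pd.DataFrame({"q1": [1.0, 2.0]}, index=[10, 10])  # duplicated ID
b = pd.DataFrame({"q2": [3.0, 4.0]}, index=[10, 20])
pd.concat([a, b], axis=1)  # InvalidIndexError: Reindexing only valid ...
```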
@aaraney I just tested with the config file that @robertbartel posted (I think my config had the ...).
@yuqiong77, well at least we are having issues in the same place! From the directory where the NextGen output files are, can you please run ...
@ajkhattak is correct. You could put any string you want in those two entries when running in NextGen. We put compiler directives to skip over the forcing read and output write routines. E.g. https://github.com/NOAA-OWP/noah-owp-modular/blob/04e8ac02532c9a292098f974cdb03aa03bfbfcd6/src/RunModule.f90#L210
@ajkhattak @SnowHydrology That's great to know! I was checking the source code ...
@yuqiong77 Were you ever able to track down the exact time and location (basin ID) of the failure? I'd be interested to see the forcing data corresponding to the failure just in case there is anything interesting in the file.
@SnowHydrology No, I have not been able to track down the exact time and location of the failure. The screen output and error message did not indicate any catchment IDs. What makes the debugging difficult is that the error only occurs 9 to 10 months into the run, which takes close to 20 hours of wall-clock time (in serial mode, since at the moment running in parallel mode produces spurious lines in the ngen output). I'll try to dig a bit deeper to see if I can identify the catchment of the failure.
@yuqiong77 that's likely because the error print out is coming from Noah-OM, which doesn't know which catchment it's running in. Maybe the output files can indicate where Noah-OM failed?
Also tagging @GreyEvenson-NOAA here. The Noah-OM issue was originally described here: #472 (comment)
Some progress on identifying problematic catchments ... For 2933 catchments, the ngen outputs contain NaN values from the very first time step, e.g.,
I checked the forcing files of these catchments and didn't find anything suspicious ... Will keep digging and report back.
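For anyone who wants to repeat the scan, a sketch of the NaN check; the `cat-*.csv` naming is an assumption, so adjust it to your output directory layout:

```python
# Count per-catchment output files whose first time step contains NaN.
import glob
import pandas as pd

bad = []
for path in sorted(glob.glob("cat-*.csv")):
    df = pd.read_csv(path)
    if df.head(1).isna().values.any():
        bad.append(path)

print(f"{len(bad)} catchment outputs have NaN at the first time step")
```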
@yuqiong77 are you also saving the Noah-OM outputs? That would help with diagnosing any issues.
With help from @ajkhattak, I think we have found the issue. The parameter values in the CFE config files for those problematic catchments were not set correctly. Likely there was a bug in the script that I used to populate the parameter values from the regionalization.
But we still need to dig deeper to investigate further; unless I am missing something, I don't think that wrong CFE config file inputs caused the Noah-OM error.
@yuqiong77 / @ajkhattak was the supposed config issue with the ...
@aaraney I doubt ...
Hi all, just wanted to let you know that Ahmad has been helping me debug, and he was able to run ngen successfully for a year with my realization config (CFE + Noah-OM) and BMI files for HUC-01 catchments. He was also able to run successfully without CFE. He carried out both runs outside of the container. @ajkhattak, if I miscommunicated or missed something, please correct me. But all of my runs (with or without CFE) within the container failed at around 9-10 months, with the same error (negative FIRA) in Noah-OM. So I'm wondering if the issue has something to do with the image that @robertbartel helped build, in particular with the Noah-OM module contained in that image?
Thanks for reporting back @yuqiong77! I was afraid it would be difficult to diagnose. Unfortunately, it could be any of a myriad of things: the version of the Noah-OM code @ajkhattak used, the compiler (gcc vs. clang), the optimization level used by the compiler, or even the CPU architecture. @ajkhattak, for starters, did you run the experiment on an arm or x86 machine?
@yuqiong77 Thanks for this update. The error message you got is one of the few checks in Noah-OM that will stop the model. Although the error may manifest as ...
@yuqiong77, thank you for the info. Just to confirm, were your runs always with serial ngen, or did you also experience the errors running parallel ngen? If you haven't tried a parallel ngen scenario because of the current issues with that and t-route, could you try your configs in a parallel run (with routing removed, of course) and see if the error still occurs?
@robertbartel yes, all my latest runs that failed were in serial mode. My earlier runs in the parallel mode did not go far because of the t-route issue we ran into. I will launch a parallel run without routing for a year and report back.
@robertbartel The parallel version ran pretty fast, but unfortunately it still failed at around 7300 time steps (roughly ten months of hourly steps, consistent with where the serial runs failed) with the same error.
Sorry guys, there were some other issues. The use of ... I am going to test on the latest Noah-OM master and see if I can reproduce the error. @GreyEvenson-NOAA, I will reach out to you to discuss the debugging further. Sorry for any confusion...
Afternoon all, I spent some time looking for a problem in the energy balance simulations and the calculation of vegetation temperature and ground (below veg) temperature in EnergyMain and EtFluxModule but didn't find anything. However, I noticed that in the namelist file that Ahmad gave to me, the soil type is specified as 14, which corresponds to 'water'. The simulation ended successfully -- and with realistic ground temp values -- after changing the soil type to something different (I tried several different non-water soil types). Can someone confirm my observation by changing isltyp to 13 or something else and re-running? @yuqiong77: Does this catchment need to be simulated with a water soil type? If so, I will look into the matter further, as the energy and temperature simulations are partly impacted by the properties of the top soil horizon.
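A quick way to find any other catchments configured with the water soil type is to grep the namelists; a sketch, where the `*.input` glob and the `isltyp` spelling are assumptions to adjust for your BMI file layout:

```python
# Report namelist files that set the soil type to 14 ("water").
import glob
import re

water_soil = re.compile(r"isltyp\s*=\s*14\b", re.IGNORECASE)

for path in sorted(glob.glob("*.input")):
    with open(path) as f:
        if water_soil.search(f.read()):
            print(f"{path}: isltyp = 14 (water)")
```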
@robertbartel, this issue might be closeable. @ajkhattak and @GreyEvenson-NOAA tracked down the issue in the Noah-OM namelist, and we're working on a fix in the hydrofabric. Actually, I just noticed that this particular issue has had quite the evolution, so I don't know whether the original issue has been solved; the Noah-OM error has been.
Thanks @SnowHydrology. The scope did get pretty broad, but I think you are correct that this can be closed. To be safe, though, I want to outline what has been uncovered and the status of addressing each aspect:
@aaraney, @yuqiong77, @ajkhattak, is this all correct? Have I missed anything?
@ajkhattak would you be willing to document/describe the workflow on this ngen issue? NOAA-OWP/ngen#723
Attempts to run framework-integrated t-route execution are failing. Initially, these runs were encountering a segmentation fault. After some experimental fix attempts, the errors changed first to a signal 6, then to a signal 7, but t-route still does not run successfully.
The initial suspicion was a problem related to a known NetCDF Python package issue, which is what the early fix attempts targeted (this may still be the root of what's going on).
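A small diagnostic for that class of problem is to compare the HDF5 builds each Python package links against; mismatches between them are a common crash source. A sketch, assuming netCDF4, PyTables, and h5py are all present (the attributes below are the ones these packages expose in recent releases):

```python
# Print the HDF5 library version each package was built against;
# disagreement here can explain segfaults when they are mixed.
import netCDF4
import tables
import h5py

print("netCDF4  HDF5:", netCDF4.__hdf5libversion__)
print("PyTables HDF5:", tables.hdf5_version)
print("h5py     HDF5:", h5py.version.hdf5_version)
```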