-
Notifications
You must be signed in to change notification settings - Fork 64
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update Formulation_Manager.hpp to use boost regex #843
base: master
Are you sure you want to change the base?
Conversation
Hi Josh, and thanks for noticing the issue, looking into it, and proposing this fix. We'll need to check whether We'd definitely like to bring down excessive run times like you've observed. Would you be open to making some other changes and testing them, since you've already got this teed up? |
Yeah of course! Happy to do any additional testing, just let me know what I should try. As for boost regex being header only, I don't think it needs to be built? the dockerfile I'm using just downloads boost.bzip, unzips it, then adds the folder to path before building ngen. I don't have to make install it or anything if that answers the question? my docker patch is just two sed commands with no additional changes to how ngen's built RUN sed -i 's|#include <regex>|#include <boost/regex.hpp>|g' include/realizations/catchment/Formulation_Manager.hpp
RUN sed -i 's/std::regex/boost::regex/g' include/realizations/catchment/Formulation_Manager.hpp the dockerfile is this one if it's of any use/interest :) |
Co-authored-by: Phil Miller - NOAA <[email protected]>
@PhilMiller What about changing if(forcing_prop_map.count("path") != 0){
path = forcing_prop_map.at("path").as_string();
} to if(forcing_prop_map.count("path") != 0){
path = forcing_prop_map.at("path").as_string();
int id_index = path.find("{{id}}");
if (id_index != std::string::npos) {
path = path.replace(id_index, sizeof("{{id}}") - 1, identifier);
}
} means that I can put the {{id}} in the path, then remove file_pattern from my realization so that it hits the early return before any regex is used. |
The idea for Meanwhile, I added lines in the PR description's timing tables for "boost+exact" describing the code as amended to check against literal |
If you're feeling diligent, |
@PhilMiller, thanks for looping me in. I don't foresee this being an issue. @robertbartel, do you feel differently? |
Well, according to this, Boost.Regex is not header-only (someone sanity check me to make sure I'm not missing something). I don't see a problem with adding support for |
I was surprised the timings with the filepattern exact check before attempting the regex weren't measurably faster. Without attaching a debugger I'd guess that the filepattern match only returns true at most once per pattern, but still runs the regex against every other non-exact file match. So in this example it would only prevent regex being used 1/6500th of the time? I'll rerun to confirm and update the tables |
Oh, yeah, if we're going to try for exact match, we should just try opening the specific file, rather than testing against the entire enumerated contents of the directory. |
@@ -463,8 +463,7 @@ namespace realization { | |||
if (directory != nullptr) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about something like this?
I just copied the code that stats the file after the regex match.
I tested it just using std and it matched the performance of adding {{id}} to the path variable, 27s serial 3.2s parallel 56x
if (directory != nullptr) { | |
if (directory != nullptr) { | |
// check if directory + file_pattern is a file before attempting to iterate | |
struct stat st; | |
if (stat((path + filepattern).c_str(), &st) == 0) { | |
if (S_ISREG(st.st_mode)) { | |
return forcing_params( | |
path + filepattern, | |
provider, | |
simulation_time_config.start_time, | |
simulation_time_config.end_time | |
); | |
} | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
correction* 3.7s in 56x parallel. So not as fast as changing path
to accept {{id}}, but probably only because I didn't stat the file when I did that, I just accepted the path without verifying if it existed. the speed difference serially is ~27.2 vs ~27.8 but my testing it just manually running it 5 times so I haven't been including the decimal place in the longer timings
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just noticed an apparent bug in that code. We probably shouldn't be checking S_ISREG
, since that would (I think) exclude symbolic links, which should be legitimate, as long as they point at an existent file.
All of this code kinda suffers from the syndrome of "asking permission, rather than forgiveness" - we should ideally just be trying to directly open the files in question.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I misunderstood the original code a bit too, it's looping over all the files and checking matches, then returning at the first match it finds right? I mistook it to return all matching files. If it's only returning the first match I can have a go at reworking it to
- try and open the file without stat
- if that fails, build a list of every file in the directory
- run regex against that list
- if more than one match is found try to open the first match
The line between performance optimization and hack is blurry but I was wondering if it would be faster to get a string of all filenames, separated by some illegal filename character like hash or pipe, then run regex once against that string to pull out the matching files. Rather than running regex once per file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hopefully this is more clear than my fumbling with suggested changes master...JoshCu:ngen:open_files_first_ask_questions_later
How about this? I modified it to just try and open the file immediately like you suggested, then if that fails fall back to the regex matching, which also just tries to open the file rather than using stat. The bit at the top of the diff is just inverting the if directory != nullptr
logic to unindent the code one layer.
In terms of performance, it's fractionally slower (~3% serial, ~10% Parallelx55 with a lot of deviation) than modifying path to accept {{id}} and bypassing the entire directory opening and regex section #843 (comment)
I'm not a fan of littering file and directory closes around before every possible exit path so if there's some nicer way to consolidate the cleanup I'd love to hear it.
@PhilMiller If this is still of interest, what are the next steps? Options
Any combination of these work, but 2 might not be needed with 3 as it is only marginally quicker than 3 1 | This is the smallest code change, a moderate speed improvement, 2 | This is the fastest, slightly larger change, but doesn't verify that the file exists before accepting it 3 | Slightly slower than 2, much larger change, seems to be a good option assuming my code is ok? I'm happy to redo any changes and put then in a new pull request if needed :) |
As a note, boost.regex is header-only in the general case (at least as of 1.79), as noted here. There is a build requirement for C++03 and below, and when using ICU support, but I don't think that is needed in this PR unless I'm missing something. edit: looks like in 1.76 boost.regex changed to being header-only.
|
Is that why I didn't have to do anything other than building boost for the regex change to work? The highest version in rocky 9 epel is/was 1.75 so I had to build 1.79 from source anyway https://github.com/JoshCu/NGIAB-CloudInfra/blob/7cef18572883f4b9ec04471b650a87f860e488e3/docker/Dockerfile#L28 |
Yeah that'd be my assumption. Another verification that boost.regex doesn't need any additional building: our CI builds the changes successfully, and we only link to the For older boost versions, boost.regex requires the edit: to summarize, I think we are good to use boost.regex without worrying about extra dependencies. |
Thanks, @program--. Looks like they (still) haven't updated Boost's Getting Started page to clearly reflect that and I didn't look hard enough earlier. If what's stated here in the dependency doc is still correct, a non-header-only scenario for this shouldn't be possible. A bit of an aside: I wanted to quickly see if any other Boost libraries listed as "must be built" didn't really belong there, so I took a brief look at Boost.Chrono (the first one in that list). Besides details on Boost.Chrono directly, I came across this interesting line about its Boost.System dependency:
So, buyer beware on any of the Getting Started "must be built" libraries. |
Update the formulation manager to use boost::regex instead of std.
When matching complex regex for finding forcing files, std::regex becomes a significant portion of Ngen init. Especially with large numbers of catchments per core.
For an extreme example, when running serially for wb-479197 and it's upstreams ~6500 catchments. NGen::init takes ~153s, if I use the line "forcing": {"file_pattern": ".*{{id}}.*.csv" in my realization.
removing those wildcards reduces the time to 69 seconds.
For use cases where ngen is being run over a large area, this init time can become a non-insignificant portion of the total runtime, in the serial example, it took 153 seconds to init, 43s to run the models, 19s to route for a 24h simulation.
Changes
Testing
This was all tested on ngiab images using ngen f91e2ea
The run package was generated using the ngiab_preprocessor
24h run of ~6500 catchments, sloth + cfe + noaa-owp-modular + troute, subset from wb-479197 on hydrofabric v20.1
Hardware used
Model name: Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60GHz Thread(s) per core: 2 Core(s) per socket: 14 Socket(s): 2 CPU max MHz: 3600.0000 CPU min MHz: 1200.0000 RAM 2100mhz DDR4 Storage ~500mbs sata ssdRealization.json
boost+exact
first checks if file_pattern matched the file exactly, then runs regex if it does not matchexact path
is where the "path" formulation variable accepts the{{id}}
placeholder allowing for file_pattern to be omitted and no regex being run.Serial
10 mpi ranks
56 mpi ranks
Screenshots
interactive version of the graph here
std, boost, re2, on 10 cores
perf Flamegraph of unmodified fomulation manager running serially with double wildcards
Notes
Future work
I wanted to keep changes minimal because I'm not too familiar with the rest of ngen, but would it be worth modifying the formulation manager further to optionally disable regex completely?
For my use-case, the forcing files are just named cat-1234.csv and there is one per catchment so I don't need to use regex at all after the formulation manager replaces the {{id}} placeholder.