Remove MPI_Barriers before routing to increase speed. #846
base: master
I really like the idea here, but there's a fundamental challenge. As MPI is specified, we can't actually be confident that anything after a call to |
Ok, I've looked at this a little bit more, and here's my tentative suggestion: Rather than having all of the non-0 processes finalize and exit after the catchment simulation is complete, have them all call Is that something you want to try to implement? If not, I can find some time to work up an alternative patch. In the latter case, do you mind if I push it to your branch for this PR? Ultimately, we expect to more deeply integrate t-route with parallelization directly into ngen with BMI. Until that's implemented, though, it's pretty reasonable to find expedient ways to improve performance. |
Yeah I'll give it a go :) I might not get to it for another day or so but I'm happy to do it |
This version works insofar as the waiting MPI threads no longer max out the CPU, but it's tricky to say what I'm actually measuring in terms of performance since this went through NOAA-OWP/t-route#795. I also wasn't sure if |
Testing this on some longer runs (~6500 catchments, cfe + noaa owp, for 6 months) shows that the speedup isn't as dramatic now that the routing has that multiprocessing change in it.
Before MPI change
Finished routing
NGen top-level timings:
NGen::init: 21.621
NGen::simulation: 537.511
NGen::routing: 837.387
real 23m17.312s
user 1836m41.492s
sys 102m31.267s
After change
NGen::init: 21.6584
NGen::simulation: 529.272
NGen::routing: 783.689
real 22m15.342s
user 640m16.918s
sys 164m18.954s
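The before/after figures above can be compared directly. The following is a minimal sketch that only redoes the arithmetic on the numbers already posted; the helper name `to_seconds` and the dictionary layout are illustrative assumptions, not part of ngen or t-route.

```python
import re


def to_seconds(stamp):
    """Convert a `time`-style value such as '23m17.312s' to seconds."""
    match = re.fullmatch(r"(\d+)m([\d.]+)s", stamp)
    if match is None:
        raise ValueError(f"unrecognised time value: {stamp!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


# Values copied from the two runs reported above.
before = {
    "routing": 837.387,
    "real": to_seconds("23m17.312s"),
    "user": to_seconds("1836m41.492s"),
}
after = {
    "routing": 783.689,
    "real": to_seconds("22m15.342s"),
    "user": to_seconds("640m16.918s"),
}

# Ratio > 1 means the "after" run used less of that resource.
ratios = {key: before[key] / after[key] for key in before}

if __name__ == "__main__":
    for key, value in ratios.items():
        print(f"{key}: {value:.2f}x")
```

On these numbers the routing wall time improves only slightly (about 1.07x), while user CPU time drops by close to 2.9x, which is consistent with the waiting MPI processes no longer spinning.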
Performance Testing Results
In summary: it makes routing ~1.3x-1.4x faster
Test Setup
Results
2024-09-04 23:27:24,582 - root - INFO - [__main__.py:340 - main_v04]: ************ TIMING SUMMARY ************
2024-09-04 23:27:24,583 - root - INFO - [__main__.py:341 - main_v04]: ----------------------------------------
2024-09-04 23:27:24,583 - root - INFO - [__main__.py:342 - main_v04]: Network graph construction: 20.63 secs, 17.12 %
2024-09-04 23:27:24,583 - root - INFO - [__main__.py:349 - main_v04]: Forcing array construction: 29.57 secs, 24.55 %
2024-09-04 23:27:24,583 - root - INFO - [__main__.py:356 - main_v04]: Routing computations: 59.15 secs, 49.09 %
2024-09-04 23:27:24,583 - root - INFO - [__main__.py:363 - main_v04]: Output writing: 10.99 secs, 9.12 %
2024-09-04 23:27:24,583 - root - INFO - [__main__.py:370 - main_v04]: ----------------------------------------
2024-09-04 23:27:24,583 - root - INFO - [__main__.py:371 - main_v04]: Total execution time: 120.33999999999999 secs
Finished routing
NGen top-level timings:
NGen::init: 103.262
NGen::simulation: 234.654
NGen::routing: 120.578
That init time can be reduced to 26 seconds with this: master...JoshCu:ngen:open_files_first_ask_questions_later
Recording this here for reference:
Improve ngen-parallel Performance with T-route
Problem Statement
When running ngen-parallel with -n near or equal to core count, T-route's performance is severely degraded. T-route doesn't parallelize well with MPI, and moving the finalize step to before the routing frees up CPU resources for T-route to use different parallelization strategies.
Changes Made
The t-route change is semi-related: it converts a for loop that consumes the majority of t-route's execution time to use multiprocessing. That performance improvement doesn't work while MPI wait is consuming every CPU cycle.
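As a rough illustration of the kind of change described above (this is not t-route's actual code), a serial loop can be handed to a process pool. The function `route_segment` and the sample data are hypothetical stand-ins, and the `fork` start method assumes a POSIX system.

```python
import multiprocessing as mp


def route_segment(flows):
    """Hypothetical stand-in for the per-iteration work of a hot loop."""
    return sum(f * f for f in flows)


def route_serial(segments):
    # Original shape: one process walks every segment in turn.
    return [route_segment(s) for s in segments]


def route_parallel(segments, workers=2):
    # Same work, spread over worker processes. This only pays off when
    # spare cores exist, i.e. not while waiting MPI processes occupy them.
    ctx = mp.get_context("fork")  # assumption: POSIX platform
    with ctx.Pool(processes=workers) as pool:
        # pool.map preserves input order, so results match the serial loop.
        return pool.map(route_segment, segments)


if __name__ == "__main__":
    segments = [[i, i + 1, i + 2] for i in range(8)]
    assert route_parallel(segments) == route_serial(segments)
    print(route_parallel(segments))
```

The worker processes compete for the same cores as any MPI processes that are still alive, which is why finalizing the non-zero ranks before routing matters for this kind of parallelism.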
Performance Results
Testing setup: ~6500 catchments, 24 timesteps, dual Xeon 28C 56T total (same as #843)
96 was performed on a different 96 core machine
56 were all performed on a 56 core machine
Explanation
Future Work
Testing
Performance Visualization
Perf flamegraph of the entire ngen run (unpatched), should be interactive if downloaded
Additional Notes
return 0; right after finalize for mpi_rank != 0. I thought finalize would kill/return the subprocesses, but without explicitly returning them I got segfaults. Recommendations seem to be that as little as possible should be performed after calling finalize, but seeing as all computation after that point is done by one thread, I can just return the others?
Next Steps