Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MPI Scaling fix #894

Open
wants to merge 12 commits into
base: master
Choose a base branch
from
Open

MPI Scaling fix #894

wants to merge 12 commits into from

Conversation

JoshCu
Copy link

@JoshCu JoshCu commented Oct 28, 2024

TLDR - netcdf cache and partition remotes hinder mpi scaling

The netcdf per feature data provider cache is expensive to create and the performance penalty increases with mpi rank count. Depending on the machine, it can be faster to have it disabled at just 2 ranks .
The mid-simulation mpi communication to accumulate nexus outputs adds a lot of overhead that can be deferred to the end of the simulation for another large speedup.
A 916 second simulation takes 401s on 96 ranks without these changes and 21s with these changes both applied.
With one netcdf file per rank, cache enabled, and remotes disabled, the speedup on 96 cores over serial is now ~80x. Without these changes 96 cores is ~1x-2x faster than serial.

There are a few different things included in this PR that probably could and should be implemented better than the way I've done them, but I thought it would be better share what I've got so far before revising it further.

Major Differences

Netcdf Cache

The reason this is so expensive is that it caches the value for every catchment at a given timestep for a given variable. Serially this seems to make things run 1.3x-1.5x faster than with no cache. When using mpi partitions, each process reads the values for all of the catchments, including catchments not in the current mpi rank. The result is a huge amount of IO that gets worse with the number of mpi ranks and the number of catchments.
image

Profiling showed ~80% of the simulation runtime was read syscalls for 55 mpi ranks. Reading from a netcdf was taking 180s but per catchment csv files was taking 50s.
I tested just duplicating the netcdf for each rank which was worse ~214s and then splitting the original input up by rank so only the relevant data would be read. Splitting up the netcdf reduced the simulation time to 39s, disabling the cache entirely also gave me a 39s runtime.
I think the machine I did most of my testing on has some large system level cache as serially the ngen cache only gives me a <7% performance boost on inputs <1Gb.

The best performance seems to be splitting up the netcdf by rank and keeping cache enabled, but it's not that much faster than just disabling the cache.

Remotes partitioning

This might be a bit specific to how I'm running ngen, but for cfe+noah-owp-modular it seems like I can just treat each catchment as an individual cell unaffected by those around it and the only communication needed is to sum up the outputs in cases where more than one catchment feeds into a nexus.
Removing the remotes saves on the overhead of the mpi communication and decouples the ranks. If a nexus output is being written from multiple ranks it will corrupt the output. To fix this I just add the mpi rank number to the output file name and make sure each nexus appears at most once per rank. I do this by sorting the catchment to nexus pairs by nexus then assigning them round robin to the partitions. This will break if the number of partitions is smaller than the maximum number of catchments flowing to a single nexus in your network.
The additional post processing to rename the output files and sum up the different ranks is a tiny fraction of the time saved by partitioning this way.
For now this is just a python script as it was much faster for me to write.
As is the script to generate the round robin partitions.

Minor changes

option to disable catchment outputs

I couldn't figure out if there was another way to disable this so I added a root level realization config option "write_catchment_output" that sets the catchment stream out to /dev/null.

MPI Barrier

I noticed before in #846 that the MPI barriers during t-route execution were slowing down troute a bit because troute tries to parallelize without using MPI. A similar but less noticeable effect is happening during the model execution phase too. It's less noticeable because there aren't other processes trying to use the cores, but when a rank finishes the poll rate of the mpi barrier busy wait is so high that it maxes out that core. On my machine this was causing the overall clock speed to drop and the simulation was taking ~5% longer to execute. This seems like something that should be a configurable mpi build arg or environment variable but I couldn't find it so I replaced the barriers with non-blocking Ibarriers that poll every 100ms.

{{partition_id}} {{id}} substitution in forcing input path

As part of the testing I did with one netcdf per mpi rank, I added a rank id placeholder to the "path" variable in the forcing provider config. I also added {{id}} to it to speed up the csv init load time similar to #843.

Testing

To make this easier to test I built a multi-platform docker image called joshcu/ngiab_dev from this dockerfile https://github.com/JoshCu/NGIAB-CloudInfra/blob/datastream/docker/Dockerfile
and put together some example data for 157 catchments 84 nexuses for one year.
The partitioning is handled by a bash script that calls a python script in the docker file and troute won't work with remote partitioning disabled. These things can be run manually though with scripts inside the container.

# Running the docker container
docker run --rm -it -v "/mnt/raid0/ngiab_netcdf/speedup_example:/ngen/ngen/data" joshcu/ngiab_dev /ngen/ngen/data/ auto <num_desired_partitions>
# I had io performance issues on mac related to bind mounts and did the following as a workaround
# Run the container
docker run --rm -it joshcu/ngiab_dev
# make the data destination directory 
mkdir /ngen/ngen/
# copy the speedup_example data to /ngen/ngen/data with docker cp

# finally, run the ngen run helper script in the container
/ngen/HelloNGEN.sh /ngen/ngen/data auto <num_desired_partitions>

# Manually running
# Partition generation
python /dmod/utils/partitioning/round_robin_partioning.py -n <num_partitions> <hydrofabric_path> <output_filename>
# Merging the per rank output files if remotes are disabled
python /dmod/utils/data_conversion/no_remote_merging.py /ngen/ngen/data/outputs/ngen/
# Running t-route manually
python -m nwm_routing -V4 -f config/troute.yaml

Configuration in realization.json

        "forcing": {
            "path": "./forcings/by_catchment/forcings.nc",
            "provider": "NetCDF",
            "enable_cache": false
        }
    },
    "time": {
        "start_time": "2010-01-01 00:00:00",
        "end_time": "2011-01-01 00:00:00",
        "output_interval": 3600
    },
    "write_catchment_output": false,
    "remotes_enabled": false, # routing will not work with this set to false
    "output_root": "/ngen/ngen/data/outputs/ngen"
}

157 cats, 1 year

All of the x86 tests were on ubuntu 24 machines running docker using -v bind mounts

x86 96 core dual Xeon Platinum 8160

serial cache = 916s
serial no-cache = 924s

remotes no-remotes
cache 401s 386s
no-cache 76s 21s

x86 56 core dual Xeon E5-2697

serial cache = 528s
serial no-cache = 771s

remotes no-remotes
cache 215s 146s
no-cache 129s 35s

x86 24 core i9-12900

serial cache = 180s
serial no-cache = 256s

remotes no-remotes
cache 146s 94s
no-cache 106s 33.5s

arm64 mac mini m2

10 ranks 2-cores bind mount

disk io seems to be terrible with docker -v bind mounts on mac
and kubernetes was only running with 2 cores
serial cache = 301s
serial no-cache = 375s

remotes no-remotes
cache 358s 358s
no-cache 263s 249s

10 ranks 2-cores, in memory

I started up the container then used docker cp to load the input files
serial cache = 233s
serial no-cache = 299s

remotes no-remotes
cache 309s 287s
no-cache 207s 157s

8 ranks 8 cores, in memory

remotes no-remotes
cache 190s 68s
no-cache 169s 50s

55 cats 31 nexus 10 years

all done on the 96 core machine
serially : ~1700s

31 ranks remotes no-remotes
cache 838s 121s
no-cache N/A 76s
csv files N/A 94s
55 ranks remotes no-remotes
cache N/A 187s
no-cache N/A 39s
csv files N/A 50s

~1600 cats 2 years

numbers prefixed with ~ were run for 3 months then multiplied by 8
serially with cache ~324m
serially with no-cache ~399m

96 ranks remotes no-remotes
cache ~316m ~23.5m
no-cache ~11.7m 232s
cache + partitioned nc N/A 184s

@JoshCu
Copy link
Author

JoshCu commented Nov 4, 2024

I did some additional testing for 14k catchments on 96 cores and it took 570s to run the simulation with cache and remotes disabled.
The original python script to merge them took too long so I parallelised it and it now takes ~30s.
I also tried it in rust for fun here. Although it's not that useful here except maybe for demonstrating that the per rank files + reformatting could be quite fast if integrated into ngen in c++

@JoshCu JoshCu mentioned this pull request Nov 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants