parallel performance with OpenMP #2

Closed
capitalaslash opened this issue Apr 2, 2024 · 3 comments

@capitalaslash
Contributor

Describe the bug
Parallel performance with OpenMP is severely degraded with respect to the serial run.

To Reproduce
Steps to reproduce the behavior:

I ran the same test case (cavity-amr) in serial and in parallel.
I configured with

-DUSE_ACC=OFF
-DUSE_OMP=ON
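
i.e. a configure command roughly along these lines (reconstructed for illustration, not the exact command used):

    mkdir build && cd build
    # reconstructed configure line for illustration only -- not the exact command that was run
    cmake -DCMAKE_BUILD_TYPE=release -DCMAKE_INSTALL_PREFIX=.. -DUSE_ACC=OFF -DUSE_OMP=ON ..
    make && make install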

I tested the parallel performance with several applications and configurations to try to understand why the parallel runs are so much slower, but to no avail.
Do you have any idea why this happens?
Should I try to activate USE_ACC in the configuration?

Expected behavior
At least similar performance, if not an improvement in computational time.

Screenshots

$> time ./test.sh -n 1 -c examples/cavity-amr/cavity > cavity-amr-serial.log 

real    0m1.364s
user    0m5.955s
sys     0m0.170s
$> time ./test.sh -n 2 -c examples/cavity-amr/cavity > cavity-amr-omp.log 

real    44m6.228s
user    515m37.217s
sys     0m21.519s

Desktop (please complete the following information):

  • OS: OpenSUSE
  • Version: Leap 15.5

Additional context
None

@danielabdi-noaa
Collaborator

danielabdi-noaa commented Apr 2, 2024

Hi @capitalaslash

I think there are several factors at play here

  • You don't need to compile with -DUSE_OMP=ON to test MPI parallelization, so you can use the default build.
  • The cavity example does not have enough cells to show a speedup for either the MPI or the OpenMP implementation, but it shouldn't take 44 minutes.

Note that when you use the MPI+OpenMP combo by turning on -DUSE_OMP=ON, the number of threads specified is per MPI rank, so make sure you set OMP_NUM_THREADS to a specific value first. If it is not set, I think each MPI rank may try to use all available threads, oversubscribing the cores and severely degrading performance. Pure MPI is recommended for most applications since it is faster than the OpenMP implementation (as is the case in other CFD codes as well); the hybrid mode mainly helps reduce MPI communication overhead and improve scalability when a very large number of MPI ranks is used, which is not the case for most people.
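
As a rough illustration (the core counts here are hypothetical), a sane hybrid launch keeps ranks × threads at or below the number of physical cores:

    # hypothetical 8-core machine: 2 MPI ranks x 4 OpenMP threads each = 8 threads total
    export OMP_NUM_THREADS=4
    ./test.sh -n 2 -c examples/cavity/cavity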

What I suggest is to first test the MPI implementation using the default build steps outlined below:

    mkdir build && cd build
    cmake -DCMAKE_BUILD_TYPE=release -DCMAKE_INSTALL_PREFIX=.. ..
    make && make install

You can then test by passing different values to the -n option of the test.sh script; you don't have to worry about setting OMP_NUM_THREADS. You should get reasonable times in all cases, but I am certain the MPI runs will be slower for this simple test case, which does not have enough grid points to show scalability.
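
For example, a simple sweep over the rank count could look like this (the rank counts are illustrative and assume the machine has at least that many cores):

    # MPI-only scaling check on the original test case
    ./test.sh -n 1 -c examples/cavity-amr/cavity
    ./test.sh -n 2 -c examples/cavity-amr/cavity
    ./test.sh -n 4 -c examples/cavity-amr/cavity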

Then, to compare the OpenMP implementation against the MPI one, make another build:

    mkdir build-omp && cd build-omp
    cmake -DCMAKE_BUILD_TYPE=release -DCMAKE_INSTALL_PREFIX=.. -DUSE_OMP=ON ..
    make && make install

When running the OpenMP build, make sure to pass -n 1, since that option specifies the number of MPI ranks. Then, before calling the script, set export OMP_NUM_THREADS=1, 2, etc. The OpenMP implementation will certainly be slower than the MPI implementation, but it has its value in MPI+OpenMP mode with a very large number of MPI ranks.


Here are some test runs to check the scalability of the MPI version first, i.e. the first build.
Let us use the cavity test case (not the AMR one this time). Modify the examples/cavity/cavity file to increase the number of grid cells from 20x20 to 100x100 as follows:

-8{0 1 2 3 4 5 6 7} linear 3{20 20 1}
+8{0 1 2 3 4 5 6 7} linear 3{100 100 1}
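
If you prefer a one-liner, the same edit can be applied with something like the following (this assumes GNU sed and that the pattern occurs only once in the file):

    # bump the grid from 20x20 to 100x100 cells in the cavity mesh definition
    sed -i 's/3{20 20 1}/3{100 100 1}/' examples/cavity/cavity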

Run with a single MPI rank:

$ ./test.sh -n 1 -c examples/cavity/cavity 
...
9018 [0] Time 5.000000
9025 [0] SYMM-FULL-SSOR-PCG :Iterations 1 Initial Residual 1.47888e-11 Final Residual 1.14971e-11
9027 [0] SYMM-FULL-SSOR-PCG :Iterations 1 Initial Residual 1.44703e-11 Final Residual 1.12991e-11
9073 [0] Exiting application run with 1 processes

It takes about 9073 milliseconds.
Run with 2 MPI ranks:

$ ./test.sh -n 2 -c examples/cavity/cavity
....
4623 [0] Time 5.000000
4626 [0] SYMM-FULL-SSOR-PCG :Iterations 1 Initial Residual 3.55206e-11 Final Residual 3.16645e-11
4628 [0] SYMM-FULL-SSOR-PCG :Iterations 1 Initial Residual 3.38580e-11 Final Residual 3.00837e-11
4672 [0] Exiting application run with 2 processes

It takes about 4672 milliseconds, for a speedup of 1.94x out of a possible 2x, which is good.


Let's do the same with the OpenMP implementation. Make sure to use the build-omp binaries by doing:

cd build-omp && make install

Then we set the number of threads to 2:

export OMP_NUM_THREADS=2

We can now run the test case with 1 MPI rank:

$ ./test.sh -n 1 -c examples/cavity/cavity
....
6686 [0] Time 5.000000
6691 [0] SYMM-FULL-SSOR-PCG :Iterations 1 Initial Residual 1.47888e-11 Final Residual 1.14971e-11
6693 [0] SYMM-FULL-SSOR-PCG :Iterations 1 Initial Residual 1.44703e-11 Final Residual 1.12991e-11
6739 [0] Exiting application run with 1 processes

It took about 6739 milliseconds. This is slower than the MPI implementation but still faster than the serial run, with a speedup of 1.34x.

I hope this helps. Please let me know if you encounter additional issues.

Daniel

@danielabdi-noaa
Collaborator

@capitalaslash I have now added basically what I described above in the README file so that others will not be confused by it. Thank you!

@capitalaslash
Contributor Author

OK, so I was already running with OpenMP on all my cores, and oversubscribing with MPI led to the degraded performance.
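
In other words, something like this was happening (the core count is illustrative):

    # e.g. a 12-core machine with OMP_NUM_THREADS left unset (defaults to all cores):
    # 2 MPI ranks x 12 OpenMP threads = 24 threads competing for 12 cores
    ./test.sh -n 2 -c examples/cavity-amr/cavity
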
Thank you for updating the description; it is now much clearer in my opinion.
This issue can be closed.
