parallel performance with OpenMP #2

Closed
capitalaslash opened this issue Apr 2, 2024 · 3 comments

@capitalaslash
Contributor

Describe the bug
Parallel performance with OpenMP is severely degraded with respect to the serial run.

To Reproduce
Steps to reproduce the behavior:

I ran the same test case (cavity-amr) in serial and in parallel.
I configured with

-DUSE_ACC=OFF
-DUSE_OMP=ON
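
i.e. a configure command roughly along these lines (reconstructed for illustration, not the exact command used):

    mkdir build && cd build
    # reconstructed configure line for illustration only -- not the exact command that was run
    cmake -DCMAKE_BUILD_TYPE=release -DCMAKE_INSTALL_PREFIX=.. -DUSE_ACC=OFF -DUSE_OMP=ON ..
    make && make install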

I tested the parallel performance with several applications and configurations to try to understand why the parallel runs are so much slower, but to no avail.
Do you have any idea why this happens?
Should I try to activate USE_ACC in the configuration?

Expected behavior
At least similar performance, if not an improvement in computational time.

Screenshots

$> time ./test.sh -n 1 -c examples/cavity-amr/cavity > cavity-amr-serial.log 

real    0m1.364s
user    0m5.955s
sys     0m0.170s
$> time ./test.sh -n 2 -c examples/cavity-amr/cavity > cavity-amr-omp.log 

real    44m6.228s
user    515m37.217s
sys     0m21.519s

Desktop (please complete the following information):

  • OS: OpenSUSE
  • Version: Leap 15.5

Additional context
None

@danielabdi-noaa
Collaborator

danielabdi-noaa commented Apr 2, 2024

Hi @capitalaslash

I think there are several factors at play here

  • You don't need to compile with -DUSE_OMP=ON to test MPI parallelization, so you can use the default build.
  • The cavity example does not have enough cells to show a speedup for either the MPI or the OpenMP implementation, but it shouldn't take 44 minutes.

Note that when you use the MPI+OpenMP combo by turning on -DUSE_OMP=ON, the number of threads specified is per MPI rank, so make sure you set OMP_NUM_THREADS to a specific value first. If it is not set, I think each MPI rank may try to use all available threads, oversubscribing the cores and severely degrading performance. Pure MPI is recommended for most applications since it is faster than the OpenMP implementation (as is the case in other CFD codes as well); the hybrid mode mainly helps reduce MPI communication overhead and improve scalability when a very large number of MPI ranks is used, which is not the case for most people.
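
As a rough illustration (the core counts here are hypothetical), a sane hybrid launch keeps ranks × threads at or below the number of physical cores:

    # hypothetical 8-core machine: 2 MPI ranks x 4 OpenMP threads each = 8 threads total
    export OMP_NUM_THREADS=4
    ./test.sh -n 2 -c examples/cavity/cavity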

What I suggest is to first test the MPI implementation using the default build steps outlined below:

    mkdir build && cd build
    cmake -DCMAKE_BUILD_TYPE=release -DCMAKE_INSTALL_PREFIX=.. ..
    make && make install

You can then test by passing different values to the -n option of the test.sh script; you don't have to worry about setting OMP_NUM_THREADS. You should get reasonable times in all cases, but I am certain the MPI runs will be slower for this simple test case, which does not have enough grid points to show scalability.
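
For example, a simple sweep over the rank count could look like this (the rank counts are illustrative and assume the machine has at least that many cores):

    # MPI-only scaling check on the original test case
    ./test.sh -n 1 -c examples/cavity-amr/cavity
    ./test.sh -n 2 -c examples/cavity-amr/cavity
    ./test.sh -n 4 -c examples/cavity-amr/cavity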

Then, to compare the OpenMP implementation against the MPI one, make another build:

    mkdir build-omp && cd build-omp
    cmake -DCMAKE_BUILD_TYPE=release -DCMAKE_INSTALL_PREFIX=.. -DUSE_OMP=ON ..
    make && make install

When running the OpenMP build, make sure to pass -n 1, since that option specifies the number of MPI ranks. Then, before calling the script, set export OMP_NUM_THREADS=1, 2, etc. The OpenMP implementation will certainly be slower than the MPI implementation, but it has its value in MPI+OpenMP mode with a very large number of MPI ranks.


Here are some test runs to check the scalability of the MPI version first, i.e. the first build.
Let us use the cavity test case (not the AMR one this time). Modify the examples/cavity/cavity file to increase the number of grid cells from 20x20 to 100x100 as follows:

-8{0 1 2 3 4 5 6 7} linear 3{20 20 1}
+8{0 1 2 3 4 5 6 7} linear 3{100 100 1}
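
If you prefer a one-liner, the same edit can be applied with something like the following (this assumes GNU sed and that the pattern occurs only once in the file):

    # bump the grid from 20x20 to 100x100 cells in the cavity mesh definition
    sed -i 's/3{20 20 1}/3{100 100 1}/' examples/cavity/cavity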

Run with a single MPI rank:

$ ./test.sh -n 1 -c examples/cavity/cavity 
...
9018 [0] Time 5.000000
9025 [0] SYMM-FULL-SSOR-PCG :Iterations 1 Initial Residual 1.47888e-11 Final Residual 1.14971e-11
9027 [0] SYMM-FULL-SSOR-PCG :Iterations 1 Initial Residual 1.44703e-11 Final Residual 1.12991e-11
9073 [0] Exiting application run with 1 processes

It takes about 9073 milliseconds.
Run with 2 MPI ranks:

$ ./test.sh -n 2 -c examples/cavity/cavity
....
4623 [0] Time 5.000000
4626 [0] SYMM-FULL-SSOR-PCG :Iterations 1 Initial Residual 3.55206e-11 Final Residual 3.16645e-11
4628 [0] SYMM-FULL-SSOR-PCG :Iterations 1 Initial Residual 3.38580e-11 Final Residual 3.00837e-11
4672 [0] Exiting application run with 2 processes

It takes about 4672 milliseconds, for a speedup of 1.94x out of a possible 2x, which is good.


Let's do the same with the OpenMP implementation. Make sure to use the build-omp binaries by doing:

cd build-omp && make install

Then we set the number of threads to 2:

export OMP_NUM_THREADS=2

We can now run the test case with 1 MPI rank:

$ ./test.sh -n 1 -c examples/cavity/cavity
....
6686 [0] Time 5.000000
6691 [0] SYMM-FULL-SSOR-PCG :Iterations 1 Initial Residual 1.47888e-11 Final Residual 1.14971e-11
6693 [0] SYMM-FULL-SSOR-PCG :Iterations 1 Initial Residual 1.44703e-11 Final Residual 1.12991e-11
6739 [0] Exiting application run with 1 processes

It took about 6739 milliseconds. This is slower than the MPI implementation but still faster than the serial run, with a speedup of 1.34x.

I hope this helps. Please let me know if you encounter additional issues.

Daniel

@danielabdi-noaa
Collaborator

@capitalaslash I have now added basically what I described above in the README file so that others will not be confused by it. Thank you!

@capitalaslash
Contributor Author

OK, so I was already running with OpenMP on all my cores, and oversubscribing with MPI led to the degraded performance.
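
In other words, something like this was happening (the core count is illustrative):

    # e.g. a 12-core machine with OMP_NUM_THREADS left unset (defaults to all cores):
    # 2 MPI ranks x 12 OpenMP threads = 24 threads competing for 12 cores
    ./test.sh -n 2 -c examples/cavity-amr/cavity
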
Thank you for updating the description; it is now much clearer in my opinion.
This issue can be closed.
