Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Halo module #42

Merged
merged 3 commits into from
Jan 25, 2024
Merged

Conversation

brian-oneill
Copy link

Adds halo exchange capability for supported array types, including:

  • Halo module with classes and methods to perform halo exchanges
  • documentation of the Halo module
  • a unit test that executes and verifies a successful halo exchange for test arrays of each supported type

Changes have been built and tested on Frontier.

Checklist

  • Documentation:
    • Design document has been generated and added to the docs
    • User's Guide has been updated
    • Developer's Guide has been updated
    • Documentation has been built locally and changes look as expected
  • Testing
    • A comment in the PR documents testing used to verify the changes including any tests that are added/modified/impacted.
    • CTest unit tests for new features have been added per the approved design.
    • Unit tests have passed. Please provide a relevant CDash build entry for verification.

@brian-oneill
Copy link
Author

Source code fails the linting process, I committed it with 'git commit --no-verify' to get the PR started. Will fix the formatting issues tomorrow.

Copy link

@philipwjones philipwjones left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@brian-oneill In addition to the changes mentioned here, the test code is hanging on Chrysalis with Intel, so I suspect we have a race condition somewhere. I'm trying to track it down and will let you know.

components/omega/src/base/Halo.h Outdated Show resolved Hide resolved
components/omega/src/base/Halo.cpp Outdated Show resolved Hide resolved
@philipwjones
Copy link

Another comment. In your test routine, you copy the reference result into the test array. I realize that it's an easy way to fill the test array in a generic test function, but it leaves open the possibility that the test will pass if the Halo doesn't do anything. I think you want to initialize the test array in the Owned region only. That might mean you pass a mask into the test routine to zero out the halo entries before the halo update. I actually did a quick alternative test that only fills the owned region for one of the 1-d tests and it still passes, so I think the routines are still working fine except for the random failures I'm hitting on Chrysalis. But it would be a more robust test if you fill the halo with garbage or zeros before the halo call.

@brian-oneill
Copy link
Author

Formatting issues are fixed, linting CI shouldn't complain now. I'll hold off on pushing until any other needed updates are ready. Working on changes to the unit test now. I'll also try the test on Chrysalis and check out these random FAILs.

@philipwjones
Copy link

@brian-oneill Returning to this today, I can no longer reproduce the random fails on Chrysalis, so not sure what was going on there. If it's a synchronization thing, it might pop up again later, but I guess we'll deal with it then. So other than the temporary renaming of the srch vector routine and the removal of the return statement in the test driver, it all seems to be working now.

@grnydawn
Copy link

grnydawn commented Jan 7, 2024

@brian-oneill , @philip W. Jones,

I encountered an error while running testHalo on Chrysalis. The error message is as follows:

../src/libOmegaLib.a(Halo.cpp.o): In function `OMEGA::srchVector(std::vector<int, std::allocator<int> > const&, int)':
/home/ac.kimy/repos/github/Omega.Brian/components/omega/src/base/Halo.cpp:29: multiple definition of `OMEGA::srchVector(std::vector<int, std::allocator<int> > const&, int)'
../src/libOmegaLib.a(Decomp.cpp.o):/home/ac.kimy/repos/github/Omega.Brian/components/omega/src/base/Decomp.cpp:63: first defined here

It appears that the srchVector function, with the same signature, exists in both Halo.cpp and Decomp.cpp. Interestingly, this issue did not occur on Frontier.

@grnydawn
Copy link

grnydawn commented Jan 9, 2024

@brian-oneill , To enable CTest to identify failed test cases, "HALO_TEST" needs to be added in the "set_tests_properties" function within the CMake configuration at ${E3SM_HOME}/components/omega/test/CMakeListss.cmake, as follows:

set_tests_properties(
DATA_TYPES_TEST
MACHINE_ENV_TEST
BROADCAST_TEST
LOGGING_TEST
HALO_TEST
YAKL_TEST
PROPERTIES FAIL_REGULAR_EXPRESSION "FAIL"
)

brian-oneill and others added 2 commits January 22, 2024 11:21
2. Brought up to date with CMake changes
3. Fixed issues preventing build on Chrysalis
4. Made requested unit test changes
5. Added MPI_Wait calls after non-blocking sends, should resolve synchronization
   issue noticed on Chrysalis
6. Update to Halo class to properly account for number of halo layers in edge
   and vertex index spaces
@brian-oneill
Copy link
Author

PR should be good to go now. I did encounter some of the random FAILs on Chrysalis, I added some MPI_Wait calls after the sends are initiated that I think should resolve the issue. I didn't encounter any FAILs after making this change.

@philipwjones On a separate note, the IO test is broken now, which I suspect needs to be brought up to date with the memory cleanup changes.

@philipwjones
Copy link

Thanks @brian-oneill I'll re-review today. And yes, I'm working on several IOTest fixes, not only the memory cleanup but some others that showed up when I fixed the funky syntax that Maciej discovered.

@philipwjones philipwjones self-requested a review January 23, 2024 17:37
Copy link

@philipwjones philipwjones left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This all looks good now and all requested changes have been made. Unit tests pass on Chrysalis. Frontier down for maintenance but past issues were all on Chrysalis anyway.

@mark-petersen
Copy link

mark-petersen commented Jan 23, 2024

Tested successfully on chrysalis and perlmutter as follows

######### perlmutter
git submodule update --init --recursive externals/YAKL externals/ekat externals/scorpio cime
cd components/omega/
module load cmake
mkdir build
cd build
export PARMETIS_ROOT=/global/cfs/cdirs/e3sm/software/polaris/pm-cpu/spack/dev_polaris_0_3_0_gnu_mpich/var/spack/environments/dev_polaris_0_3_0_gnu_mpich/.spack-env/view
cmake \
   -DOMEGA_CIME_COMPILER=intel \
   -DOMEGA_CIME_MACHINE=pm-cpu \
   -DOMEGA_PARMETIS_ROOT=${PARMETIS_ROOT}\
   -DOMEGA_BUILD_TEST=ON \
   -Wno-dev \
   -S .. -B .
ln -s /global/cfs/cdirs/e3sm/inputdata/ocn/mpas-o/oQU240/ocean.QU.240km.151209.nc test/OmegaMesh.nc
./omega_build.sh

salloc --nodes 1 --qos interactive --time 01:00:00 --constraint cpu --account=e3sm
./omega_ctest.sh

######### chrysalis
git submodule update --init --recursive externals/YAKL externals/ekat externals/scorpio cime
cd components/omega/
module load cmake
mkdir build
cd build
export PARMETIS_ROOT=/lcrc/group/e3sm/ac.boneill/intel-openmpi-libs
cmake \
   -DOMEGA_CIME_COMPILER=intel \
   -DOMEGA_CIME_MACHINE=chrysalis \
   -DOMEGA_PARMETIS_ROOT=${PARMETIS_ROOT}\
   -DOMEGA_BUILD_TEST=ON \
   -Wno-dev \
   -S .. -B .
ln -isf /home/ac.mpetersen/inputdata/ocn/mpas-o/oQU240/ocean.QU.240km.151209.nc test/OmegaMesh.nc
./omega_build.sh

srun -p debug -N 1 -t 1:00:00 --pty bash
./omega_ctest.sh

with results:

Test project /global/homes/m/mpeterse/repos/omega/pr_halo_bryan/components/omega/build
    Start 1: DATA_TYPES_TEST
1/8 Test #1: DATA_TYPES_TEST ..................   Passed    0.69 sec
    Start 2: MACHINE_ENV_TEST
2/8 Test #2: MACHINE_ENV_TEST .................   Passed    0.51 sec
    Start 3: BROADCAST_TEST
3/8 Test #3: BROADCAST_TEST ...................   Passed    0.55 sec
    Start 4: LOGGING_TEST
4/8 Test #4: LOGGING_TEST .....................   Passed    0.08 sec
    Start 5: DECOMP_TEST
5/8 Test #5: DECOMP_TEST ......................   Passed    0.91 sec
    Start 6: HALO_TEST
6/8 Test #6: HALO_TEST ........................   Passed    1.30 sec
    Start 7: IO_TEST
7/8 Test #7: IO_TEST ..........................***Failed    1.48 sec
    Start 8: YAKL_TEST
8/8 Test #8: YAKL_TEST ........................   Passed    0.02 sec

88% tests passed, 1 tests failed out of 8

Total Test time (real) =   5.62 sec

The following tests FAILED:
	  7 - IO_TEST (Failed)

Note, the IO_TEST failure is known and unrelated to this PR.

Copy link

@mark-petersen mark-petersen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked through the HaloTest.cpp code, which is a good example of how this is used and a good verification. Approving based on these tests and Phil's review.

@philipwjones philipwjones merged commit 5c2d7c9 into E3SM-Project:develop Jan 25, 2024
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants