-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Halo module #42
Add Halo module #42
Conversation
Source code fails the linting process, I committed it with 'git commit --no-verify' to get the PR started. Will fix the formatting issues tomorrow. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@brian-oneill In addition to the changes mentioned here, the test code is hanging on Chrysalis with Intel, so I suspect we have a race condition somewhere. I'm trying to track it down and will let you know.
Another comment. In your test routine, you copy the reference result into the test array. I realize that it's an easy way to fill the test array in a generic test function, but it leaves open the possibility that the test will pass if the Halo doesn't do anything. I think you want to initialize the test array in the Owned region only. That might mean you pass a mask into the test routine to zero out the halo entries before the halo update. I actually did a quick alternative test that only fills the owned region for one of the 1-d tests and it still passes, so I think the routines are still working fine except for the random failures I'm hitting on Chrysalis. But it would be a more robust test if you fill the halo with garbage or zeros before the halo call. |
Formatting issues are fixed, linting CI shouldn't complain now. I'll hold off on pushing until any other needed updates are ready. Working on changes to the unit test now. I'll also try the test on Chrysalis and check out these random FAILs. |
@brian-oneill Returning to this today, I can no longer reproduce the random fails on Chrysalis, so not sure what was going on there. If it's a synchronization thing, it might pop up again later, but I guess we'll deal with it then. So other than the temporary renaming of the srch vector routine and the removal of the return statement in the test driver, it all seems to be working now. |
@brian-oneill , @philip W. Jones, I encountered an error while running testHalo on Chrysalis. The error message is as follows:
It appears that the |
@brian-oneill , To enable CTest to identify failed test cases, "HALO_TEST" needs to be added in the "set_tests_properties" function within the CMake configuration at ${E3SM_HOME}/components/omega/test/CMakeListss.cmake, as follows: set_tests_properties( |
2. Brought up to date with CMake changes 3. Fixed issues preventing build on Chrysalis 4. Made requested unit test changes 5. Added MPI_Wait calls after non-blocking sends, should resolve synchronization issue noticed on Chrysalis 6. Update to Halo class to properly account for number of halo layers in edge and vertex index spaces
PR should be good to go now. I did encounter some of the random FAILs on Chrysalis, I added some MPI_Wait calls after the sends are initiated that I think should resolve the issue. I didn't encounter any FAILs after making this change. @philipwjones On a separate note, the IO test is broken now, which I suspect needs to be brought up to date with the memory cleanup changes. |
Thanks @brian-oneill I'll re-review today. And yes, I'm working on several IOTest fixes, not only the memory cleanup but some others that showed up when I fixed the funky syntax that Maciej discovered. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This all looks good now and all requested changes have been made. Unit tests pass on Chrysalis. Frontier down for maintenance but past issues were all on Chrysalis anyway.
Tested successfully on chrysalis and perlmutter as follows
with results:
Note, the IO_TEST failure is known and unrelated to this PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I looked through the HaloTest.cpp
code, which is a good example of how this is used and a good verification. Approving based on these tests and Phil's review.
Adds halo exchange capability for supported array types, including:
Changes have been built and tested on Frontier.
Checklist