-
Notifications
You must be signed in to change notification settings - Fork 365
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix GPU restart for pure SoA particles #3783
Fix GPU restart for pure SoA particles #3783
Conversation
Thank you for debugging and fixing this! ✨ There are a few CI test issues left:
and
|
For Nyx restart, the current development branch works on CPU and GPU. With this branch, it works on CPU, but fails on GPU. https://ccse.lbl.gov/pub/GpuRegressionTesting/Nyx/2024-03-02/index.html |
I just fixed some things, hopefully everything will pass now. |
The Nyx tests pass now. But this IAMR non-restart test still fails. https://ccse.lbl.gov/pub/GpuRegressionTesting/IAMR/2024-03-02-001/Part-2d.html |
The reconnection GPU tests also fail. But the problem might be in the reconnection code, not amrex. This PR: https://ccse.lbl.gov/pub/GpuRegressionTesting/reconnection/2024-03-02/index.html development: https://ccse.lbl.gov/pub/GpuRegressionTesting/reconnection/2024-03-01/index.html |
The amrex ParticleMesh test fails. https://ccse.lbl.gov/pub/GpuRegressionTesting/AMReX/2024-03-02/ParticleMesh.html |
For reconnection, I also tried an earlier version (that works with the development branch of amrex) with this PR, the errors were similar to those of the latest reconnection code plus this PR. https://ccse.lbl.gov/pub/GpuRegressionTesting/reconnection/2024-03-02-001/index.html |
Update : I do see the illegal access memory with this current PR, when I ran with 2 GPUs - same as https://ccse.lbl.gov/pub/GpuRegressionTesting/reconnection/2024-03-02-001/Baseline2Dsigma30Perturbation.html 1 . The test where I compiled with the dev branch crashed at the first checkpoint I/O, and this PR fixes that issue
|
The issue Reva is seeing was fixed in #3769, the branch of this PR is outdated and hasn't got that fix yet. |
The reconnection code works now. |
Remaining issues:
Or is the old behavior actually wrong? |
The issue is in the GPU version of packIOData. If I use managed memory and force it to use the CPU version of packIOData, it works. |
There might be another issue in the GPU version of packIOData besides the one that causes the regression failure. The |
@AlexanderSinn @WeiqunZhang Thanks! |
To summarize the current status, all regression tests I have tried have passed (including the reconnection) except for a few tests in amrex and IAMR after the development branch was merged into this. (Yes, I merged and pushed to this PR.) The amrex and IAMR regressions issues are new issues introduced in this PR. The code change suggestion I made above seems to solve the new regression issues. (No, I did not commit it). There is an existing defect regarding the filtering flags. That should be easy to fix either in this PR or a follow-up. |
To clarify, the reconnection tests work with the current PR and I can approve this PR once the new IAMR and amrex issued are resolved. |
Co-authored-by: Weiqun Zhang <[email protected]>
Just checking for my understanding:
Is that fixed in the merge or to do? |
The WarpX restart issues seem to pass now based on the minimal reproducer. The offsets bug is still to-do. |
The proposed changes: