Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix GPU restart for pure SoA particles #3783

Merged
merged 4 commits into from
Mar 6, 2024

Conversation

atmyers
Copy link
Member

@atmyers atmyers commented Mar 1, 2024

The proposed changes:

  • fix a bug or incorrect behavior in AMReX
  • add new capabilities to AMReX
  • changes answers in the test suite to more than roundoff level
  • are likely to significantly affect the results of downstream AMReX users
  • include documentation in the code and/or rst files, if appropriate

@atmyers atmyers requested a review from ax3l March 1, 2024 21:14
@ax3l ax3l requested a review from WeiqunZhang March 2, 2024 02:30
@ax3l ax3l added the bug label Mar 2, 2024
ax3l

This comment was marked as outdated.

@ax3l
Copy link
Member

ax3l commented Mar 2, 2024

Thank you for debugging and fixing this! ✨

There are a few CI test issues left:

0::Assertion `sm_old == sm_new' failed, file "/Users/runner/work/amrex/amrex/Tests/Particles/CheckpointRestartSOA/main.cpp", line 174 !!!

and

1::Assertion `ptemp.id() > 0' failed, file "/Users/runner/work/amrex/amrex/Src/Particle/AMReX_ParticleIO.H", line 1022 !!!

@WeiqunZhang
Copy link
Member

For Nyx restart, the current development branch works on CPU and GPU. With this branch, it works on CPU, but fails on GPU.

https://ccse.lbl.gov/pub/GpuRegressionTesting/Nyx/2024-03-02/index.html

@atmyers
Copy link
Member Author

atmyers commented Mar 3, 2024

I just fixed some things, hopefully everything will pass now.

@WeiqunZhang
Copy link
Member

The Nyx tests pass now. But this IAMR non-restart test still fails. https://ccse.lbl.gov/pub/GpuRegressionTesting/IAMR/2024-03-02-001/Part-2d.html

@WeiqunZhang
Copy link
Member

The reconnection GPU tests also fail. But the problem might be in the reconnection code, not amrex.

This PR: https://ccse.lbl.gov/pub/GpuRegressionTesting/reconnection/2024-03-02/index.html

development: https://ccse.lbl.gov/pub/GpuRegressionTesting/reconnection/2024-03-01/index.html
The restart test started Friday morning was never able to finish, and I manually killed it today.

@RevathiJambunathan

@WeiqunZhang
Copy link
Member

@WeiqunZhang
Copy link
Member

For reconnection, I also tried an earlier version (that works with the development branch of amrex) with this PR, the errors were similar to those of the latest reconnection code plus this PR. https://ccse.lbl.gov/pub/GpuRegressionTesting/reconnection/2024-03-02-001/index.html

@RevathiJambunathan
Copy link
Contributor

RevathiJambunathan commented Mar 3, 2024

Update : I do see the illegal access memory with this current PR, when I ran with 2 GPUs - same as https://ccse.lbl.gov/pub/GpuRegressionTesting/reconnection/2024-03-02-001/Baseline2Dsigma30Perturbation.html

1 . The test where I compiled with the dev branch crashed at the first checkpoint I/O, and this PR fixes that issue

  1. However, when I run the current PR, the test still crashes. Now it write the checkpoint file, but crashes at RedistributeLocal(). Backtrace points to
    partitionParticlesByDest (PTile& ptile, const PLocator& ploc, CellAssignor&& assignor,
    and
    return partitionParticles(ptile,

@AlexanderSinn
Copy link
Member

The issue Reva is seeing was fixed in #3769, the branch of this PR is outdated and hasn't got that fix yet.

@WeiqunZhang
Copy link
Member

The reconnection code works now.

@WeiqunZhang
Copy link
Member

@WeiqunZhang
Copy link
Member

The issue is in the GPU version of packIOData. If I use managed memory and force it to use the CPU version of packIOData, it works.

@WeiqunZhang
Copy link
Member

There might be another issue in the GPU version of packIOData besides the one that causes the regression failure. The IntVector offsets computed by exclusive sum is never used. If the particle io flags is not all true, that will not be right.

@ax3l ax3l added the GPU label Mar 5, 2024
@RevathiJambunathan
Copy link
Contributor

RevathiJambunathan commented Mar 5, 2024

@AlexanderSinn @WeiqunZhang Thanks!
Yes, updating branch now!

@WeiqunZhang
Copy link
Member

WeiqunZhang commented Mar 6, 2024

To summarize the current status, all regression tests I have tried have passed (including the reconnection) except for a few tests in amrex and IAMR after the development branch was merged into this. (Yes, I merged and pushed to this PR.) The amrex and IAMR regressions issues are new issues introduced in this PR. The code change suggestion I made above seems to solve the new regression issues. (No, I did not commit it). There is an existing defect regarding the filtering flags. That should be easy to fix either in this PR or a follow-up.

@WeiqunZhang
Copy link
Member

To clarify, the reconnection tests work with the current PR and I can approve this PR once the new IAMR and amrex issued are resolved.

@atmyers atmyers merged commit f1ef81e into AMReX-Codes:development Mar 6, 2024
68 of 69 checks passed
@ax3l
Copy link
Member

ax3l commented Mar 7, 2024

Just checking for my understanding:
Did WarpX restarts pass now? :)

There is an existing defect regarding the filtering flags. That should be easy to fix either in this PR or a follow-up.

Is that fixed in the merge or to do?

@atmyers
Copy link
Member Author

atmyers commented Mar 7, 2024

The WarpX restart issues seem to pass now based on the minimal reproducer. The offsets bug is still to-do.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants