Fix GPU restart for pure SoA particles #3783

atmyers · 2024-03-01T21:14:42Z

The proposed changes:

- fix a bug or incorrect behavior in AMReX
- add new capabilities to AMReX
- changes answers in the test suite to more than roundoff level
- are likely to significantly affect the results of downstream AMReX users
- include documentation in the code and/or rst files, if appropriate

ax3l · 2024-03-02T02:37:07Z

Thank you for debugging and fixing this! ✨

There are a few CI test issues left:

0::Assertion `sm_old == sm_new' failed, file "/Users/runner/work/amrex/amrex/Tests/Particles/CheckpointRestartSOA/main.cpp", line 174 !!!

and

1::Assertion `ptemp.id() > 0' failed, file "/Users/runner/work/amrex/amrex/Src/Particle/AMReX_ParticleIO.H", line 1022 !!!

overlooked CI

WeiqunZhang · 2024-03-03T01:34:28Z

For Nyx restart, the current development branch works on CPU and GPU. With this branch, it works on CPU, but fails on GPU.

https://ccse.lbl.gov/pub/GpuRegressionTesting/Nyx/2024-03-02/index.html

atmyers · 2024-03-03T02:17:45Z

I just fixed some things, hopefully everything will pass now.

WeiqunZhang · 2024-03-03T04:10:04Z

The Nyx tests pass now. But this IAMR non-restart test still fails. https://ccse.lbl.gov/pub/GpuRegressionTesting/IAMR/2024-03-02-001/Part-2d.html

WeiqunZhang · 2024-03-03T04:32:18Z

The reconnection GPU tests also fail. But the problem might be in the reconnection code, not amrex.

This PR: https://ccse.lbl.gov/pub/GpuRegressionTesting/reconnection/2024-03-02/index.html

development: https://ccse.lbl.gov/pub/GpuRegressionTesting/reconnection/2024-03-01/index.html
The restart test that started Friday morning never finished, and I killed it manually today.

@RevathiJambunathan

WeiqunZhang · 2024-03-03T05:35:30Z

The amrex ParticleMesh test fails. https://ccse.lbl.gov/pub/GpuRegressionTesting/AMReX/2024-03-02/ParticleMesh.html

WeiqunZhang · 2024-03-03T05:46:42Z

For reconnection, I also tried an earlier version (one that works with the development branch of amrex) with this PR. The errors were similar to those of the latest reconnection code plus this PR. https://ccse.lbl.gov/pub/GpuRegressionTesting/reconnection/2024-03-02-001/index.html

RevathiJambunathan · 2024-03-03T20:25:19Z

Update: I do see the illegal memory access with the current PR when running with 2 GPUs, the same as https://ccse.lbl.gov/pub/GpuRegressionTesting/reconnection/2024-03-02-001/Baseline2Dsigma30Perturbation.html

1. The test I compiled with the dev branch crashed at the first checkpoint I/O, and this PR fixes that issue.

However, when I run the current PR, the test still crashes. It now writes the checkpoint file but crashes at RedistributeLocal(). The backtrace points to

amrex/Src/Particle/AMReX_ParticleUtil.H

Line 588 in 3525b4a

partitionParticlesByDest (PTile& ptile, const PLocator& ploc, CellAssignor&& assignor,

and

amrex/Src/Particle/AMReX_ParticleUtil.H

Line 647 in 3525b4a

return partitionParticles(ptile,

AlexanderSinn · 2024-03-04T18:41:09Z

The issue Reva is seeing was fixed in #3769, the branch of this PR is outdated and hasn't got that fix yet.

WeiqunZhang · 2024-03-04T20:10:03Z

The reconnection code works now.

WeiqunZhang · 2024-03-04T20:20:18Z

Remaining issues:

Or is the old behavior actually wrong?

WeiqunZhang · 2024-03-05T00:39:47Z

The issue is in the GPU version of packIOData. If I use managed memory and force it to use the CPU version of packIOData, it works.

WeiqunZhang · 2024-03-05T00:52:59Z

There might be another issue in the GPU version of packIOData besides the one that causes the regression failure. The IntVector offsets computed by the exclusive sum are never used. If the particle I/O flags are not all true, the packed output will not be right.

Src/Particle/AMReX_WriteBinaryParticleData.H
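
To illustrate why unused offsets matter when some write flags are false, here is a minimal standalone sketch. It is not the AMReX implementation of packIOData: the names `exclusive_sum` and `pack_selected` are hypothetical, and it assumes one 0/1 write flag per component. The exclusive sum gives each flagged component its slot in the packed record; indexing by the raw component number instead would only be correct when every flag is set.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not an AMReX function): exclusive prefix sum of 0/1 flags.
// offsets[i] is the number of flagged components that come before component i.
std::vector<int> exclusive_sum(const std::vector<int>& flags)
{
    std::vector<int> offsets(flags.size());
    int running = 0;
    for (std::size_t i = 0; i < flags.size(); ++i) {
        offsets[i] = running;
        running += (flags[i] != 0) ? 1 : 0;
    }
    return offsets;
}

// Hypothetical helper: pack only the flagged components of one particle.
// Each flagged component i is written to slot offsets[i], not to slot i.
std::vector<double> pack_selected(const std::vector<double>& comps,
                                  const std::vector<int>& flags)
{
    assert(comps.size() == flags.size());
    const std::vector<int> offsets = exclusive_sum(flags);

    // Count the flagged components to size the packed record.
    std::size_t nkept = 0;
    for (int f : flags) {
        if (f != 0) { ++nkept; }
    }

    std::vector<double> packed(nkept);
    for (std::size_t i = 0; i < comps.size(); ++i) {
        if (flags[i] != 0) {
            packed[static_cast<std::size_t>(offsets[i])] = comps[i];
        }
    }
    return packed;
}
```

With flags `{1, 0, 1, 1}` the offsets are `{0, 1, 1, 2}`, so the third and fourth components land in packed slots 1 and 2. When all flags are 1 the offsets reduce to `{0, 1, 2, 3}`, which is why a version that ignores them can still pass tests that write every component.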

RevathiJambunathan · 2024-03-05T05:47:39Z

@AlexanderSinn @WeiqunZhang Thanks!
Yes, updating branch now!

WeiqunZhang · 2024-03-06T17:21:06Z

To summarize the current status, all regression tests I have tried have passed (including the reconnection tests) except for a few tests in amrex and IAMR after the development branch was merged into this. (Yes, I merged and pushed to this PR.) The amrex and IAMR regression issues are new issues introduced in this PR. The code change suggestion I made above seems to solve the new regression issues. (No, I did not commit it.) There is an existing defect regarding the filtering flags. That should be easy to fix either in this PR or a follow-up.

WeiqunZhang · 2024-03-06T17:27:17Z

To clarify, the reconnection tests work with the current PR, and I can approve this PR once the new IAMR and amrex issues are resolved.

Co-authored-by: Weiqun Zhang <[email protected]>

ax3l · 2024-03-07T12:31:42Z

Just checking for my understanding:
Did WarpX restarts pass now? :)

> There is an existing defect regarding the filtering flags. That should be easy to fix either in this PR or a follow-up.

Is that fixed in the merge or to do?

atmyers · 2024-03-07T16:39:58Z

The WarpX restart issues seem to pass now based on the minimal reproducer. The offsets bug is still to-do.

Fix GPU restart for pure SoA particles

a664c25

atmyers requested a review from ax3l March 1, 2024 21:14

WeiqunZhang mentioned this pull request Mar 2, 2024

Update CHANGES for 24.03 #3782

Merged

ax3l requested a review from WeiqunZhang March 2, 2024 02:30

ax3l added the bug label Mar 2, 2024

This was referenced Mar 3, 2024

checkpoint restart mixes up particle position data and loses particles ECP-WarpX/WarpX#4735

Closed

200 x slower after restart ECP-WarpX/WarpX#4701

Closed

fix cpu and async restarts

5386de9

Merge branch 'development' into soa_gpu_restart_bugfix

8c4ee02

WeiqunZhang reviewed Mar 5, 2024

Src/Particle/AMReX_WriteBinaryParticleData.H (outdated)

ax3l added the GPU label Mar 5, 2024

Update Src/Particle/AMReX_WriteBinaryParticleData.H

3238707

Co-authored-by: Weiqun Zhang <[email protected]>

WeiqunZhang approved these changes Mar 6, 2024

WeiqunZhang assigned atmyers Mar 6, 2024

atmyers merged commit f1ef81e into AMReX-Codes:development Mar 6, 2024
68 of 69 checks passed
