
200 x slower after restart #4701

Closed
pordyna opened this issue Feb 16, 2024 · 17 comments
Assignees: atmyers
Labels: bug (Something isn't working), component: checkpoint/restart (Checkpointing & restarts), component: collisions (Anything related to particle collisions), component: load balancing (Load balancing strategies, optimization etc.), component: parallelization (Guard cell exchanges and particle redistribution)

Comments

pordyna (Contributor) commented Feb 16, 2024

Problem with restarting

Hi!
I was setting up WarpX simulations on Perlmutter and ran into a somewhat weird problem. My simulations run without any problem, but when I restart them from a checkpoint they become extremely slow (the time per step jumps from 200 ms to 40 s!). I don't know what is happening here. Restarting again after a few more (slow) steps doesn't change anything; it stays around 40 s per step. From the verbose output / profiler it looks like most of the time is spent in collisions.
I tried switching on load balancing, as well as disabling the sorting for deposition and switching to sorting into cells (probably better for collisions anyway), but this didn't help.

The change in computation time always happens on the first restart, regardless of the time step, so it can't be due to some rapid change in physics.

Last step before the checkpoint:

STEP 5462 starts ...
--- INFO    : re-sorting particles
STEP 5462 ends. TIME = 1.025484411e-13 DT = 1.877488852e-17
Evolve time = 863.4138714 s; This step = 0.162532411 s; Avg. per step = 0.1580765052 s

First step after the restart:

STEP 5463 ends. TIME = 1.02567216e-13 DT = 1.877488852e-17
Evolve time = 30.37871414 s; This step = 30.37871414 s; Avg. per step = 30.37871414 s

Here are my inputs and the full stdout and stderr: input_output.zip
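
For context, a restart of this kind is configured in the WarpX inputs via a checkpoint-format diagnostic plus amr.restart. Below is a minimal sketch of the relevant lines, assuming the diagnostic names that appear in the logs later in this thread ("particles" and "checkpoint"); the interval is illustrative and not taken from the attached inputs:

# checkpoints are written as a regular diagnostic with format = checkpoint
diagnostics.diags_names = particles checkpoint
checkpoint.format = checkpoint
checkpoint.intervals = 1000                 # illustrative interval

# restarted run: point amr.restart at an existing checkpoint directory
amr.restart = diags/checkpoint00487416      # directory name as written in the logs below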

ax3l (Member) commented Feb 21, 2024

Thank you for the detailed report!

It looks like the load is badly imbalanced after the restart, with collision physics suddenly dominating the time step. With only 37 steps taken after the restart, you have probably not yet reached a load-balancing step.

Can you try to add a load balance directly after restart, e.g.,

algo.load_balance_intervals = 1:1:1,5464:5464,100

and see if that helps?
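
For reference, the interval syntax above combines three slices (step 1 only, step 5464 only, then every 100 steps). A few related load-balancing options are often set alongside it; the option names below follow the WarpX inputs documentation as I recall it, and the values are only illustrative:

algo.load_balance_intervals = 1:1:1,5464:5464,100
# cost model used when redistributing boxes; "timers" uses measured run times,
# while "heuristic" estimates costs from cell and particle counts
algo.load_balance_costs_update = timers
# distribute costs along a space-filling curve instead of the default knapsack algorithm
algo.load_balance_with_sfc = 1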

@ax3l added the labels component: parallelization, component: collisions, component: checkpoint/restart, and component: load balancing on Feb 21, 2024
pordyna (Contributor, Author) commented Feb 22, 2024

Hi, thanks for the suggestion, I will try this out. But the same thing happened when I did not have load balancing enabled at all (unless it is enabled by default?). My understanding would be that, in that case, the simulation simply continues with the same initial distribution?

pordyna (Contributor, Author) commented Feb 22, 2024

Unfortunately, this did not change anything.

pordyna (Contributor, Author) commented Feb 22, 2024

So it looks like after one of the restarts I got to write another checkpoint together with the diagnostics. Here are some exemplary fields at step 5499. This doesn't look very good...

[Attached: exemplary field plots at step 5499 (Figure 10, Figure 9(1))]

pordyna (Contributor, Author) commented Feb 22, 2024

To be honest, it looks a bit like the subdomains are being swapped or misaligned during the restart.

@ax3l added the labels bug and bug: affects latest release on Feb 23, 2024
ax3l (Member) commented Feb 23, 2024

This looks very broken. Which exact version of WarpX are you using?

Thanks for the input files in the original issue; is there anything more we need to reproduce this restart bug?

pordyna (Contributor, Author) commented Feb 23, 2024

So it looks like I forgot to check out the latest release tag and was running from the development branch, specifically from the following commit: https://github.com/ECP-WarpX/WarpX/tree/a9d8126b500e1c7197eb0ed1e52fd50bb09cbdf4. Could this be the problem?
Here is the input file once again (the auto-generated warpx_used_inputs is missing quotes around the analytical expressions and didn't work for resubmitting):
inputs_from_picmi.txt
And here is my environment:
perlmutter_gpu_warpx.profile.txt
And here is my dependencies install script:
install_gpu_dependencies.sh.txt

I suppose the setup could be somewhat simplified and still reproduce the bug.
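
For anyone reproducing this, switching from the development branch to a release is just a matter of checking out the tag and rebuilding. A sketch, assuming the source and build locations that appear in the attached logs (~/src/warpx and build_pm_gpu); the tag is the release current at the time of this thread:

cd ~/src/warpx                     # source location assumed from the paths in the logs
git fetch --tags origin
git checkout 24.02                 # latest release tag at the time of this thread
cmake --build build_pm_gpu -j 16   # rebuild before resubmitting the job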

ax3l (Member) commented Feb 23, 2024

Did you write the checkpoint and restart with the exact same version of WarpX?
Please check whether this also occurs when the latest development version is used for both writing the checkpoint and restarting.

pordyna (Contributor, Author) commented Feb 24, 2024

Yes, it was all the same version. I was initially just testing automatic restart. OK, I will recompile and check it.

pordyna (Contributor, Author) commented Feb 26, 2024

@ax3l This bug is still there when running on the newest development branch, to be exact on:

commit 9a017a67e5495263223da42db47657693b25bbd2 (HEAD -> development, origin/development, origin/HEAD)
Author: Eya D <[email protected]>
Date:   Fri Feb 23 21:55:45 2024 -0800

Before checkpoint: [Figure 11]
After checkpoint: [Figure 12]

pordyna (Contributor, Author) commented Feb 26, 2024

Additionally, I have observed that some of my simulations crash just after writing a checkpoint (but only sometimes), with the following error message:

amrex::Abort::6::CUDA error 700 in file /global/homes/p/pordyna/src/warpx/build_pm_gpu/_deps/fetchedamrex-src/Src/Base/AMReX_GpuDevice.cpp line 598: an illegal memory access was encountered !!!
SIGABRT
See Backtrace.6 file for details
MPICH ERROR [Rank 6] [job id 22064439.0] [Fri Feb 23 05:39:47 2024] [nid002864] - Abort(6) (rank 6 in comm 496): application called MPI_Abort(comm=0x84000001, 6) - process 6

(Those simulations ran with the previously mentioned commit https://github.com/ECP-WarpX/WarpX/tree/a9d8126b500e1c7197eb0ed1e52fd50bb09cbdf4.)

The stdout files end like this:

STEP 487415 starts ...
STEP 487415 ends. TIME = 9.151162288e-12 DT = 1.877488852e-17
Evolve time = 59370.8985 s; This step = 0.115401544 s; Avg. per step = 0.1218076967 s

STEP 487416 starts ...
--- INFO    : re-sorting particles
--- INFO    : Writing openPMD file diags/particles00487416
--- INFO    : Writing checkpoint diags/checkpoint00487416
STEP 487416 ends. TIME = 9.151181063e-12 DT = 1.877488852e-17
Evolve time = 59376.77744 s; This step = 5.878937033 s; Avg. per step = 0.1218195083 s

STEP 487417 starts ...

It always crashes in the step after the checkpoint is written. It is a bit confusing that it is not the same step (is the checkpoint written asynchronously to the execution?).

I would say this suggests that the simulation is already writing corrupted checkpoints, probably accessing the wrong part of memory, and that sometimes results in an illegal memory access.
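
A generic way to localize an illegal-access crash like this (not something done in this thread) is to run a small reproducer under NVIDIA's compute-sanitizer, which reports the offending kernel and address; the job-step flags and executable path below are placeholders:

# single-process reproducer under the CUDA memory checker (expect a large slowdown)
srun -n 1 -G 1 compute-sanitizer --tool memcheck ./warpx inputs_from_picmi.txt

The Backtrace.* files referenced in the abort message also contain the host-side call stack at the time of the abort.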

@ax3l assigned atmyers and unassigned ax3l on Feb 29, 2024
ax3l (Member) commented Feb 29, 2024

I think this problem was introduced this month. X-ref #4735.

Please use WarpX 24.02 or earlier for now until we fix it.

We plan to ship a fix in 24.03 with this PR: AMReX-Codes/amrex#3783
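
For anyone who needs the fix before 24.03 ships: the WarpX CMake superbuild can point its bundled AMReX at a chosen repository and branch once the referenced PR is merged. A sketch, assuming the WarpX_amrex_repo / WarpX_amrex_branch build options and the build directory used in this thread; the branch value is illustrative:

cmake -S . -B build_pm_gpu \
  -DWarpX_COMPUTE=CUDA \
  -DWarpX_amrex_repo=https://github.com/AMReX-Codes/amrex.git \
  -DWarpX_amrex_branch=development
cmake --build build_pm_gpu -j 16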

pordyna (Contributor, Author) commented Mar 1, 2024

@ax3l Did you mean 24.01, or really 23.01?

ax3l (Member) commented Mar 3, 2024

I actually meant 24.02 :D

Can you confirm that version was still ok?

pordyna (Contributor, Author) commented Mar 4, 2024

Haha no problem, I will check it today and let you know.

pordyna (Contributor, Author) commented Mar 8, 2024

@ax3l 24.02 works fine, thanks for solving the issue.

ax3l (Member) commented Mar 8, 2024

Thanks for confirming!

@pordyna the WarpX 24.03 release as of #4759 should also fix this. Thanks for reporting this before the release! 🙏

Please let us know if 24.03 shows any issues for you and we will reopen this.

@ax3l closed this as completed on Mar 8, 2024