
200 x slower after restart #4701

Closed
pordyna opened this issue Feb 16, 2024 · 17 comments
Assignees: atmyers
Labels: bug (Something isn't working), component: checkpoint/restart (Checkpointing & restarts), component: collisions (Anything related to particle collisions), component: load balancing (Load balancing strategies, optimization etc.), component: parallelization (Guard cell exchanges and particle redistribution)

Comments

pordyna (Contributor) commented Feb 16, 2024

Problem with restarting

Hi!
I was setting up WarpX simulations on Perlmutter and ran into a somewhat weird problem. My simulations run without any problem, but when I restart them from a checkpoint they become extremely slow (the time per step jumps from 200 ms to 40 s!). I don't know what is happening here. Restarting again after a few more (slow) steps doesn't change anything; it stays around 40 s per step. From the verbose output / profiler it looks like most of the time is spent in collisions.
I tried switching on load balancing, as well as disabling the sorting for deposition and switching to sorting into cells (probably better for collisions anyway), but this didn't help.

The change in computation time always happens on the first restart, regardless of the time step, so it can't be due to some rapid change in physics.

Last step before the checkpoint:

STEP 5462 starts ...
--- INFO    : re-sorting particles
STEP 5462 ends. TIME = 1.025484411e-13 DT = 1.877488852e-17
Evolve time = 863.4138714 s; This step = 0.162532411 s; Avg. per step = 0.1580765052 s

First step after the restart:

STEP 5463 ends. TIME = 1.02567216e-13 DT = 1.877488852e-17
Evolve time = 30.37871414 s; This step = 30.37871414 s; Avg. per step = 30.37871414 s

Here are my inputs and the full stdout and stderr: input_output.zip
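
For context, a restart of this kind is configured in the WarpX inputs via a checkpoint-format diagnostic plus amr.restart. Below is a minimal sketch of the relevant lines, assuming the diagnostic names that appear in the logs later in this thread ("particles" and "checkpoint"); the interval is illustrative and not taken from the attached inputs:

# checkpoints are written as a regular diagnostic with format = checkpoint
diagnostics.diags_names = particles checkpoint
checkpoint.format = checkpoint
checkpoint.intervals = 1000                 # illustrative interval

# restarted run: point amr.restart at an existing checkpoint directory
amr.restart = diags/checkpoint00487416      # directory name as written in the logs below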

ax3l (Member) commented Feb 21, 2024

Thank you for the detailed report!

It looks like the load is badly imbalanced after the restart, with collision physics suddenly dominating the time step. With only 37 steps taken after the restart, you have probably not yet reached a load-balancing step.

Can you try to add a load balance directly after restart, e.g.,

algo.load_balance_intervals = 1:1:1,5464:5464,100

and see if that helps?
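
For reference, the interval syntax above combines three slices (step 1 only, step 5464 only, then every 100 steps). A few related load-balancing options are often set alongside it; the option names below follow the WarpX inputs documentation as I recall it, and the values are only illustrative:

algo.load_balance_intervals = 1:1:1,5464:5464,100
# cost model used when redistributing boxes; "timers" uses measured run times,
# while "heuristic" estimates costs from cell and particle counts
algo.load_balance_costs_update = timers
# distribute costs along a space-filling curve instead of the default knapsack algorithm
algo.load_balance_with_sfc = 1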

@ax3l added the labels component: parallelization, component: collisions, component: checkpoint/restart, and component: load balancing on Feb 21, 2024
pordyna (Contributor, Author) commented Feb 22, 2024

Hi, thanks for the suggestion, I will try this out. But the same thing happened when I did not have load balancing enabled at all (unless it is enabled by default?). My understanding would be that, in that case, the simulation simply continues with the same initial distribution?

pordyna (Contributor, Author) commented Feb 22, 2024

Unfortunately, this did not change anything.

pordyna (Contributor, Author) commented Feb 22, 2024

So it looks like after one of the restarts I got to write another checkpoint together with the diagnostics. Here are some exemplary fields at step 5499. This doesn't look very good...

[Attached: exemplary field plots at step 5499 (Figure 10, Figure 9(1))]

pordyna (Contributor, Author) commented Feb 22, 2024

To be honest, it looks a bit like the subdomains are being swapped or misaligned during the restart.

@ax3l added the labels bug and bug: affects latest release on Feb 23, 2024
ax3l (Member) commented Feb 23, 2024

This looks very broken. Which exact version of WarpX are you using?

Thanks for the input files in the original issue; is there anything more we need to reproduce this restart bug?

pordyna (Contributor, Author) commented Feb 23, 2024

So it looks like I forgot to check out the latest release tag and was running from the development branch, specifically from the following commit: https://github.com/ECP-WarpX/WarpX/tree/a9d8126b500e1c7197eb0ed1e52fd50bb09cbdf4. Could this be the problem?
Here is the input file once again (the auto-generated warpx_used_inputs is missing quotes around the analytical expressions and didn't work for resubmitting):
inputs_from_picmi.txt
And here is my environment:
perlmutter_gpu_warpx.profile.txt
And here is my dependencies install script:
install_gpu_dependencies.sh.txt

I suppose the setup could be somewhat simplified and still reproduce the bug.
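
For anyone reproducing this, switching from the development branch to a release is just a matter of checking out the tag and rebuilding. A sketch, assuming the source and build locations that appear in the attached logs (~/src/warpx and build_pm_gpu); the tag is the release current at the time of this thread:

cd ~/src/warpx                     # source location assumed from the paths in the logs
git fetch --tags origin
git checkout 24.02                 # latest release tag at the time of this thread
cmake --build build_pm_gpu -j 16   # rebuild before resubmitting the job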

ax3l (Member) commented Feb 23, 2024

Did you write the checkpoint and restart with the exact same version of WarpX?
Please check whether this also occurs when the latest development version is used for both writing the checkpoint and restarting.

pordyna (Contributor, Author) commented Feb 24, 2024

Yes, it was all the same version. I was initially just testing automatic restart. OK, I will recompile and check it.

pordyna (Contributor, Author) commented Feb 26, 2024

@ax3l This bug is still there when running on the newest development branch, to be exact on:

commit 9a017a67e5495263223da42db47657693b25bbd2 (HEAD -> development, origin/development, origin/HEAD)
Author: Eya D <[email protected]>
Date:   Fri Feb 23 21:55:45 2024 -0800

Before checkpoint: [Figure 11]
After checkpoint: [Figure 12]

pordyna (Contributor, Author) commented Feb 26, 2024

Additionally, I have observed that some of my simulations crash just after writing a checkpoint (but only sometimes), with the following error message:

amrex::Abort::6::CUDA error 700 in file /global/homes/p/pordyna/src/warpx/build_pm_gpu/_deps/fetchedamrex-src/Src/Base/AMReX_GpuDevice.cpp line 598: an illegal memory access was encountered !!!
SIGABRT
See Backtrace.6 file for details
MPICH ERROR [Rank 6] [job id 22064439.0] [Fri Feb 23 05:39:47 2024] [nid002864] - Abort(6) (rank 6 in comm 496): application called MPI_Abort(comm=0x84000001, 6) - process 6

(Those simulations ran with the previously mentioned commit https://github.com/ECP-WarpX/WarpX/tree/a9d8126b500e1c7197eb0ed1e52fd50bb09cbdf4.)

The stdout files end like this:

STEP 487415 starts ...
STEP 487415 ends. TIME = 9.151162288e-12 DT = 1.877488852e-17
Evolve time = 59370.8985 s; This step = 0.115401544 s; Avg. per step = 0.1218076967 s

STEP 487416 starts ...
--- INFO    : re-sorting particles
--- INFO    : Writing openPMD file diags/particles00487416
--- INFO    : Writing checkpoint diags/checkpoint00487416
STEP 487416 ends. TIME = 9.151181063e-12 DT = 1.877488852e-17
Evolve time = 59376.77744 s; This step = 5.878937033 s; Avg. per step = 0.1218195083 s

STEP 487417 starts ...

It always crashes in the step after the checkpoint is written. It is a bit confusing that it is not the same step (is the checkpoint written asynchronously to the execution?).

I would say this suggests that the simulation is already writing corrupted checkpoints, probably accessing the wrong part of memory, and that sometimes results in an illegal memory access.
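
A generic way to localize an illegal-access crash like this (not something done in this thread) is to run a small reproducer under NVIDIA's compute-sanitizer, which reports the offending kernel and address; the job-step flags and executable path below are placeholders:

# single-process reproducer under the CUDA memory checker (expect a large slowdown)
srun -n 1 -G 1 compute-sanitizer --tool memcheck ./warpx inputs_from_picmi.txt

The Backtrace.* files referenced in the abort message also contain the host-side call stack at the time of the abort.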

@ax3l assigned atmyers and unassigned ax3l on Feb 29, 2024
ax3l (Member) commented Feb 29, 2024

I think this problem was introduced this month. X-ref #4735.

Please use WarpX 24.02 or earlier for now until we fix it.

We plan to ship a fix in 24.03 with this PR: AMReX-Codes/amrex#3783
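
For anyone who needs the fix before 24.03 ships: the WarpX CMake superbuild can point its bundled AMReX at a chosen repository and branch once the referenced PR is merged. A sketch, assuming the WarpX_amrex_repo / WarpX_amrex_branch build options and the build directory used in this thread; the branch value is illustrative:

cmake -S . -B build_pm_gpu \
  -DWarpX_COMPUTE=CUDA \
  -DWarpX_amrex_repo=https://github.com/AMReX-Codes/amrex.git \
  -DWarpX_amrex_branch=development
cmake --build build_pm_gpu -j 16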

pordyna (Contributor, Author) commented Mar 1, 2024

@ax3l Did you mean 24.01, or really 23.01?

ax3l (Member) commented Mar 3, 2024

I actually meant 24.02 :D

Can you confirm that version was still ok?

pordyna (Contributor, Author) commented Mar 4, 2024

Haha no problem, I will check it today and let you know.

pordyna (Contributor, Author) commented Mar 8, 2024

@ax3l 24.02 works fine, thanks for solving the issue.

ax3l (Member) commented Mar 8, 2024

Thanks for confirming!

@pordyna the WarpX 24.03 release as of #4759 should also fix this. Thanks for reporting this before the release! 🙏

Please let us know if 24.03 shows any issues for you and we will reopen this.

@ax3l closed this as completed on Mar 8, 2024