200 x slower after restart #4701
Comments
Thank you for the detailed report! It looks like the load balance after restart is far off, with collision physics suddenly dominating the time step. With only 37 steps run after the restart, it looks like you have not yet reached a load balance step. Can you try to add a load balance directly after restart, e.g.,
and see if that helps? |
Hi, thanks for the suggestion, I will try this out. But the same thing happened when I did not have load balancing enabled at all (unless it is enabled by default?). My understanding is that in that case the simulation would continue with the same initial distribution? |
Unfortunately, this did not change anything. |
To be honest, it looks a bit like the subdomains are being swapped or misaligned during restart. |
This looks very broken. Which exact version of WarpX are you using? Thanks for the input files in the original issue. Is there anything more we need to reproduce this restart bug? |
So it looks like I forgot to check out the latest release tag and was running from the development branch, specifically from the following commit: https://github.com/ECP-WarpX/WarpX/tree/a9d8126b500e1c7197eb0ed1e52fd50bb09cbdf4. Could this be the problem? I suppose the setup could be somewhat simplified and still reproduce the bug. |
Did you checkpoint and restart with the exact same version of WarpX? |
Yes, it was all the same version. I was initially just testing automatic restart. OK, I will recompile and check it. |
@ax3l This bug is still there when running on the newest development branch. To be exact on:
|
Additionally, I have observed that some of my simulations crash just after writing a checkpoint (but only sometimes) with the following error message:
(Those simulations ran with the previously mentioned commit, https://github.com/ECP-WarpX/WarpX/tree/a9d8126b500e1c7197eb0ed1e52fd50bb09cbdf4.) The stdouts end like this:
The crash always happens in the step right after the checkpoint is written, but it is a bit confusing that it does not always happen at the same step number. (Do you write the checkpoint asynchronously to the execution?) I would say this suggests that the simulation is already writing corrupted checkpoints, probably by accessing the wrong part of memory, and that this sometimes results in an illegal memory access. |
I think this problem came in this month. X-ref #4735 Please use WarpX We plan to ship a fix in |
@ax3l Did you mean 24.01, or really 23.01? |
I actually meant Can you confirm that version was still ok? |
Haha no problem, I will check it today and let you know. |
@ax3l |
Problem with restarting
Hi!
I was setting up WarpX simulations on Perlmutter and ran into a somewhat weird problem. My simulations run without any problem, but when I restart them from a checkpoint they become extremely slow (the time per step jumps from 200 ms to 40 s!). I don't know what is happening here. Restarting again after a few more (slow) steps doesn't change anything; it stays around 40 s per step. From the verbose output and the profiler, it looks like most of the time is spent in collisions.
I tried switching on load balancing, as well as disabling the sorting for deposition and switching to sorting into cells (probably better for collisions anyway), but this didn't help.
The change in computation time always happens on the first restart, regardless of the time step, so it can't be due to some rapid change in the physics.
last step before checkpoint
first step after restart
Here are my inputs and the full stdout and stderr: input_output.zip
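The jump described above (about 200 ms per step before the checkpoint, about 40 s after restart) is where the factor of 200 in the title comes from. As a minimal sketch, assuming the per-step wall times have already been collected into two lists by hand, the slowdown can be quantified like this. The function name `slowdown_factor` and the sample values are hypothetical and are not part of WarpX or its output format.

```python
from statistics import median


def slowdown_factor(before, after):
    """Return the ratio of the median step time after a restart
    to the median step time before it (both in seconds)."""
    if not before or not after:
        raise ValueError("need at least one timing on each side of the restart")
    return median(after) / median(before)


# Illustrative values only, based on the figures quoted in the report:
# roughly 0.2 s per step before the checkpoint, 40 s per step after restart.
before_restart = [0.2, 0.2, 0.2]
after_restart = [40.0, 40.0, 40.0]

factor = slowdown_factor(before_restart, after_restart)
print(f"slowdown after restart: {factor:.0f}x")
```

Using the median rather than the mean keeps a single unusually slow step (for example, one that writes diagnostics) from skewing the comparison.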