Error running "test_world" on ALCF Theta #350

I'm trying to run test_world on ALCF Theta, but it crashes in the part that tests multiple worlds. The stack trace of the failed execution is attached.
export MPICH_MAX_THREAD_SAFETY=multiple
export MAD_NUM_THREADS=64
export KMP_AFFINITY=spread
export OMP_NUM_THREADS=64
export MAD_BUFFER_SIZE=128MB
aprun -n8 -N1 -d64 -j1 -cc depth ./test_world
I use MADNESS_REVISION "c6ec0e762374f9ae83636954c684dd33252ce993", which is required for TiledArray/MPQC4, compiled under the GNU PE with gcc/8.3.0 and using the 2019 Intel TBB.
Any idea what is going wrong?
stack.txt
<https://github.com/m-a-d-n-e-s-s/madness/files/4759456/stack.txt>
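For reference, MPICH_MAX_THREAD_SAFETY=multiple is typically what allows Cray MPICH to grant MPI_THREAD_MULTIPLE, which a multithreaded runtime needs. A minimal standalone check of the thread level actually provided under these settings, a sketch only (the file name is arbitrary and this program is not part of test_world):

// check_thread_level.cpp (hypothetical standalone check, not MADNESS code)
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    int provided = 0;
    // Request the highest thread-support level and report what was granted.
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        std::printf("requested MPI_THREAD_MULTIPLE (%d), provided %d\n",
                    MPI_THREAD_MULTIPLE, provided);
    MPI_Finalize();
    return 0;
}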
|
I'll try to reproduce on Linux. Superficially, it seems that destruction of a WorldObject (which is deferred until the next global fence operation) has been deferred past the point where there is sufficient state available to destroy it. But it could be a more mundane problem.
-- Robert J. Harrison
|
Could you try:
export MPICH_MAX_THREAD_SAFETY=multiple
export MAD_NUM_THREADS=8
export KMP_AFFINITY=spread
unset OMP_NUM_THREADS
export MAD_BUFFER_SIZE=128MB
aprun -n8 -N1 -cc none ./test_world
|
I've run this once (same number of threads and processes) on a 72-core
intel box without a problem. Let me run it a few dozen more times and see
if I can trigger something.
-- Robert J. Harrison
|
I tried running the job with different environment and aprun settings on Theta, and I see that the general behavior of "test_world" is largely the same. The job performs one or more repetitions and then hangs for 15 min; after that the Madness runtime throws an exception and terminates the MPI processes.
If I use 8 MPI tasks and 8 threads per node, the job hangs on the 2nd repetition. The log files for the 0th and 1st processes are
log-08.00000.txt <https://github.com/m-a-d-n-e-s-s/madness/files/4761279/log-08.00000.txt>
and
log-08.00001.txt <https://github.com/m-a-d-n-e-s-s/madness/files/4761282/log-08.00001.txt>
Increasing the number of threads from 8 to 64, while keeping the number of MPI tasks the same (8), let the job progress a little further; this time it hung on the 4th repetition. Here are the outputs for the 0th and 1st processes:
log-64.00000.txt <https://github.com/m-a-d-n-e-s-s/madness/files/4761299/log-64.00000.txt>
and
log-64.00001.txt <https://github.com/m-a-d-n-e-s-s/madness/files/4761302/log-64.00001.txt>
It appears that making all 10 repetitions pass is problematic on Theta. I am trying to see whether increasing the number of MPI tasks changes anything significantly.
|
That's helpful. Once it hangs, there is a roughly 20-minute timeout that is there to detect exactly this scenario.
Can you attach a debugger and see where it is hanging before the timeout kicks in?
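For anyone reproducing this, one way to collect such a traceback with gdb, assuming gdb is available on the compute node where the rank-0 process runs (the exact attach procedure on Theta may differ):

# find the pid of the rank-0 test_world process on its node, then attach
gdb -p <pid>
(gdb) thread 1              # the main thread is normally thread 1
(gdb) bt                    # backtrace of the main thread
(gdb) thread apply all bt   # optionally, backtraces of every thread
(gdb) detach
(gdb) quit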
|
Running the test job under a debugger showed that the hang is actually an infinite loop in madness/world/thread.h. Each process loops through lines 1438, 1440, 1441, 1443, 1448, 1458, 1477, and back again to line 1438, and so on. I checked that pattern with MPI task 0 and MPI task 1; the other processes must be doing the same. Any suggestion on how to determine the cause of the infinite loop?
To avoid discrepancies between different branches of the code, I'm attaching a copy of thread.h:
thread.h.txt <https://github.com/m-a-d-n-e-s-s/madness/files/4762688/thread.h.txt>
|
Thanks, that helps a little, but could I get a traceback of the main thread for process 0?
The loop you found is just waiting for work to complete; that work has apparently been lost somewhere.
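As an illustration only (this is not the actual MADNESS thread.h code, and all names here are hypothetical): the loop described above is an await pattern that keeps draining the task queue while a completion probe stays false, with a watchdog that throws once a deadline passes, roughly like this sketch.

#include <chrono>
#include <functional>
#include <stdexcept>
#include <thread>

// Spin until probe() is true, helping to run queued tasks while waiting,
// and throw if the deadline passes (a lost task becomes an error, not a silent hang).
void await(std::function<bool()> probe,
           std::function<bool()> run_one_task,
           std::chrono::minutes deadline = std::chrono::minutes(20)) {
    const auto start = std::chrono::steady_clock::now();
    while (!probe()) {
        if (!run_one_task())              // nothing to run: back off politely
            std::this_thread::yield();
        if (std::chrono::steady_clock::now() - start > deadline)
            throw std::runtime_error("await timed out: outstanding work appears to be lost");
    }
}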
|
Here is the stack trace of process 0 when it enters an infinite loop. |
It is apparently test11 that causes the trouble. If I comment it out, the entire workload of test_world, including all 10 repetitions, completes successfully. |
Fantastic. That helps.
Can you please rebuild with TBB disabled?
-DENABLE_TBB=OFF
-- Robert J. Harrison
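A sketch of that reconfigure step, assuming an out-of-source CMake build (the paths and job count are placeholders):

# rerun CMake with TBB disabled, then rebuild and rerun the test
cd /path/to/madness-build
cmake -DENABLE_TBB=OFF /path/to/madness
make -j8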
|
I recompiled everything with ENABLE_TBB=OFF and reran test_world under the debugger. The job hung (entered an infinite loop) again on the second repetition. The stack trace collected when the job stopped making progress is attached. |
The original test_world.cc has the line
SafeMPI::Intracomm comm = world.mpi.comm().Split(SafeMPI::Intracomm::SHARED_SPLIT_TYPE, 0);
which does not do any splitting. I added some print statements to illustrate that no rank splitting happens here.
mpiexec -n 8 -ppn 4 ./test_world
log.00000:Split_color: iam = 0 node = iris08.ftm.alcf.anl.gov
log.00001:Split_color: iam = 1 node = iris08.ftm.alcf.anl.gov
log.00002:Split_color: iam = 2 node = iris08.ftm.alcf.anl.gov
log.00003:Split_color: iam = 3 node = iris08.ftm.alcf.anl.gov
log.00004:Split_color: iam = 4 node = iris18.ftm.alcf.anl.gov
log.00005:Split_color: iam = 5 node = iris18.ftm.alcf.anl.gov
log.00006:Split_color: iam = 6 node = iris18.ftm.alcf.anl.gov
log.00007:Split_color: iam = 7 node = iris18.ftm.alcf.anl.gov
Apparently, the invoked method should be Split_type() rather than Split() in this place:
SafeMPI::Intracomm comm = world.mpi.comm().Split_type(SafeMPI::Intracomm::SHARED_SPLIT_TYPE, 0);
In this form the code performs the correct splitting of ranks, creating a local communicator on each of the two available nodes.
log.00000:Split_type: iam = 0 node = iris08.ftm.alcf.anl.gov
log.00001:Split_type: iam = 1 node = iris08.ftm.alcf.anl.gov
log.00002:Split_type: iam = 2 node = iris08.ftm.alcf.anl.gov
log.00003:Split_type: iam = 3 node = iris08.ftm.alcf.anl.gov
log.00004:Split_type: iam = 0 node = iris18.ftm.alcf.anl.gov
log.00005:Split_type: iam = 1 node = iris18.ftm.alcf.anl.gov
log.00006:Split_type: iam = 2 node = iris18.ftm.alcf.anl.gov
log.00007:Split_type: iam = 3 node = iris18.ftm.alcf.anl.gov
Multiple communicators in test11 still hang, so I'm continuing to debug that issue.
|
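For context on the difference: in the MPI standard, MPI_Comm_split treats its second argument as an arbitrary color, so passing one shared value to every rank produces a single communicator containing all ranks (no per-node split), while MPI_Comm_split_type interprets that argument as a split type such as MPI_COMM_TYPE_SHARED and groups ranks by shared-memory node. A minimal sketch with the raw MPI C API (SafeMPI presumably wraps these calls, but the wrapper internals are not shown in this thread):

#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int world_rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    // Split: the second argument is only a color. Every rank passes the same
    // value, so all ranks land in one communicator and no per-node split occurs.
    MPI_Comm by_color;
    MPI_Comm_split(MPI_COMM_WORLD, /*color=*/0, /*key=*/0, &by_color);

    // Split_type: MPI_COMM_TYPE_SHARED groups ranks that share a node's memory,
    // giving one communicator per node, as the Split_type logs above show.
    MPI_Comm by_node;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, /*key=*/0,
                        MPI_INFO_NULL, &by_node);

    int color_rank = 0, node_rank = 0;
    MPI_Comm_rank(by_color, &color_rank);
    MPI_Comm_rank(by_node, &node_rank);
    std::printf("world %d: rank %d in Split comm, rank %d in Split_type comm\n",
                world_rank, color_rank, node_rank);

    MPI_Comm_free(&by_color);
    MPI_Comm_free(&by_node);
    MPI_Finalize();
    return 0;
}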
Victor ... thanks for the info.
I have a fix we are testing for test11. I would not invest more time in
that until we push that fix.
Which MPI are you using here? OpenMPI?
-- Robert J. Harrison
|
I'm using MPICH.