Error running "test_world" on ALCF Theta #350

Open
victor-anisimov opened this issue Jun 10, 2020 · 14 comments

@victor-anisimov

I'm trying to run test_world on ALCF Theta, but it crashes in the part that tests multiple worlds. The stack trace of the failed execution is attached.

export MPICH_MAX_THREAD_SAFETY=multiple
export MAD_NUM_THREADS=64
export KMP_AFFINITY=spread
export OMP_NUM_THREADS=64
export MAD_BUFFER_SIZE=128MB
aprun -n8 -N1 -d64 -j1 -cc depth ./test_world

I use MADNESS_REVISION "c6ec0e762374f9ae83636954c684dd33252ce993", which is required by TiledArray/MPQC4, compiled under the GNU programming environment with gcc/8.3.0 and the 2019 Intel TBB.

Any idea what is going wrong?
stack.txt

@robertjharrison
Contributor

robertjharrison commented Jun 10, 2020 via email

@alvarovm
Contributor

Could you try:
export MPICH_MAX_THREAD_SAFETY=multiple
export MAD_NUM_THREADS=8
export KMP_AFFINITY=spread
unset OMP_NUM_THREADS
export MAD_BUFFER_SIZE=128MB
aprun -n8 -N1 -cc none ./test_world

@robertjharrison
Contributor

robertjharrison commented Jun 10, 2020 via email

@victor-anisimov
Author

I tried running the job with various environment and aprun settings on Theta, and the general behavior of "test_world" is largely the same. The job performs one or more repetitions and then hangs for 15 min; after that the MADNESS runtime throws an exception and terminates the MPI processes.

If I use 8 MPI tasks and 8 threads per node, the job hangs on the 2nd repetition.
The log files for the 0th and 1st processes are
log-08.00000.txt
and
log-08.00001.txt

Increasing the number of threads from 8 to 64, while keeping the number of MPI tasks the same (8), let the job progress a little further; this time it hung on the 4th repetition. Here are the outputs for the 0th and 1st processes:
log-64.00000.txt
and
log-64.00001.txt

It appears that getting all 10 repetitions to pass is problematic on Theta. I'm trying to see whether increasing the number of MPI tasks changes anything significantly.

@robertjharrison
Contributor

robertjharrison commented Jun 11, 2020 via email

@victor-anisimov
Author

Running the test job under a debugger showed that the hang is actually an infinite loop in madness/world/thread.h. Each process loops through lines 1438, 1440, 1441, 1443, 1448, 1458, 1477, and back to line 1438, and so on. I checked that pattern on MPI tasks 0 and 1; the other processes must be doing the same. Any suggestions on how to determine the cause of the infinite loop?

To avoid discrepancies between different branches of the code, I'm attaching a copy of thread.h:
thread.h.txt

@robertjharrison
Contributor

robertjharrison commented Jun 11, 2020 via email

@victor-anisimov
Author

Here is the stack trace of process 0 when it enters an infinite loop.
stack-2.txt

@victor-anisimov
Author

It is apparently test11 that causes the trouble. If I comment it out, the entire test_world workload, including all 10 repetitions, completes successfully.

@robertjharrison
Contributor

robertjharrison commented Jun 11, 2020 via email

@victor-anisimov
Author

victor-anisimov commented Jun 11, 2020

I recompiled everything with ENABLE_TBB=OFF and reran "test_world" under the debugger. The job hung (entered an infinite loop) again on the second repetition. The stack trace collected when the job stopped making progress is attached.
stack-3.txt

@victor-anisimov
Author

The original test_world.cc contains the line
SafeMPI::Intracomm comm = world.mpi.comm().Split(SafeMPI::Intracomm::SHARED_SPLIT_TYPE, 0);
which does not do any splitting. I added some print statements to show that no rank splitting happens here.
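
The print statements were roughly of this form (a hedged reconstruction, not the exact lines added to test_world.cc; it assumes SafeMPI::Intracomm exposes Get_rank() in the style of the MPI C++ bindings it mirrors, and uses the standard MPI_Get_processor_name()):

// Hedged reconstruction of the diagnostic; 'comm' is the communicator
// returned by Split() (and later by Split_type()).
char node[MPI_MAX_PROCESSOR_NAME];
int len = 0;
MPI_Get_processor_name(node, &len);                    // host name this rank runs on
std::cout << "Split_color: iam = " << comm.Get_rank()  // rank within the new communicator
          << " node = " << node << std::endl;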

mpiexec -n 8 -ppn 4 ./test_world

log.00000:Split_color: iam = 0 node = iris08.ftm.alcf.anl.gov
log.00001:Split_color: iam = 1 node = iris08.ftm.alcf.anl.gov
log.00002:Split_color: iam = 2 node = iris08.ftm.alcf.anl.gov
log.00003:Split_color: iam = 3 node = iris08.ftm.alcf.anl.gov
log.00004:Split_color: iam = 4 node = iris18.ftm.alcf.anl.gov
log.00005:Split_color: iam = 5 node = iris18.ftm.alcf.anl.gov
log.00006:Split_color: iam = 6 node = iris18.ftm.alcf.anl.gov
log.00007:Split_color: iam = 7 node = iris18.ftm.alcf.anl.gov

Apparently, the method invoked here should be Split_type() rather than Split():
SafeMPI::Intracomm comm = world.mpi.comm().Split_type(SafeMPI::Intracomm::SHARED_SPLIT_TYPE, 0);
In this form the code splits the ranks correctly, creating a local communicator on each of the two available nodes.

log.00000:Split_type: iam = 0 node = iris08.ftm.alcf.anl.gov
log.00001:Split_type: iam = 1 node = iris08.ftm.alcf.anl.gov
log.00002:Split_type: iam = 2 node = iris08.ftm.alcf.anl.gov
log.00003:Split_type: iam = 3 node = iris08.ftm.alcf.anl.gov
log.00004:Split_type: iam = 0 node = iris18.ftm.alcf.anl.gov
log.00005:Split_type: iam = 1 node = iris18.ftm.alcf.anl.gov
log.00006:Split_type: iam = 2 node = iris18.ftm.alcf.anl.gov
log.00007:Split_type: iam = 3 node = iris18.ftm.alcf.anl.gov
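
For reference, a minimal plain-MPI sketch of the difference between the two calls (an illustration only, not code from test_world.cc; it uses only standard MPI-3 routines):

#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    // Split with a single color: every rank passes color 0, so the result is
    // just a copy of MPI_COMM_WORLD -- no node-local splitting takes place.
    MPI_Comm by_color;
    MPI_Comm_split(MPI_COMM_WORLD, /*color=*/0, /*key=*/0, &by_color);

    // Split_type with MPI_COMM_TYPE_SHARED: ranks are grouped by shared-memory
    // node, so each node gets its own communicator with ranks starting at 0.
    MPI_Comm by_node;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, /*key=*/0,
                        MPI_INFO_NULL, &by_node);

    int rank_split, rank_split_type;
    MPI_Comm_rank(by_color, &rank_split);
    MPI_Comm_rank(by_node, &rank_split_type);
    printf("world %d: rank after Split = %d, rank after Split_type = %d\n",
           world_rank, rank_split, rank_split_type);

    MPI_Comm_free(&by_color);
    MPI_Comm_free(&by_node);
    MPI_Finalize();
    return 0;
}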

Multiple communicators in test11 still hang, so I'm continuing to debug that issue.

@robertjharrison
Contributor

robertjharrison commented Jul 21, 2020 via email

@victor-anisimov
Author

I'm using MPICH.
