Error running "test_world" on ALCF Theta #350

Open
victor-anisimov opened this issue Jun 10, 2020 · 14 comments

@victor-anisimov

I'm trying to run test_world on ALCF Theta, but it crashes in the part that tests multiple worlds. The stack trace of the failed execution is attached.

export MPICH_MAX_THREAD_SAFETY=multiple
export MAD_NUM_THREADS=64
export KMP_AFFINITY=spread
export OMP_NUM_THREADS=64
export MAD_BUFFER_SIZE=128MB
aprun -n8 -N1 -d64 -j1 -cc depth ./test_world

I use MADNESS_REVISION "c6ec0e762374f9ae83636954c684dd33252ce993", which is required by TiledArray/MPQC4, compiled under the GNU programming environment with gcc/8.3.0 and the 2019 Intel TBB.

Any idea what is going wrong?
stack.txt

@robertjharrison
Contributor

robertjharrison commented Jun 10, 2020 via email

@alvarovm
Contributor

Could you try:
export MPICH_MAX_THREAD_SAFETY=multiple
export MAD_NUM_THREADS=8
export KMP_AFFINITY=spread
unset OMP_NUM_THREADS
export MAD_BUFFER_SIZE=128MB
aprun -n8 -N1 -cc none ./test_world

@robertjharrison
Contributor

robertjharrison commented Jun 10, 2020 via email

@victor-anisimov
Author

I tried running the job with various environment and aprun settings on Theta, and the general behavior of "test_world" is largely the same. The job performs one or more repetitions and then hangs for 15 min; after that the MADNESS runtime throws an exception and terminates the MPI processes.

If I use 8 MPI tasks and 8 threads per node, the job hangs on the 2nd repetition.
The log files for the 0th and 1st processes are
log-08.00000.txt
and
log-08.00001.txt

Increasing the number of threads from 8 to 64, while keeping the number of MPI tasks the same (8), let the job progress a little further; this time it hung on the 4th repetition. Here are the outputs for the 0th and 1st processes:
log-64.00000.txt
and
log-64.00001.txt

It appears that getting all 10 repetitions to pass is problematic on Theta. I'm trying to see whether increasing the number of MPI tasks changes anything significantly.

@robertjharrison
Contributor

robertjharrison commented Jun 11, 2020 via email

@victor-anisimov
Author

Running the test job under a debugger showed that the hang is actually an infinite loop in madness/world/thread.h. Each process loops through lines 1438, 1440, 1441, 1443, 1448, 1458, 1477, and back to line 1438, and so on. I checked that pattern on MPI tasks 0 and 1; the other processes must be doing the same. Any suggestions on how to determine the cause of the infinite loop?

To avoid discrepancies between different branches of the code, I'm attaching a copy of thread.h:
thread.h.txt

@robertjharrison
Contributor

robertjharrison commented Jun 11, 2020 via email

@victor-anisimov
Author

Here is the stack trace of process 0 when it enters an infinite loop.
stack-2.txt

@victor-anisimov
Author

It is apparently test11 that causes the trouble. If I comment it out, the entire test_world workload, including all 10 repetitions, completes successfully.

@robertjharrison
Contributor

robertjharrison commented Jun 11, 2020 via email

@victor-anisimov
Author

victor-anisimov commented Jun 11, 2020

I recompiled everything with ENABLE_TBB=OFF and reran "test_world" under the debugger. The job hung (entered an infinite loop) again on the second repetition. The stack trace collected when the job stopped making progress is attached.
stack-3.txt

@victor-anisimov
Author

The original test_world.cc contains the line
SafeMPI::Intracomm comm = world.mpi.comm().Split(SafeMPI::Intracomm::SHARED_SPLIT_TYPE, 0);
which does not do any splitting. I added some print statements to show that no rank splitting happens here.
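
The print statements were roughly of this form (a hedged reconstruction, not the exact lines added to test_world.cc; it assumes SafeMPI::Intracomm exposes Get_rank() in the style of the MPI C++ bindings it mirrors, and uses the standard MPI_Get_processor_name()):

// Hedged reconstruction of the diagnostic; 'comm' is the communicator
// returned by Split() (and later by Split_type()).
char node[MPI_MAX_PROCESSOR_NAME];
int len = 0;
MPI_Get_processor_name(node, &len);                    // host name this rank runs on
std::cout << "Split_color: iam = " << comm.Get_rank()  // rank within the new communicator
          << " node = " << node << std::endl;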

mpiexec -n 8 -ppn 4 ./test_world

log.00000:Split_color: iam = 0 node = iris08.ftm.alcf.anl.gov
log.00001:Split_color: iam = 1 node = iris08.ftm.alcf.anl.gov
log.00002:Split_color: iam = 2 node = iris08.ftm.alcf.anl.gov
log.00003:Split_color: iam = 3 node = iris08.ftm.alcf.anl.gov
log.00004:Split_color: iam = 4 node = iris18.ftm.alcf.anl.gov
log.00005:Split_color: iam = 5 node = iris18.ftm.alcf.anl.gov
log.00006:Split_color: iam = 6 node = iris18.ftm.alcf.anl.gov
log.00007:Split_color: iam = 7 node = iris18.ftm.alcf.anl.gov

Apparently, the method invoked here should be Split_type() rather than Split():
SafeMPI::Intracomm comm = world.mpi.comm().Split_type(SafeMPI::Intracomm::SHARED_SPLIT_TYPE, 0);
In this form the code splits the ranks correctly, creating a local communicator on each of the two available nodes.

log.00000:Split_type: iam = 0 node = iris08.ftm.alcf.anl.gov
log.00001:Split_type: iam = 1 node = iris08.ftm.alcf.anl.gov
log.00002:Split_type: iam = 2 node = iris08.ftm.alcf.anl.gov
log.00003:Split_type: iam = 3 node = iris08.ftm.alcf.anl.gov
log.00004:Split_type: iam = 0 node = iris18.ftm.alcf.anl.gov
log.00005:Split_type: iam = 1 node = iris18.ftm.alcf.anl.gov
log.00006:Split_type: iam = 2 node = iris18.ftm.alcf.anl.gov
log.00007:Split_type: iam = 3 node = iris18.ftm.alcf.anl.gov
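
For reference, a minimal plain-MPI sketch of the difference between the two calls (an illustration only, not code from test_world.cc; it uses only standard MPI-3 routines):

#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    // Split with a single color: every rank passes color 0, so the result is
    // just a copy of MPI_COMM_WORLD -- no node-local splitting takes place.
    MPI_Comm by_color;
    MPI_Comm_split(MPI_COMM_WORLD, /*color=*/0, /*key=*/0, &by_color);

    // Split_type with MPI_COMM_TYPE_SHARED: ranks are grouped by shared-memory
    // node, so each node gets its own communicator with ranks starting at 0.
    MPI_Comm by_node;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, /*key=*/0,
                        MPI_INFO_NULL, &by_node);

    int rank_split, rank_split_type;
    MPI_Comm_rank(by_color, &rank_split);
    MPI_Comm_rank(by_node, &rank_split_type);
    printf("world %d: rank after Split = %d, rank after Split_type = %d\n",
           world_rank, rank_split, rank_split_type);

    MPI_Comm_free(&by_color);
    MPI_Comm_free(&by_node);
    MPI_Finalize();
    return 0;
}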

Multiple communicators in test11 still hang, so I'm continuing to debug that issue.

@robertjharrison
Contributor

robertjharrison commented Jul 21, 2020 via email

@victor-anisimov
Author

I'm using MPICH.
