SST: unexpected connection close event on Summit #3269

Closed
cwsmith opened this issue Jun 30, 2022 · 6 comments

Comments

@cwsmith
Contributor

cwsmith commented Jun 30, 2022

Describe the bug

The following repo has an example that creates IOs and Engines for sending and receiving data between a server and client.

https://github.com/SCOREC/adios2SstTest/tree/6965c970f67b7404a70daa0b7dd12f49f0862521

Following the build and run instructions in the README results in the following output from the job using SST:

$ cat serverSst_2233751.out
rank 0 isServer 1
done s2c step
Writer 0 (0x3a15a180): Reader 0 (0x3a55d970): Message failed to send to writer 0
(0x18eef7c0)
done c2s step
Got an unexpected connection close event
SST stream open at exit, unlinking contact file (
done

$ cat clientSst_2233751.out
rank 0 isServer 0
done s2c step
done c2s step
SST stream open at exit, unlinking contact file
/gpfs/alpine/fus123/scratch/cwsmith/twoClientWdmCplTesting/client0_c2s.sst
done

Note, the repo contains the output logs from a run with SstVerbose=5.

I have an application that uses this IO and Engine setup logic. It appears to complete three data exchange iterations, but at the end of execution it prints the "unexpected connection close event" and "SST stream open at exit" messages seen above in the simple example. I'm concerned that there is an underlying problem that may cause failures in larger or longer runs, and I would like to fix it.

To Reproduce

See https://github.com/SCOREC/adios2SstTest/blob/6965c970f67b7404a70daa0b7dd12f49f0862521/README.md

Expected behavior

I expect there to be no warnings/errors when running with SST.

System and Environment

  • System: Summit
  • OS/Platform: RHEL8
  • Compiler: gcc/10.2.0 system module
  • CMake: cmake/3.21.3 system module
  • Adios2: adios2/2.7.1 system module

Additional context

None

Following up

This was an ID10T error. I was missing the calls to Engine Close().

@eisenhauer
Member

Thanks for the report. Unfortunately it can be very difficult to sort out the proximate cause for this sort of thing, even with full verbosity. I'll be trying to duplicate the failure scenario myself to see if I can sort it out.

@cwsmith
Contributor Author

cwsmith commented Jul 5, 2022

Hi @eisenhauer. You're welcome.

On a RHEL7 workstation using ADIOS 2.8.0.84, the server/writer process prints the same "unexpected connection close event" and "SST stream open at exit, unlinking contact file" messages that were observed on Summit. The details were added to the test repo:

https://github.com/SCOREC/adios2SstTest/blob/8f854f8ce8f15dd4a67b08594f5c9bb3c1eac56b/README.md#scorec-rhel7

@cwsmith
Contributor Author

cwsmith commented Jul 5, 2022

Adding calls to engine Close() on the RHEL7 workstation results in a clean exit and no errors under valgrind. SCOREC/adios2SstTest@7274032

Testing on Summit now. The code exits cleanly on Summit now.

cwsmith closed this as completed Jul 5, 2022

@eisenhauer
Member

Ah yes, that will do it...

@cwsmith
Contributor Author

cwsmith commented Jul 5, 2022

@eisenhauer Is there any chance the created/opened Engines can be closed by the Engine or IO destructor? IIUC, the IO and adios objects are RAII and having the Engine clean up after itself would be nice.
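
For illustration only, the RAII behaviour asked about here can be approximated on the user side with a small guard object. The sketch below is not ADIOS2 code: `ScopedClose` is a hypothetical helper, and `FakeEngine` is a stand-in so the example compiles without ADIOS2. It assumes the wrapped type is a cheaply copyable handle with a `Close()` method, which is how the `adios2::Engine` C++ binding is understood to behave.

```cpp
#include <utility>

// Hypothetical helper (not part of ADIOS2): calls Close() on the wrapped
// engine when the guard leaves scope, unless Close() was already called.
template <typename EngineT>
class ScopedClose {
public:
    explicit ScopedClose(EngineT engine) : engine_(std::move(engine)) {}
    ScopedClose(const ScopedClose&) = delete;
    ScopedClose& operator=(const ScopedClose&) = delete;

    ~ScopedClose() {
        if (!closed_) {
            // Destructors must not throw, so errors are swallowed here.
            try { engine_.Close(); } catch (...) {}
        }
    }

    // Explicit close for callers who want to see errors from Close().
    void Close() {
        if (!closed_) {
            closed_ = true;  // set first so the destructor never retries
            engine_.Close();
        }
    }

    EngineT& get() { return engine_; }

private:
    EngineT engine_;
    bool closed_ = false;
};

// Stand-in for an engine handle; it only counts Close() calls.
struct FakeEngine {
    int* closeCount;
    void Close() { ++(*closeCount); }
};

// Returns how many times Close() ran for one guarded engine.
int closeCallsAfterScope(bool explicitClose) {
    int count = 0;
    {
        ScopedClose<FakeEngine> guard(FakeEngine{&count});
        // ... BeginStep / Put or Get / EndStep would go here ...
        if (explicitClose) guard.Close();
    }  // the guard's destructor calls Close() here if it has not run yet
    return count;
}
```

With a real engine, the guard would need to go out of scope on every rank and before MPI is finalized, since closing a stream may block and may involve collective communication.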

@eisenhauer
Member

That's probably worth talking about. Currently I don't think any of the engines have anything other than the default destructor. Mostly that works out OK for the file engines: open FDs will get cleaned up and already-written data flushed, and there are no special measures that one has to take to 'finalize' the open files. SST is a different beast though, with a lot of interaction between the readers and writers. Clean shutdown is a complicated dance. But perhaps we can simply call Close() in the destructor. My worry would be that, for writer engines for example, Close() can take a long time because it might be waiting on its peer to consume all the data. Close() also involves MPI collective operations (to make sure that no rank goes away before all the others). I'm not sure how I feel about that being embedded in a destructor. But at a minimum we could emit a warning or throw an exception if the destructor is called without Close() having been called first.
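
A minimal sketch of the "warn if destroyed without Close()" idea, under stated assumptions: `SketchEngine` is a hypothetical stand-in class, not an ADIOS2 engine, and the log stream parameter exists only so the behaviour can be observed. The sketch warns rather than throws because C++11 destructors are implicitly `noexcept`, so an exception escaping one would terminate the program.

```cpp
#include <iostream>
#include <sstream>
#include <string>
#include <utility>

// Hypothetical stand-in for an engine class; not ADIOS2 code.
class SketchEngine {
public:
    explicit SketchEngine(std::string name, std::ostream& log = std::cerr)
        : name_(std::move(name)), log_(log) {}

    // A real engine would do its shutdown handshake here.
    void Close() { closed_ = true; }

    ~SketchEngine() {
        if (!closed_) {
            // Report the misuse without attempting a late, blocking Close().
            log_ << "warning: engine '" << name_
                 << "' destroyed without Close()\n";
        }
    }

private:
    std::string name_;
    std::ostream& log_;
    bool closed_ = false;
};

// Returns whatever the engine logged during its lifetime.
std::string warningsFor(bool callClose) {
    std::ostringstream log;
    {
        SketchEngine engine("demo_stream", log);
        if (callClose) engine.Close();
    }  // destructor runs here
    return log.str();
}
```

Here `warningsFor(true)` yields an empty string, while `warningsFor(false)` yields the warning line, which would have flagged the missing Close() calls reported in this issue without putting blocking or collective work into the destructor.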
