POC : use a unique top_session_dir directory unless direct launch'ed #2088
Conversation
@rhc54 this commit implements the truly unique top_session_dir. @artpol84 i saw the note you defined; the first commit comes from #2084, it is only here in order to prevent a merge conflict |
Will this work if tmpdir is on the shared FS? |
@ggouaillardet |
Well, i call `mkdtemp("$TMP/ompi../XXXXXX")`. Thanks for the |
Build Failed with PGI compiler! Please review the log, and get in touch if you have questions. Gist: https://gist.github.com/f60fd58c59b9485b2c564d12937fd1b3 |
So you are making the assumption that mkstemp will check the directory content before creating the directory. Consider 2 options:
According to google, it seems that mkstemp is supposed to work over NFS. So I guess the disadvantage of this PR is possible job-start delays. |
btw according to http://www-01.ibm.com/support/docview.wss?uid=isg1IV45350 you need to call mkstemp another time if getting |
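For context, a minimal sketch of that kind of workaround, assuming the errno being referred to is EEXIST (the exact errno is cut off above) and that simply retrying mkstemp with a fresh template is what is recommended; this is illustrative only, not Open MPI code:

```c
#define _XOPEN_SOURCE 700
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: retry mkstemp a few times if it fails with EEXIST,
 * which can happen under races on shared filesystems such as NFS.
 * Sketch of the workaround discussed above, not Open MPI code. */
static int mkstemp_retry(const char *tmpl, char *out, size_t outlen, int max_tries)
{
    for (int i = 0; i < max_tries; i++) {
        /* mkstemp modifies its argument, so start from a fresh copy each try */
        snprintf(out, outlen, "%s", tmpl);
        int fd = mkstemp(out);
        if (fd >= 0) {
            return fd;            /* success: 'out' now holds the unique name */
        }
        if (errno != EEXIST) {
            return -1;            /* any other error is fatal */
        }
        /* EEXIST: candidate names were taken (or an NFS race); try again */
    }
    errno = EEXIST;
    return -1;
}

int main(void)
{
    char path[64];
    int fd = mkstemp_retry("/tmp/ompi-test-XXXXXX", path, sizeof(path), 3);
    if (fd < 0) {
        perror("mkstemp_retry");
        return 1;
    }
    printf("created %s\n", path);
    return 0;
}
```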
@ggouaillardet sorry I missed your point about hostname in the tempdir name. Indeed this is sufficient to avoid conflicts. |
I'm not entirely sure that this is doing the right thing, so let me explain how the session dir works. The top-level directory is set up solely on the basis of the node/user, and thus is shared across multiple mpirun's. People often do run multiple mpirun's in parallel, so you cannot just blow that directory away at the end of any one execution. Underneath that top-level dir, each mpirun creates its own jobfamily-level directory based on its pid. This directory is indeed unique to a given mpirun, and it can be safely destroyed when that mpirun completes. However, if it has "output" files in it, then you cannot destroy the directory tree as the user needs those files! This is why our cleanup program carefully checks every filename before deleting it. So how does this proposal take these matters into account? |
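To make that layout concrete, here is a rough sketch of how the two levels described above could be composed; the path components and naming are placeholders, not the exact strings ORTE uses:

```c
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Illustrative only: the top level depends only on node/user (shared by
 * every mpirun this user runs on this node), while the job-family level
 * depends on this mpirun's pid (unique per mpirun). */
int main(void)
{
    char top[4096], jobfam[4096];
    const char *tmp = "/tmp";
    const char *hostname = "node001";   /* placeholder hostname */
    uid_t uid = getuid();

    /* shared across every mpirun run by this user on this node */
    snprintf(top, sizeof(top), "%s/openmpi-sessions-%u@%s",
             tmp, (unsigned)uid, hostname);

    /* unique to this particular mpirun */
    snprintf(jobfam, sizeof(jobfam), "%s/%ld", top, (long)getpid());

    printf("top_session_dir   : %s\n", top);
    printf("jobfam_session_dir: %s\n", jobfam);
    return 0;
}
```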
well, this is not exactly what master is doing. from
and
bottom line, i see some potential issues here. this part has been changed recently on master, mainly in 81195ab, but i think the same issue exists in v2.x (e.g. a job family collision can occur), so unless i am missing something : |
Something doesn't sound right here - it could be that something got inadvertently changed during a prior commit. OMPI has always maintained a single top-level directory, with each mpirun creating a job-family directory underneath it. Each mpirun cleaned up by working its way up the directory tree, removing its own entries and then testing to see if things were empty. Thus, the top-level directory was removed by the "last-one-out" method. Direct-launched jobs do indeed use a variation of this, but there should not be three levels of variation. Such jobs likewise used the same "last-one-out" cleanup method. All this has worked for nearly a decade now. It may be that the PMIx usock rendezvous files weren't positioned in the right place, and/or the cleanup code was not adjusted to remove them. If so, then perhaps that is what needs to be addressed. I'd rather not commit a change to the basic session directory system until we fully detail what we are doing and ensure that all use-cases are properly handled. It's clear that things have evolved in a manner that may not have been fully considered, so let's take a moment to step back and figure out what we really want this setup to do. |
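A rough sketch of that "last-one-out" idea, illustrative only and not the actual orte session_dir cleanup code: each mpirun removes only its own entries, then tries to rmdir each parent, and rmdir simply fails while another mpirun still has files there:

```c
#include <errno.h>
#include <unistd.h>

/* Sketch of last-one-out cleanup: after removing this mpirun's own
 * job-family directory contents, walk up the tree and rmdir each level.
 * rmdir() only succeeds on an empty directory, so the top-level directory
 * disappears only when the last mpirun on the node cleans up. */
static void last_one_out(const char *jobfam_dir, const char *top_dir)
{
    /* assume this mpirun's own files under jobfam_dir were already removed,
     * after checking each filename (never deleting user "output" files) */
    if (rmdir(jobfam_dir) != 0 && errno != ENOENT) {
        return;   /* something is still in use; leave the tree alone */
    }
    if (rmdir(top_dir) != 0 && errno == ENOTEMPTY) {
        /* another mpirun still has a job-family directory here: not our job */
    }
}

int main(void)
{
    /* placeholder paths for illustration */
    last_one_out("/tmp/openmpi-sessions-1000@node001/12345",
                 "/tmp/openmpi-sessions-1000@node001");
    return 0;
}
```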
Please see below. Best regards, Artem Polyakov |
@rhc54 I guess that @ggouaillardet is mentioning these 3 cases: |
We do use the RM-provided jobid - we always have. The only time we generate one ourselves is when we launch via mpirun, and that is required as not every RM provides us with a unique jobid for each mpirun invocation (in fact, if we use the ssh PLM, then we never get a "stepid"). So mpirun will always generate its own, but that has nothing to do with the session dir question. |
If we launched with mpirun, I don't think we use the RM's jobid. |
But we can in the case of SLURM and Torque. |
No, we can't - there can be multiple mpirun invocations within a given job, and there is no guarantee those are going to launch via srun. Thus, there is no guarantee they are being assigned a unique "stepid", and they all share the same jobid (what we would call a "session" id as it is related to the allocation and not the specific mpirun). We've been thru this multiple times over the years - truly, there is a reason why we do things this way 😄 |
If we are using the SLURM plm, we can use the jobid, can't we? |
It's the PLM that is responsible for generating jobids, so why do we avoid using them if the SLURM/TORQUE/whatever plm was selected? |
And jobid generation certainly has something to do with session directory creation, because it affects directory names, and here we are trying to solve the collision problem. |
What I'm suggesting is to have the jobfam generated using the RM's JOBID. |
@artpol84 you are right about the commit id that introduced the change of directory name in master. my understanding of @rhc54's previous reply is that if your slurm script does
then you obviously cannot use the slurm jobid as a job family, otherwise both Open MPI jobs would end up with the same job family. |
I'm trying to find out how the caller of srun can get the stepid. |
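One possible avenue, stated as an assumption rather than anything this PR or the PLM does: processes launched inside a step can inspect the environment SLURM exports (SLURM_JOB_ID, and SLURM_STEP_ID for tasks within a step); whether the caller of srun itself can learn the step id this way is exactly the open question:

```c
#include <stdio.h>
#include <stdlib.h>

/* Hedged sketch: print the allocation id and step id SLURM exports to a
 * launched process.  This only shows what the launched side sees; the
 * caller of srun may not have SLURM_STEP_ID set at all. */
int main(void)
{
    const char *jobid  = getenv("SLURM_JOB_ID");
    const char *stepid = getenv("SLURM_STEP_ID");

    printf("SLURM_JOB_ID : %s\n", jobid  ? jobid  : "(not set)");
    printf("SLURM_STEP_ID: %s\n", stepid ? stepid : "(not set)");
    return 0;
}
```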
@rhc54 i am fine with taking some time to think about that. my understanding of the current situation is
we recently got a bug report, and some of us, including me, went too fast in putting the blame on a ... imho, i'd rather make sure we never have to worry about collisions any more, hence this relatively non-intrusive commit that could be backported to v2.x and v1.10. in the longer term, and as you previously suggested, |
@artpol84 i am sure there is a way to do that (ideally it would be straightforward; worst-case scenario, we have to be (very) creative). |
What is the problem with |
currently,
also, i am pretty sure we assume stepid of if two distinct jobs share the same job family, we have to
for |
@artpol84 i never thought of using random numbers for the jobid or stepid (pid-to-jobfam hashing in Open MPI is a deterministic function, stepids are created in sequence). if you submit slurm jobs |
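For readers unfamiliar with that scheme, here is a sketch of the kind of deterministic hashing being referred to; it is illustrative only, not Open MPI's actual orte_plm_base code. Per-mpirun data such as the node name and the HNP pid are folded into a 16-bit jobfam, which is why two different mpiruns can in principle collide:

```c
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* Illustrative stand-in for the deterministic "pid to jobfam" hashing:
 * fold the hostname and the HNP pid into a 16-bit job family.  Two
 * mpiruns hashing to the same 16-bit value is the collision discussed
 * in this thread. */
static uint16_t jobfam_hash(const char *nodename, pid_t pid)
{
    uint32_t h = 5381;                      /* djb2-style string hash */
    for (const char *p = nodename; *p != '\0'; p++) {
        h = (h * 33) ^ (uint32_t)*p;
    }
    h ^= (uint32_t)pid;
    return (uint16_t)(h & 0xFFFF);          /* only 16 bits survive */
}

int main(void)
{
    const char *host = "node001";           /* placeholder hostname */
    printf("jobfam = %u\n", (unsigned)jobfam_hash(host, getpid()));
    return 0;
}
```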
I've said "random" because the current local ID assignment reminds me of a PRNG. |
Correction - jobid assignment. |
I agree that the demand of having orteds assigned a unique jobid conflicts with the fact that you do |
Please add a Signed-off-by line to this PR's commit.
@ggouaillardet, what's the plan here? |
force-pushed from 681e9ff to 13932a3
currently, top_session_dir is based on hostname, uid, HNP pid and job family. there is a risk a top_session_dir exists and contains some old data when a job starts, leading to undefined behavior. if the app is started via mpirun, use a unique top_session_dir created with mkdtemp("$TMP/ompi.<hostname>.<uid>/XXXXXX") that is passed to fork'ed MPI tasks via the OPAL_MCA_PREFIX"orte_top_session_dir" environment variable. if the app is direct launched, then the current behavior is unchanged. direct launch behavior will be enhanced when PMIx is able to pass a per-node directory (PMIX_NSDIR ?) to a direct launched task. Signed-off-by: Gilles Gouaillardet <[email protected]>
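A minimal sketch of the pattern the commit message describes, with error handling simplified; the environment variable name assumes OPAL_MCA_PREFIX expands to "OMPI_MCA_", and the parent directory handling is omitted:

```c
#define _XOPEN_SOURCE 700
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Sketch of the mpirun-side behavior described in the commit message:
 * create a unique top_session_dir with mkdtemp() and export it so that
 * fork'ed MPI tasks pick it up.  Simplified; not the actual ORTE code. */
int main(void)
{
    const char *tmp = getenv("TMPDIR");
    if (NULL == tmp) {
        tmp = "/tmp";
    }

    char hostname[256];
    gethostname(hostname, sizeof(hostname));
    hostname[sizeof(hostname) - 1] = '\0';

    /* template must end in XXXXXX; mkdtemp replaces it and creates the
     * directory atomically with mode 0700 */
    char template_dir[4096];
    snprintf(template_dir, sizeof(template_dir), "%s/ompi.%s.%u/XXXXXX",
             tmp, hostname, (unsigned)getuid());

    /* NOTE: the parent "$TMP/ompi.<hostname>.<uid>" must already exist;
     * the real code would create it first (omitted here for brevity). */
    if (NULL == mkdtemp(template_dir)) {
        perror("mkdtemp");
        return 1;
    }

    /* pass the unique directory to children, as the commit message describes */
    setenv("OMPI_MCA_orte_top_session_dir", template_dir, 1);
    printf("top_session_dir = %s\n", template_dir);
    return 0;
}
```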
force-pushed from 13932a3 to f0c0c33
no consensus was reached, so archiving this PR to https://github.com/open-mpi/ompi/wiki/Archived-PR |