-
Notifications
You must be signed in to change notification settings - Fork 885
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
orterun: fix for command line parsing #1675
Conversation
orte-submit calls opal_init_util () before parsing out MCA command line options (-mca, -am, etc). This prevents mpirun from setting opal MCA variables for some frameworks as well as the MCA base. This is because when a framework is opened all of its variables are set to read-only. Eventually we want to lift this restriction on some MCA variables but since -mca is affected we must parse out the MCA command line options before opal_init_util(). This commit fixes the bug by adding a new option to opal_cmd_line_parse (ignore unknown option) so orte-submit can pre-parse the command line for MCA options. Signed-off-by: Nathan Hjelm <[email protected]>
Still kind of on BST time right now so I threw this together. The Jenkins failure is not the only problem with the current way orterun handles command line options. With the current setup MCA variables in frameworks opened by opal_init_util() can not have their MCA variables set with mpirun -mca. @rhc54 This is an ugly fix but is the best we can do. MCA command line options must be parsed before opal_init*. There is no way around it. |
@jladd-mlnx This should pass mellanox jenkins. |
} | ||
|
||
OBJ_RELEASE(orte_cmd_line); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rhc54 This is the important part of the change. It sets up the MCA command line options and parses just those. It ignores all unknown options and switches. It will catch -mca var value and -am.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW: there is no reason to release the cmd line here - the system now hierarchically builds upon it. I'll give you a PR that should be a little cleaner
Here is the output of env_mpi after the change:
|
@hjelmn I'm puzzled by your comment about opal_init_util. I can find no frameworks being opened in that code, nor any reason why they would or should be. All that function is supposed to do is initialize the very bare minimum utilities to allow us to process cmd lines (this is why it exists). So what has changed, and where in that code are frameworks being opened? |
Hmm, thought there was an open of event in there. Guess I was wrong. I don't think we explicitly disallow mca_base_framework_open in opal_init_util(). Nothing in the documentation but that is out of date from what I can tell. Anyway, since it is technically possible and we don't want to have to go down the path of figuring out why variables in framework X can't be set using mpirun we either need to make it explicit in the documentation of opal_init_util() and then modify the MCA base parameters to be settable after registration (quite a bit of work) or just go back to parsing the MCA command line parameters before opal_init_util(). Eventually we should do the former (I want to expand the variable system to support object binding) but it probably won't happen anytime soon. As for why this is still failing I am investigating. Can no longer reproduce the problem on my Mac but I will test on an alps/gni and slurm/mlnx system. |
Also, nothing has changed with opal_init_util(). It has always registered variables. The change is you are now calling opal_init_util() before parsing the command line. |
I have no issue parsing the cmd line for MCA params prior to calling opal_init_util - my concern is that we keep that function as minimal as possible inline with its original purpose. Loading it up with a bunch of stuff defeats its purpose, so if that is happening, then perhaps we need to split the overall opal_init procedure again into yet another step? I checked the history, and we did not always register variables in that code. This was introduced when the env forwarding was added. We got away with it until the recent change, but it definitely changed the intent of opal_init_util and probably should be moved to a later step (which is what I have done in another PR that I'll update once this is complete). |
Test passed. |
This one should be closed out - an alternative fix was used |
orte-submit calls opal_init_util () before parsing out MCA command line
options (-mca, -am, etc). This prevents mpirun from setting opal MCA
variables for some frameworks as well as the MCA base. This is because
when a framework is opened all of its variables are set to read-only.
Eventually we want to lift this restriction on some MCA variables but
since -mca is affected we must parse out the MCA command line options
before opal_init_util(). This commit fixes the bug by adding a new
option to opal_cmd_line_parse (ignore unknown option) so orte-submit
can pre-parse the command line for MCA options.
Signed-off-by: Nathan Hjelm [email protected]