Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

orterun: fix for command line parsing #1675

Closed
wants to merge 1 commit into from

Conversation

hjelmn
Copy link
Member

@hjelmn hjelmn commented May 18, 2016

orte-submit calls opal_init_util () before parsing out MCA command line
options (-mca, -am, etc). This prevents mpirun from setting opal MCA
variables for some frameworks as well as the MCA base. This is because
when a framework is opened all of its variables are set to read-only.
Eventually we want to lift this restriction on some MCA variables but
since -mca is affected we must parse out the MCA command line options
before opal_init_util(). This commit fixes the bug by adding a new
option to opal_cmd_line_parse (ignore unknown option) so orte-submit
can pre-parse the command line for MCA options.

Signed-off-by: Nathan Hjelm [email protected]

orte-submit calls opal_init_util () before parsing out MCA command line
options (-mca, -am, etc). This prevents mpirun from setting opal MCA
variables for some frameworks as well as the MCA base. This is because
when a framework is opened all of its variables are set to read-only.
Eventually we want to lift this restriction on some MCA variables but
since -mca is affected we must parse out the MCA command line options
before opal_init_util(). This commit fixes the bug by adding a new
option to opal_cmd_line_parse (ignore unknown option) so orte-submit
can pre-parse the command line for MCA options.

Signed-off-by: Nathan Hjelm <[email protected]>
@hjelmn
Copy link
Member Author

hjelmn commented May 18, 2016

Still kind of on BST time right now so I threw this together. The Jenkins failure is not the only problem with the current way orterun handles command line options. With the current setup MCA variables in frameworks opened by opal_init_util() can not have their MCA variables set with mpirun -mca. @rhc54 This is an ugly fix but is the best we can do. MCA command line options must be parsed before opal_init*. There is no way around it.

@hjelmn
Copy link
Member Author

hjelmn commented May 18, 2016

@jladd-mlnx This should pass mellanox jenkins.

}

OBJ_RELEASE(orte_cmd_line);

Copy link
Member Author

@hjelmn hjelmn May 18, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rhc54 This is the important part of the change. It sets up the MCA command line options and parses just those. It ignores all unknown options and switches. It will catch -mca var value and -am.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW: there is no reason to release the cmd line here - the system now hierarchically builds upon it. I'll give you a PR that should be a little cleaner

@hjelmn
Copy link
Member Author

hjelmn commented May 18, 2016

Here is the output of env_mpi after the change:

mpirun -am ./amca.conf -n 2 ./env_mpi | grep ^XXX
XXX_A=1
XXX_B=2
XXX_C=3
XXX_D=4
XXX_E=5
XXX_A=1
XXX_B=2
XXX_C=3
XXX_D=4
XXX_E=5

@rhc54
Copy link
Contributor

rhc54 commented May 18, 2016

@hjelmn I'm puzzled by your comment about opal_init_util. I can find no frameworks being opened in that code, nor any reason why they would or should be. All that function is supposed to do is initialize the very bare minimum utilities to allow us to process cmd lines (this is why it exists). So what has changed, and where in that code are frameworks being opened?

@hjelmn
Copy link
Member Author

hjelmn commented May 18, 2016

Hmm, thought there was an open of event in there. Guess I was wrong.

I don't think we explicitly disallow mca_base_framework_open in opal_init_util(). Nothing in the documentation but that is out of date from what I can tell. Anyway, since it is technically possible and we don't want to have to go down the path of figuring out why variables in framework X can't be set using mpirun we either need to make it explicit in the documentation of opal_init_util() and then modify the MCA base parameters to be settable after registration (quite a bit of work) or just go back to parsing the MCA command line parameters before opal_init_util(). Eventually we should do the former (I want to expand the variable system to support object binding) but it probably won't happen anytime soon.

As for why this is still failing I am investigating. Can no longer reproduce the problem on my Mac but I will test on an alps/gni and slurm/mlnx system.

@hjelmn
Copy link
Member Author

hjelmn commented May 18, 2016

Also, nothing has changed with opal_init_util(). It has always registered variables. The change is you are now calling opal_init_util() before parsing the command line.

@rhc54
Copy link
Contributor

rhc54 commented May 18, 2016

I have no issue parsing the cmd line for MCA params prior to calling opal_init_util - my concern is that we keep that function as minimal as possible inline with its original purpose. Loading it up with a bunch of stuff defeats its purpose, so if that is happening, then perhaps we need to split the overall opal_init procedure again into yet another step?

I checked the history, and we did not always register variables in that code. This was introduced when the env forwarding was added. We got away with it until the recent change, but it definitely changed the intent of opal_init_util and probably should be moved to a later step (which is what I have done in another PR that I'll update once this is complete).

@ibm-ompi
Copy link

Test passed.

@jsquyres
Copy link
Member

@hjelmn @rhc54 Where did we land on this PR -- are we keeping it as-is, updating it to be a bit different, or ditching it?

@rhc54
Copy link
Contributor

rhc54 commented May 28, 2016

This one should be closed out - an alternative fix was used

@rhc54 rhc54 closed this May 28, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants