Skip to content
This repository has been archived by the owner on Sep 18, 2024. It is now read-only.

New experiment config #3138

Merged
merged 34 commits into from
Dec 20, 2020
Merged

New experiment config #3138

merged 34 commits into from
Dec 20, 2020

Conversation

liuzhe-lz
Copy link
Contributor

@liuzhe-lz liuzhe-lz commented Nov 29, 2020

This PR defines the new experiment config schema.

Example Usage

Following code snippet will generate a valid config.yaml file:

from ruamel import yaml
from nni.experiment.config import ExperimentConfig, AlgorithmConfig, LocalConfig, convert

config = ExperimentConfig(
    experiment_name = 'MNIST',
    trial_concurrency = 1,
    max_experiment_duration = '10m',
    max_trial_number = 10,
    search_space_file = 'search_space.json',
    tuner = AlgorithmConfig(name='TPE', class_args={'optimize_mode': 'maximize'}),
    trial_command = 'python3 mnist.py',
    trial_code_directory = '.',
    training_service = LocalConfig(use_active_gpu=True)
)

config = convert.to_old_yaml(config)
yaml.dump(config, open('config.yaml', 'w'))

Or alternatively:

from ruamel import yaml
from nni.experiment.config import ExperimentConfig, convert

config = ExperimentConfig('local')
config.trial_concurrency = 1
config.max_experiment_duration = '10m'
config.max_trial_number = 10
config.search_space = {
    "dropout_rate": {"_type": "uniform", "_value": [0.5, 0.9]},
    "conv_size": {"_type": "choice", "_value": [2, 3, 5, 7]}
}
config.tuner = AlgorithmConfig(name='TPE', class_args={'optimize_mode': 'maximize'})
config.trial_command = 'python3 mnist.py'
config.trial_code_directory = '.'
config.training_service.use_active_gpu = True

config = convert.to_old_yaml(config)
yaml.dump(config, open('config.yaml', 'w'))

Design Notes

The schema should be expressible in both YAML/JSON object and Python dataclass.
In YAML/JSON the keys are expected to be camelCase and in Python they should be snake_case.
This technically is not a problem since Config.load() and Config.json() have routines to do the convert.

I think the consistency with old schema is not a case. Because consistence is to help migration, while we should not push exist users to migrate at all.
The most important goal of this refactor is to reduce users' learning cost. And for exist users, to not learn the cost is zero.
I believe we will continue to support old schema for very long time, maybe until NNI v3.0. So it basically benefits no body to push the migration unless a user is unsatisfied old schema and proactively want a new one.
And in the other hand, the more difference between two schema, the smaller possibility NNI may fail to detect schema version.

Taking Yuge's advice, the config is divided into platform-independent part and nested training service config.
Trial config is no longer nested because I can't find a reason to do so. While tuner/assessor/advisor config is still nested because the same format may repeats for more than one algorithms.
There is a problem that whether training service config should share the same key. Using same key is definitely better in Python, but using "localCofing" / "remoteConfig" may looks better in YAML.
The problem might affect heterogeneous training service. If training services use same key "training service config", it will be a object/dict for heterogeneous. And if they use different keys, there will be multiple training service config objects.

@liuzhe-lz liuzhe-lz marked this pull request as draft November 30, 2020 00:53
@liuzhe-lz liuzhe-lz mentioned this pull request Nov 30, 2020
77 tasks
@liuzhe-lz liuzhe-lz marked this pull request as ready for review November 30, 2020 21:58
@SparkSnail
Copy link
Contributor

Does this pr support heterogeneousConfig?


If not specified, this will be the default IPv4 address of outgoing connection.

// there is an easy way to get the IP mentioned above, I will remove netifaces denendency
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: dependency

use annotation
--------------

Enable `annotation <https://nni.readthedocs.io/en/stable/Tutorial/AnnotationSpec.html>`_.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest relative path

tuner
-----

Specify the tuner [TODO:link]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we have the link now?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't find a doc mention built-in tuner and custom tuner at same time.

training service
----------------

Specify `training service <https://nni.readthedocs.io/en/stable/TrainingService/Overview.html>`_.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here

This must be an absolute path.


open pai config
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think openpai is a word (at least OpenPAI is).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So it will be openpaiConfig and openpai_config in yaml and python respectively. I think it looks weird. Discuss it in meeting.


If your are using desktop system with GUI, set this to ``True``.

// need to discuss default value
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the default value here?


If set to zero, trials will have no access to any GPU.

If not specified, trials will be created and scheduled as if they do not use GPU,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the conclusion from last meeting? What does -1, 0, None mean respectively?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No change from v1.

@@ -76,7 +76,7 @@ def to_v1_yaml(config: ExperimentConfig, skip_nnictl: bool = False) -> Dict[str,
data['remoteConfig'] = {'reuse': ts['reuseMode']}
data['machineList'] = []
for machine in ts['machineList']:
machine = {
data['machineList'].append({
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about port?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mistake

data['trial']['image'] = ts['docker_image']
data['trial']['nniManagerNFSMountPath'] = ts['local_storage_mount_point']
data['trial']['containerNFSMountPath'] = ts['container_storage_mount_point']
data['paiConfig'] = {
'userName': ts['username'],
'token': ts['token'],
'host': 'https://' + ts['host'],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The host does not contains prefix? how about http://?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean there are cases that OpenPAI cluster supports only HTTP and not HTTPS?

ret.append({'frameworkcontroller_config': experiment_config['frameworkcontrollerConfig']})

elif config.training_service.platform == 'adl':
pass
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why pass adl?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

port: int = 22
user: str
password: Optional[str] = None
ssh_key_file: PathLike = '~/.ssh/id_rsa'
Copy link
Contributor

@SparkSnail SparkSnail Dec 15, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ssh_key is optional key, is used to connect remote machine, default is not '~/.ssh/id_rsa'. I think set it to None as default is better.

Copy link
Contributor Author

@liuzhe-lz liuzhe-lz Dec 15, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As said in doc, the behavior is:

  • when password is explicitly set, use password
  • when password is not set and identity file is explicitly set, use specified identify file
  • when both are not set, use default identity file of openssh

I expect this behavior because when using openssh directly, I don't need to specify anything when I put identify file at ~/.ssh/id_rsa.
I know it can be other algorithms like id_edXXX though, but I believe they are rarely used.

@liuzhe-lz liuzhe-lz merged commit a1016a6 into microsoft:master Dec 20, 2020
@liuzhe-lz liuzhe-lz deleted the config branch March 17, 2021 22:28
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants