This repository has been archived by the owner on Sep 18, 2024. It is now read-only.

Commit

Merge pull request #187 from microsoft/master
merge master
SparkSnail authored Jun 24, 2019
2 parents 93dd76b + 97829cc commit 1500458
Showing 57 changed files with 809 additions and 556 deletions.
8 changes: 4 additions & 4 deletions docs/en_US/CustomizeTuner.md
Original file line number Diff line number Diff line change
Expand Up @@ -31,7 +31,7 @@ class CustomizedTuner(Tuner):
def __init__(self, ...):
...

def receive_trial_result(self, parameter_id, parameters, value):
def receive_trial_result(self, parameter_id, parameters, value, **kwargs):
'''
Receive trial's final result.
parameter_id: int
Expand All @@ -41,7 +41,7 @@ class CustomizedTuner(Tuner):
# your code implements here.
...

def generate_parameters(self, parameter_id):
def generate_parameters(self, parameter_id, **kwargs):
'''
Returns a set of trial (hyper-)parameters, as a serializable object
parameter_id: int
Expand All @@ -51,15 +51,15 @@ class CustomizedTuner(Tuner):
...
```

`receive_trial_result` receives `parameter_id, parameters, value` as its input parameters. The `value` object the Tuner receives is exactly the same value that the Trial sent.
`receive_trial_result` receives `parameter_id, parameters, value` as its input parameters. The `value` object the Tuner receives is exactly the same value that the Trial sent. If `multiPhase` is set to `true` in the experiment configuration file, an additional `trial_job_id` parameter is passed to `receive_trial_result` and `generate_parameters` through the `**kwargs` parameter.
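
As a rough illustration of the `**kwargs` behaviour described above, the sketch below shows a tuner that works whether or not `trial_job_id` is supplied. The `Tuner` base class is a local stand-in so the snippet runs without NNI installed, and `KwargsAwareTuner`, its `results` list, and the trial id `'abc'` are hypothetical names, not part of the NNI SDK.

```python
class Tuner:
    """Stand-in for the NNI Tuner base class (assumption, so the sketch runs alone)."""


class KwargsAwareTuner(Tuner):
    """Hypothetical tuner that tolerates the optional trial_job_id keyword."""

    def __init__(self):
        self.results = []  # (parameter_id, trial_job_id, value) tuples

    def generate_parameters(self, parameter_id, **kwargs):
        # trial_job_id is only passed when multiPhase is enabled, so use .get().
        trial_job_id = kwargs.get('trial_job_id')
        return {'learning_rate': 0.01, 'requested_by': trial_job_id}

    def receive_trial_result(self, parameter_id, parameters, value, **kwargs):
        self.results.append((parameter_id, kwargs.get('trial_job_id'), value))


tuner = KwargsAwareTuner()
params = tuner.generate_parameters(0)                             # multiPhase off
tuner.receive_trial_result(0, params, 0.93)
tuner.receive_trial_result(1, params, 0.95, trial_job_id='abc')   # multiPhase on
print(tuner.results)
```

Reading the id with `kwargs.get` rather than a named parameter keeps the same tuner usable in both single-phase and multi-phase experiments.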

The `your_parameters` value returned from `generate_parameters` is packaged as a JSON object by the NNI SDK. The SDK then unpacks the JSON object, so the Trial receives exactly the same `your_parameters` that the Tuner returned.

For example:
If you implement `generate_parameters` like this:

```python
def generate_parameters(self, parameter_id):
def generate_parameters(self, parameter_id, **kwargs):
'''
Returns a set of trial (hyper-)parameters, as a serializable object
parameter_id: int
Expand Down
10 changes: 9 additions & 1 deletion docs/en_US/MultiPhase.md
Original file line number Diff line number Diff line change
Expand Up @@ -38,7 +38,15 @@ To enable multi-phase, you should also add `multiPhase: true` in your experiment

### Write a tuner that leverages multi-phase:

Before writing a multi-phase tuner, we highly suggest you to go through [Customize Tuner](https://nni.readthedocs.io/en/latest/Customize_Tuner.html). Different from writing a normal tuner, your tuner needs to inherit from `MultiPhaseTuner` (in nni.multi_phase_tuner). The key difference between `Tuner` and `MultiPhaseTuner` is that the methods in MultiPhaseTuner are aware of additional information, that is, `trial_job_id`. With this information, the tuner could know which trial is requesting a configuration, and which trial is reporting results. This information provides enough flexibility for your tuner to deal with different trials and different phases. For example, you may want to use the trial_job_id parameter of generate_parameters method to generate hyperparameters for a specific trial job.
Before writing a multi-phase tuner, we highly recommend going through [Customize Tuner](https://nni.readthedocs.io/en/latest/Customize_Tuner.html). As with a normal tuner, your tuner needs to inherit from the `Tuner` class. When you enable multi-phase through configuration (set `multiPhase` to true), your tuner receives an additional parameter, `trial_job_id`, in the following methods:
```
generate_parameters
generate_multiple_parameters
receive_trial_result
receive_customized_trial_result
trial_end
```
With this information, the tuner can tell which trial is requesting a configuration and which trial is reporting results. This gives your tuner enough flexibility to handle different trials and different phases. For example, you can use the `trial_job_id` parameter of the `generate_parameters` method to generate hyperparameters for a specific trial job.
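
One way to use `trial_job_id` for per-trial decisions is sketched below. This is an illustration only: the `Tuner` class is a local stand-in for the NNI base class, and `PerTrialTuner`, the learning-rate halving rule, and the ids `'trial-A'` and `'trial-B'` are assumptions made for the example.

```python
from collections import defaultdict


class Tuner:
    """Stand-in for the NNI Tuner base class (assumption, so the sketch runs alone)."""


class PerTrialTuner(Tuner):
    """Hypothetical tuner that halves the learning rate on each new phase of a trial."""

    def __init__(self, initial_lr=0.1):
        self.initial_lr = initial_lr
        self.phases = defaultdict(int)  # trial_job_id -> number of phases served

    def generate_parameters(self, parameter_id, **kwargs):
        trial_job_id = kwargs.get('trial_job_id')
        phase = self.phases[trial_job_id]
        self.phases[trial_job_id] += 1
        return {'learning_rate': self.initial_lr / (2 ** phase), 'phase': phase}

    def receive_trial_result(self, parameter_id, parameters, value, **kwargs):
        pass  # a real tuner would record value against kwargs.get('trial_job_id')


tuner = PerTrialTuner()
first = tuner.generate_parameters(0, trial_job_id='trial-A')
second = tuner.generate_parameters(1, trial_job_id='trial-A')
other = tuner.generate_parameters(2, trial_job_id='trial-B')
print(first, second, other)
```

The same trial gets a different configuration on its second request, while a different trial starts again from phase 0.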

Of course, to use your multi-phase tuner, __you should add `multiPhase: true` to your experiment YAML configuration file__.

Expand Down
9 changes: 7 additions & 2 deletions docs/en_US/WebUI.md
Original file line number Diff line number Diff line change
Expand Up @@ -8,6 +8,7 @@ Click the tab "Overview".
* Support to download the experiment result.
* Support to export nni-manager and dispatcher log file.
* If you have any questions, you can click "Feedback" to report them.
* If your experiment has more than 1000 trials, you can change the refresh interval here.

![](../img/webui-img/over1.png)
* See good performance trials.
Expand Down Expand Up @@ -58,6 +59,10 @@ Click the tab "Trials Detail" to see the status of the all trials. Specifically:

![](../img/webui-img/addColumn.png)

* If you want to compare some trials, you can select them and then click "Compare" to see the results.

![](../img/webui-img/compare.png)

* You can use the button named "Copy as python" to copy trial's parameters.

![](../img/webui-img/copyParameter.png)
Expand All @@ -69,6 +74,6 @@ Click the tab "Trials Detail" to see the status of the all trials. Specifically:

* Kill: you can kill a job whose status is running.
* Support to search for a specific trial.
* Intermediate Result Graph.
* Intermediate Result Graph: you can see default and other keys in this graph.

![](../img/intermediate.png)
![](../img/webui-img/intermediate.png)
Binary file removed docs/img/intermediate.png
Binary file not shown.
Binary file modified docs/img/webui-img/addColumn.png
Binary file added docs/img/webui-img/compare.png
Binary file modified docs/img/webui-img/copyParameter.png
Binary file modified docs/img/webui-img/detail-local.png
Binary file modified docs/img/webui-img/detail-pai.png
Binary file added docs/img/webui-img/intermediate.png
Binary file modified docs/img/webui-img/over1.png
4 changes: 2 additions & 2 deletions examples/tuners/ga_customer_tuner/customer_tuner.py
Original file line number Diff line number Diff line change
Expand Up @@ -79,7 +79,7 @@ def __init__(self, optimize_mode, population_size = 32):
logger.debug('init population done.')
return

def generate_parameters(self, parameter_id):
def generate_parameters(self, parameter_id, **kwargs):
"""Returns a set of trial graph config, as a serializable object.
parameter_id : int
"""
Expand Down Expand Up @@ -109,7 +109,7 @@ def generate_parameters(self, parameter_id):
return temp


def receive_trial_result(self, parameter_id, parameters, value):
def receive_trial_result(self, parameter_id, parameters, value, **kwargs):
'''
Record an observation of the objective function
parameter_id : int
Expand Down
4 changes: 2 additions & 2 deletions examples/tuners/random_nas_tuner/random_nas_tuner.py
Original file line number Diff line number Diff line change
Expand Up @@ -49,12 +49,12 @@ def update_search_space(self, search_space):
self.searchspace_json = search_space
self.random_state = np.random.RandomState()

def generate_parameters(self, parameter_id):
def generate_parameters(self, parameter_id, **kwargs):
'''generate
'''
return random_archi_generator(self.searchspace_json, self.random_state)

def receive_trial_result(self, parameter_id, parameters, value):
def receive_trial_result(self, parameter_id, parameters, value, **kwargs):
'''receive
'''
pass
Original file line number Diff line number Diff line change
Expand Up @@ -112,7 +112,7 @@ def init_population(self, population_size, graph_max_layer, graph_min_layer):
population.append(Individual(indiv_id=self.generate_new_id(), graph_cfg=graph_tmp, result=None))
return population

def generate_parameters(self, parameter_id):
def generate_parameters(self, parameter_id, **kwargs):
"""Returns a set of trial graph config, as a serializable object.
An example configuration:
```json
Expand Down Expand Up @@ -196,7 +196,7 @@ def generate_parameters(self, parameter_id):
logger.debug("trial {} ready".format(indiv.indiv_id))
return param_json

def receive_trial_result(self, parameter_id, parameters, value):
def receive_trial_result(self, parameter_id, parameters, value, **kwargs):
'''
Record an observation of the objective function
parameter_id : int
Expand Down
2 changes: 1 addition & 1 deletion src/nni_manager/common/utils.ts
Original file line number Diff line number Diff line change
Expand Up @@ -375,7 +375,7 @@ function countFilesRecursively(directory: string, timeoutMilliSeconds?: number):
}

function validateFileName(fileName: string): boolean {
let pattern: string = '^[a-z0-9A-Z\.-_]+$';
let pattern: string = '^[a-z0-9A-Z\._-]+$';
const validateResult = fileName.match(pattern);
if(validateResult) {
return true;
Expand Down
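
The one-character reordering in `validateFileName` matters because `.-_` inside a character class is read as a range from `.` (0x2E) to `_` (0x5F), not as three literals. The sketch below reproduces both patterns with Python's `re` module, which treats this character class the same way; the `is_valid` helper and the sample file names are hypothetical.

```python
import re

OLD_PATTERN = r'^[a-z0-9A-Z\.-_]+$'  # ".-_" is a range covering "/", ":", "@", ...
NEW_PATTERN = r'^[a-z0-9A-Z\._-]+$'  # trailing "-" is a literal hyphen


def is_valid(pattern, file_name):
    """Return True when the whole file name matches the pattern."""
    return re.match(pattern, file_name) is not None


for name in ['model-v1.txt', 'dir/file', 'a_b.cfg']:
    print(name, is_valid(OLD_PATTERN, name), is_valid(NEW_PATTERN, name))
```

With the old pattern a hyphenated name is rejected and a name containing `/` is accepted; the new pattern reverses both outcomes.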
Original file line number Diff line number Diff line change
Expand Up @@ -58,6 +58,10 @@ export abstract class ClusterJobRestServer extends RestServer {
this.port = basePort + 1;
}

get apiRootUrl(): string {
return this.API_ROOT_URL;
}

public get clusterRestServerPort(): number {
if (this.port === undefined) {
throw new Error('PAI Rest server port is undefined');
Expand Down Expand Up @@ -87,7 +91,7 @@ export abstract class ClusterJobRestServer extends RestServer {
protected abstract handleTrialMetrics(jobId : string, trialMetrics : any[]) : void;

// tslint:disable: no-unsafe-any no-any
private createRestHandler() : Router {
protected createRestHandler() : Router {
const router: Router = Router();

router.use((req: Request, res: Response, next: any) => {
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -355,7 +355,8 @@ class LocalTrainingService implements TrainingService {
this.log.info('Stopping local machine training service...');
this.stopping = true;
for (const stream of this.jobStreamMap.values()) {
stream.destroy();
stream.end(0)
stream.emit('end')
}
if (this.gpuScheduler !== undefined) {
await this.gpuScheduler.stop();
Expand All @@ -372,7 +373,9 @@ class LocalTrainingService implements TrainingService {
if (stream === undefined) {
throw new Error(`Could not find stream in trial ${trialJob.id}`);
}
stream.destroy();
//Refer https://github.com/Juul/tail-stream/issues/20
stream.end(0)
stream.emit('end')
this.jobStreamMap.delete(trialJob.id);
}
}
Expand Down Expand Up @@ -567,7 +570,6 @@ class LocalTrainingService implements TrainingService {
buffer = remain;
}
});

this.jobStreamMap.set(trialJobDetail.id, stream);
}

Expand Down
8 changes: 4 additions & 4 deletions src/nni_manager/training_service/pai/paiData.ts
Original file line number Diff line number Diff line change
Expand Up @@ -64,11 +64,11 @@ else
fi`;

export const PAI_TRIAL_COMMAND_FORMAT: string =
`export NNI_PLATFORM=pai NNI_SYS_DIR={0} NNI_OUTPUT_DIR={1} NNI_TRIAL_JOB_ID={2} NNI_EXP_ID={3} NNI_TRIAL_SEQ_ID={4} \
`export NNI_PLATFORM=pai NNI_SYS_DIR={0} NNI_OUTPUT_DIR={1} NNI_TRIAL_JOB_ID={2} NNI_EXP_ID={3} NNI_TRIAL_SEQ_ID={4} MULTI_PHASE={5} \
&& cd $NNI_SYS_DIR && sh install_nni.sh \
&& python3 -m nni_trial_tool.trial_keeper --trial_command '{5}' --nnimanager_ip '{6}' --nnimanager_port '{7}' \
--pai_hdfs_output_dir '{8}' --pai_hdfs_host '{9}' --pai_user_name {10} --nni_hdfs_exp_dir '{11}' --webhdfs_path '/webhdfs/api/v1' \
--nni_manager_version '{12}' --log_collection '{13}'`;
&& python3 -m nni_trial_tool.trial_keeper --trial_command '{6}' --nnimanager_ip '{7}' --nnimanager_port '{8}' \
--pai_hdfs_output_dir '{9}' --pai_hdfs_host '{10}' --pai_user_name {11} --nni_hdfs_exp_dir '{12}' --webhdfs_path '/webhdfs/api/v1' \
--nni_manager_version '{13}' --log_collection '{14}'`;

export const PAI_OUTPUT_DIR_FORMAT: string =
`hdfs://{0}:9000/`;
Expand Down
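
Adding `MULTI_PHASE={5}` in the middle of a positional template shifts every later placeholder by one, which is why `{5}` through `{13}` become `{6}` through `{14}`. The sketch below shows the effect with Python's `str.format` on shortened templates; the templates and argument values are made-up stand-ins, and how the real TypeScript template is filled is not shown in this hunk.

```python
# Shortened stand-ins for the real command template (assumption: three or four fields).
OLD_TEMPLATE = "NNI_EXP_ID={0} NNI_TRIAL_SEQ_ID={1} --trial_command '{2}'"
NEW_TEMPLATE = "NNI_EXP_ID={0} NNI_TRIAL_SEQ_ID={1} MULTI_PHASE={2} --trial_command '{3}'"

old_args = ['exp1', 0, 'python3 mnist.py']
new_args = ['exp1', 0, False, 'python3 mnist.py']  # new value inserted before the command

print(OLD_TEMPLATE.format(*old_args))
print(NEW_TEMPLATE.format(*new_args))

# If the argument list changes but the template is not renumbered,
# the multi-phase flag silently lands in the trial command slot.
print(OLD_TEMPLATE.format(*new_args))
```

Keeping the template indices and the argument order in step is the whole fix; a mismatch produces no error, only a wrong command line.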
38 changes: 38 additions & 0 deletions src/nni_manager/training_service/pai/paiJobRestServer.ts
Original file line number Diff line number Diff line change
Expand Up @@ -19,17 +19,26 @@

'use strict';

import { Request, Response, Router } from 'express';
import { Inject } from 'typescript-ioc';
import * as component from '../../common/component';
import { ClusterJobRestServer } from '../common/clusterJobRestServer';
import { PAITrainingService } from './paiTrainingService';

export interface ParameterFileMeta {
readonly experimentId: string;
readonly trialId: string;
readonly filePath: string;
}

/**
* PAI Training service Rest server, provides rest API to support pai job metrics update
*
*/
@component.Singleton
export class PAIJobRestServer extends ClusterJobRestServer {
private parameterFileMetaList: ParameterFileMeta[] = [];

@Inject
private readonly paiTrainingService : PAITrainingService;

Expand All @@ -52,4 +61,33 @@ export class PAIJobRestServer extends ClusterJobRestServer {
});
}
}

protected createRestHandler(): Router {
const router: Router = super.createRestHandler();

router.post(`/parameter-file-meta`, (req: Request, res: Response) => {
try {
this.log.info(`POST /parameter-file-meta, body is ${JSON.stringify(req.body)}`);
this.parameterFileMetaList.push(req.body);
res.send();
} catch (err) {
this.log.error(`POST parameter-file-meta error: ${err}`);
res.status(500);
res.send(err.message);
}
});

router.get(`/parameter-file-meta`, (req: Request, res: Response) => {
try {
this.log.info(`GET /parameter-file-meta`);
res.send(this.parameterFileMetaList);
} catch (err) {
this.log.error(`GET parameter-file-meta error: ${err}`);
res.status(500);
res.send(err.message);
}
});

return router;
}
}
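
The GET handler above returns the entire accumulated list, so a caller has to pick out the entries that belong to its own trial. How the trial side actually consumes this endpoint is not part of this diff; the sketch below is a hypothetical client-side filter over records shaped like `ParameterFileMeta`, with invented ids and paths.

```python
def latest_parameter_file(metas, experiment_id, trial_id):
    """Return the most recently posted filePath for one trial, or None if absent."""
    matches = [m['filePath'] for m in metas
               if m['experimentId'] == experiment_id and m['trialId'] == trial_id]
    return matches[-1] if matches else None


# Example payload in the order the server appended it (values are made up).
metas = [
    {'experimentId': 'exp1', 'trialId': 't1', 'filePath': '/work/t1/parameter.cfg'},
    {'experimentId': 'exp1', 'trialId': 't2', 'filePath': '/work/t2/parameter.cfg'},
    {'experimentId': 'exp1', 'trialId': 't1', 'filePath': '/work/t1/parameter_1.cfg'},
]

print(latest_parameter_file(metas, 'exp1', 't1'))
print(latest_parameter_file(metas, 'exp1', 't3'))
```

Because the server never removes entries, the list grows for the lifetime of the experiment, so filtering by both experiment and trial id keeps lookups unambiguous.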
70 changes: 64 additions & 6 deletions src/nni_manager/training_service/pai/paiTrainingService.ts
Original file line number Diff line number Diff line change
Expand Up @@ -33,7 +33,7 @@ import { MethodNotImplementedError } from '../../common/errors';
import { getExperimentId, getInitTrialSequenceId } from '../../common/experimentStartupInfo';
import { getLogger, Logger } from '../../common/log';
import {
JobApplicationForm, NNIManagerIpConfig, TrainingService,
HyperParameters, JobApplicationForm, NNIManagerIpConfig, TrainingService,
TrialJobApplicationForm, TrialJobDetail, TrialJobMetric
} from '../../common/trainingService';
import { delay, generateParamFileName,
Expand All @@ -45,7 +45,7 @@ import { HDFSClientUtility } from './hdfsClientUtility';
import { NNIPAITrialConfig, PAIClusterConfig, PAIJobConfig, PAITaskRole } from './paiConfig';
import { PAI_LOG_PATH_FORMAT, PAI_OUTPUT_DIR_FORMAT, PAI_TRIAL_COMMAND_FORMAT, PAITrialJobDetail } from './paiData';
import { PAIJobInfoCollector } from './paiJobInfoCollector';
import { PAIJobRestServer } from './paiJobRestServer';
import { PAIJobRestServer, ParameterFileMeta } from './paiJobRestServer';

import * as WebHDFS from 'webhdfs';

Expand Down Expand Up @@ -79,6 +79,7 @@ class PAITrainingService implements TrainingService {
private copyExpCodeDirPromise?: Promise<void>;
private versionCheck: boolean = true;
private logCollection: string;
private isMultiPhase: boolean = false;

constructor() {
this.log = getLogger();
Expand Down Expand Up @@ -179,12 +180,22 @@ class PAITrainingService implements TrainingService {
return deferred.promise;
}

public updateTrialJob(trialJobId: string, form: JobApplicationForm): Promise<TrialJobDetail> {
throw new MethodNotImplementedError();
public async updateTrialJob(trialJobId: string, form: JobApplicationForm): Promise<TrialJobDetail> {
const trialJobDetail: undefined | TrialJobDetail = this.trialJobsMap.get(trialJobId);
if (trialJobDetail === undefined) {
throw new Error(`updateTrialJob failed: ${trialJobId} not found`);
}
if (form.jobType === 'TRIAL') {
await this.writeParameterFile(trialJobId, (<TrialJobApplicationForm>form).hyperParameters);
} else {
throw new Error(`updateTrialJob failed: jobType ${form.jobType} not supported.`);
}

return trialJobDetail;
}

public get isMultiPhaseJobSupported(): boolean {
return false;
return true;
}

// tslint:disable:no-http-string
Expand Down Expand Up @@ -336,6 +347,9 @@ class PAITrainingService implements TrainingService {
case TrialConfigMetadataKey.LOG_COLLECTION:
this.logCollection = value;
break;
case TrialConfigMetadataKey.MULTI_PHASE:
this.isMultiPhase = (value === 'true' || value === 'True');
break;
default:
//Reject for unknown keys
throw new Error(`Unknown key: ${key}`);
Expand Down Expand Up @@ -445,6 +459,7 @@ class PAITrainingService implements TrainingService {
trialJobId,
this.experimentId,
trialJobDetail.sequenceId,
this.isMultiPhase,
this.paiTrialConfig.command,
nniManagerIp,
this.paiRestServerPort,
Expand Down Expand Up @@ -632,7 +647,50 @@ class PAITrainingService implements TrainingService {
return Promise.race([timeoutDelay, deferred.promise])
.finally(() => { clearTimeout(timeoutId); });
}
// tslint:enable:no-any no-unsafe-any no-http-string

private async writeParameterFile(trialJobId: string, hyperParameters: HyperParameters): Promise<void> {
if (this.paiClusterConfig === undefined) {
throw new Error('PAI Cluster config is not initialized');
}
if (this.paiTrialConfig === undefined) {
throw new Error('PAI trial config is not initialized');
}

const trialLocalTempFolder: string = path.join(getExperimentRootDir(), 'trials-local', trialJobId);
const hpFileName: string = generateParamFileName(hyperParameters);
const localFilepath: string = path.join(trialLocalTempFolder, hpFileName);
await fs.promises.writeFile(localFilepath, hyperParameters.value, { encoding: 'utf8' });
const hdfsCodeDir: string = HDFSClientUtility.getHdfsTrialWorkDir(this.paiClusterConfig.userName, trialJobId);
const hdfsHpFilePath: string = path.join(hdfsCodeDir, hpFileName);

await HDFSClientUtility.copyFileToHdfs(localFilepath, hdfsHpFilePath, this.hdfsClient);

await this.postParameterFileMeta({
experimentId: this.experimentId,
trialId: trialJobId,
filePath: hdfsHpFilePath
});
}

private postParameterFileMeta(parameterFileMeta: ParameterFileMeta): Promise<void> {
const deferred : Deferred<void> = new Deferred<void>();
const restServer: PAIJobRestServer = component.get(PAIJobRestServer);
const req: request.Options = {
uri: `${restServer.endPoint}${restServer.apiRootUrl}/parameter-file-meta`,
method: 'POST',
json: true,
body: parameterFileMeta
};
request(req, (err: Error, res: request.Response) => {
if (err) {
deferred.reject(err);
} else {
deferred.resolve();
}
});

return deferred.promise;
}
}

export { PAITrainingService };
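
The new `writeParameterFile` method follows a three-step flow: write the hyperparameters to a local file, copy that file to shared storage, then announce its location. The sketch below mirrors that flow using only the Python standard library. A local directory stands in for HDFS, a plain list stands in for the REST endpoint, and the function name, file name, and directory layout are assumptions for illustration.

```python
import os
import shutil
import tempfile


def write_parameter_file(local_root, remote_root, trial_job_id, file_name, value, meta_list):
    """Write locally, copy to the shared store, then record where the copy lives."""
    local_dir = os.path.join(local_root, 'trials-local', trial_job_id)
    os.makedirs(local_dir, exist_ok=True)
    local_path = os.path.join(local_dir, file_name)
    with open(local_path, 'w', encoding='utf8') as f:
        f.write(value)

    remote_dir = os.path.join(remote_root, trial_job_id)
    os.makedirs(remote_dir, exist_ok=True)
    remote_path = os.path.join(remote_dir, file_name)
    shutil.copyfile(local_path, remote_path)  # stands in for the HDFS upload

    # Stands in for POSTing the metadata to the REST server.
    meta_list.append({'trialId': trial_job_id, 'filePath': remote_path})
    return remote_path


with tempfile.TemporaryDirectory() as root:
    metas = []
    path = write_parameter_file(os.path.join(root, 'local'), os.path.join(root, 'hdfs'),
                                'trial1', 'parameter_1.cfg', '{"lr": 0.01}', metas)
    with open(path, encoding='utf8') as f:
        copied = f.read()

print(copied, metas[0]['trialId'])
```

Publishing the metadata only after the copy succeeds means a reader never sees a path that does not yet exist in the shared store.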
3 changes: 2 additions & 1 deletion src/nni_manager/types/tail-stream/index.d.ts
Original file line number Diff line number Diff line change
@@ -1,7 +1,8 @@
declare module 'tail-stream' {
export interface Stream {
on(type: 'data', callback: (data: Buffer) => void): void;
destroy(): void;
end(data: number): void;
emit(data: string): void;
}
export function createReadStream(path: string): Stream;
}