Skip to content
This repository has been archived by the owner on Sep 18, 2024. It is now read-only.

Hotfix for v1.7.1 #2692

Merged
merged 8 commits into from
Jul 19, 2020
Merged
Show file tree
Hide file tree
Changes from 7 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -83,8 +83,8 @@ export class AMLEnvironmentService extends EnvironmentService {
public async refreshEnvironmentsStatus(environments: EnvironmentInformation[]): Promise<void> {
environments.forEach(async (environment) => {
const amlClient = (environment as AMLEnvironmentInformation).amlClient;
if (!amlClient) {
throw new Error('AML client not initialized!');
if (!amlClient) {
return Promise.reject('AML client not initialized!');
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, why throw error doesn't work?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no try catch outside the function, throw this error directly will not be shown in web ui.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you try it? there are a lot of throw error.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I tried it. If we only use throw Error without catch, the error information will only shown in nnictl log stderr instead of webui.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The throw new Error here should work without try catch because there is a catch in nnimanager and this promise is called from nnimanager. It works in other training service. Would you please check why it is not working?

}
const status = await amlClient.updateStatus(environment.status);
switch (status.toUpperCase()) {
Expand All @@ -98,8 +98,7 @@ export class AMLEnvironmentService extends EnvironmentService {
environment.setFinalStatus('SUCCEEDED');
break;
case 'FAILED':
environment.setFinalStatus('FAILED');
break;
return Promise.reject(`AML: job ${environment.jobId} is failed!`);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

besides returning reject, do we need to set final status here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't need to set status for environment here. when return Promise.reject, the experiment become error status, and nniManager stops. this final status is not useful anymore.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got it

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well, the status is not visible so far. But if we expose it to webui in future, it would be a hole.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated.

case 'STOPPED':
case 'STOPPING':
environment.setFinalStatus('USER_CANCELED');
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -121,9 +121,11 @@ export class OpenPaiEnvironmentService extends EnvironmentService {
// RUNNING status is set by runner, and ignore waiting status
break;
case 'SUCCEEDED':
case 'FAILED':
environment.setFinalStatus(jobResponse.state);
break;
case 'FAILED':
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a break above Failed, so that succeeded env won't fail experiment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated.

deferred.reject(`OpenPAI: job ${environment.jobId} is failed!`);
break;
case 'STOPPED':
case 'STOPPING':
environment.setFinalStatus('USER_CANCELED');
Expand Down
11 changes: 6 additions & 5 deletions src/nni_manager/training_service/reusable/trialDispatcher.ts
Original file line number Diff line number Diff line change
Expand Up @@ -526,19 +526,20 @@ class TrialDispatcher implements TrainingService {
}

private async handleStdout(commandData: any): Promise<void> {
const metricPattern: RegExp = /NNISDK_MEb'(?<metrics>.*a?)'$/gm;
const trialLogDir: string = path.join(getExperimentRootDir(), 'trials', commandData["trial"]);
mkDirPSync(trialLogDir);
const trialLogPath: string = path.join(trialLogDir, 'stdout_log_collection.log');
try {
let skipLogging: boolean = false;
if (commandData["tag"] === 'trial' && commandData["msg"] !== undefined) {
const message = commandData["msg"];
const metricsContent: any = message.match(this.NNI_METRICS_PATTERN);
if (metricsContent && metricsContent.groups) {
const message: string = commandData["msg"];
let metricsContent = metricPattern.exec(message);
while (metricsContent && metricsContent.groups) {
const key: string = 'metrics';
const data = metricsContent.groups[key];
const metricData = JSON.parse('"' + data.split('"').join('\\"') + '"');
await this.handleMetricData(commandData["trial"], metricData);
await this.handleMetricData(commandData["trial"], data);
metricsContent = metricPattern.exec(message);
skipLogging = true;
}
}
Expand Down
6 changes: 5 additions & 1 deletion tools/nni_trial_tool/base_channel.py
Original file line number Diff line number Diff line change
Expand Up @@ -57,7 +57,11 @@ def open(self):

def close(self):
self.is_running = False
self._inner_close()
try:
self._inner_close()
except Exception as err:
# ignore any error on closing
print("error on closing channel: %s" % err)

def send(self, command, data):
"""Send command to Training Service.
Expand Down
4 changes: 2 additions & 2 deletions tools/nni_trial_tool/web_channel.py
Original file line number Diff line number Diff line change
Expand Up @@ -33,9 +33,9 @@ def _inner_open(self):
def _inner_close(self):
if self.client is not None:
self.client.close()
if self._event_loop.is_running():
self._event_loop.close()
self.client = None
if self._event_loop.is_running():
self._event_loop.stop()
self._event_loop = None

def _inner_send(self, message):
Expand Down