Skip to content
This repository has been archived by the owner on Sep 18, 2024. It is now read-only.

Support remote training service use reuse mode #2923

Merged
merged 49 commits into from
Oct 10, 2020

Conversation

SparkSnail
Copy link
Contributor

@SparkSnail SparkSnail commented Sep 23, 2020

PR is updated to remove gpuScheduler, and start trial_runner process in all machines when initialize process.

SparkSnail added 30 commits May 29, 2020 17:02

#### reuse

Optional. Set if use trial_runner to maintan multiple trial.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is also not clear. you can describe the benefit when reuse is set True

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated.

@@ -592,6 +592,14 @@ Specifies the pre-command that will be executed before the remote machine execut

__Note__: Because __preCommand__ will execute before other commands each time, it is strongly not recommended to set __preCommand__ that will make changes to system, i.e. `mkdir` or `touch`.

### remoteConfig
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is a little strange to have both "machineList" and "remoteConfig" in the same level

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

machineList is a list type field, I considered merge machineList under remoteConfig, but it may cause compatibility problem.

public abstract stopEnvironment(environment: EnvironmentInformation): Promise<void>;
public abstract startEnvironment(environment: EnvironmentInformation): Promise<void>;
public abstract getInitializeEnvironmentNumber(): number;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. How about merge this function with hasStorageService, environmentMaintenceLoopInterval? they can return a structure.
  2. getInitializeEnvironmentNumber is a weired name, how about prefetchedEnvironmentCount?
  3. refer to hasMoreEnvironments to provide a default implementation, and return 0. So that other service don't need to repeat to return 0.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed to use prefetchedEnvironmentCount(). I think it is better split these three data, for different platform has different data type and value. for example, openPAI only need to set hasStorageService, while remote need to set prefetchedEnvironmentCount, split them is more flexible.

@@ -578,6 +578,15 @@ class TrialDispatcher implements TrainingService {

}
}

private async initEnvironments(): Promise<void> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

prefetchEnvironments may be more accurate.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated.

const number = environmentService.getInitializeEnvironmentNumber();
this.log.info(`Initialize environments total number: ${number}`);
for (let index = 0; index < number; index++) {
await this.requestEnvironment();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may need to be async call, and wait together. So that it can be faster.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a look on code, remote machine starts all envs in config, so it has no effect. But others may implement with a sync way, so it should be async, and wait all here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

didn't get the point, do you mean await Promise.all()?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree to do it concurrently (Promise.all) if possible.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remote machine is scheduled sequentially, if request environment simultaneously, there might be potential issue?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if it can be initialized sequentially, it looks no benefit to initialize before running. If there are 1000 machines, it may need 10 minutes, and first runner timeout already.


constructor(remoteMachineMetaList: RemoteMachineMeta[]) {
assert(remoteMachineMetaList.length > 0);
this.remoteMachineMetaOccupiedMap = new Map<RemoteMachineMeta, boolean>();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks only maintain a map. so it doesn't need to be a separated class. Should be in enviornment service.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

keep a extra class is more clearly, while it might cause a little complex in code. Removed in current version, could add them back if have more requirement in the future.

}
}

private async launchEnvironmentOnScheduledMachine(environment: RemoteMachineEnvironmentInformation): Promise<void> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the name is strange. how about prepareEnvironment or launchTrialRunner?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

prepareEnvironment is already used, this function is used to launcher an environment process, changed to use launcherEnvironment()

* If a trial is finished, release the connection resource
* @param trial remote machine trial job detail
*/
public releaseTrialResource(environment: EnvironmentInformation): void {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be called like releaseEnvironment or stopRunner. Trial is confusing here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated.

private readonly machineExecutorManagerMap: Map<RemoteMachineMeta, ExecutorManager>;
private readonly machineCopyExpCodeDirPromiseMap: Map<RemoteMachineMeta, Promise<void>>;
private readonly environmentExecutorManagerMap: Map<string, ExecutorManager>;
private readonly environmentJobsMap: Map<string, RemoteMachineEnvironmentInformation>;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

environmentJobsMap may be unused, and can be removed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

deferred.resolve(false);
} else {
environment.rmMachineMeta = rmMachineMeta;
const copyExpCodeDirPromise = this.machineCopyExpCodeDirPromiseMap.get(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it still useful?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, this promise object is used to copy data to remote machine, and initialized in setClusterMetadata() step. need to use this object to workaround, otherwise setClusterMetadata will cost much time.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

couldn't find setClusterMetadata, is it replaced by sshConnectionPromises?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

this.experimentRootDir = getExperimentRootDir();
this.experimentId = getExperimentId();
this.log = getLogger();
this.log.info('Construct remote machine training service.');
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line of log is useless. If it's needed, it means one line log should be added at higher level.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

return undefined;
}

private recycleMachineReservation(rmMeta: RemoteMachineMeta): void {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's called by one function only, and the name is confused with caller. don't need to be a seperated function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

const isAlive = await executor.isProcessAlive(jobpidPath);
// if the process of jobpid is not alive any more
if (!isAlive) {
this.log.info(`pid in ${jobpidPath} is not alive!`);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may need to log remote machine information, or it's hard to find which node is not alive.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

if (!isAlive) {
this.log.info(`pid in ${jobpidPath} is not alive!`);
if (fs.existsSync(trialReturnCodeFilePath)) {
const trialReturnCode: string = await executor.getRemoteFileContent(trialReturnCodeFilePath);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change names trial to runner. It's confusing in the new infra.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

* If a environment is finished, release the connection resource
* @param environment remote machine environment job detail
*/
public releaseEnvironmentResource(environment: EnvironmentInformation): void {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

private function? If it's proactive stopping, kill the trial runner.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

}
}

private async launchEnvironment(environment: RemoteMachineEnvironmentInformation): Promise<void> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

launchEnvironment is similar with startEnvironment. launchRunner or other name?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

const remoteEnvironment: RemoteMachineEnvironmentInformation = environment as RemoteMachineEnvironmentInformation;
remoteEnvironment.status = 'WAITING';
await this.prepareEnvironment(remoteEnvironment);
await this.launchEnvironment(remoteEnvironment);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's called here, and inside of prepareEnvironment also. Is it expected? The logic can be clearer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

refactored.

}

private async prepareEnvironment(environment: RemoteMachineEnvironmentInformation): Promise<boolean> {
const deferred: Deferred<boolean> = new Deferred<boolean>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not necessary to use deferred

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated.

}

public async refreshEnvironmentsStatus(environments: EnvironmentInformation[]): Promise<void> {
environments.forEach(async (environment) => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it better to refresh all environment concurrently?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

await executor.killChildProcesses(jobpidPath);
this.releaseEnvironmentResource(environment);
} catch (error) {
this.log.error(`remoteTrainingService.cancelTrialJob: ${error}`);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the error message is for cancelTrialJob

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

@SparkSnail SparkSnail merged commit 392e55f into microsoft:master Oct 10, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants