-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Support remote training service use reuse mode #2923
Conversation
Merge master
merge master
merge master
merge master
merge master
merge master
merge master
Chinese translation (microsoft#2458)
merge master
merge master
merge master
merge master
merge master
merge master
merge master
merge master
merge master
merge master
merge master
|
||
#### reuse | ||
|
||
Optional. Set if use trial_runner to maintan multiple trial. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is also not clear. you can describe the benefit when reuse
is set True
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated.
@@ -592,6 +592,14 @@ Specifies the pre-command that will be executed before the remote machine execut | |||
|
|||
__Note__: Because __preCommand__ will execute before other commands each time, it is strongly not recommended to set __preCommand__ that will make changes to system, i.e. `mkdir` or `touch`. | |||
|
|||
### remoteConfig |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is a little strange to have both "machineList" and "remoteConfig" in the same level
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
machineList is a list type field, I considered merge machineList under remoteConfig, but it may cause compatibility problem.
public abstract stopEnvironment(environment: EnvironmentInformation): Promise<void>; | ||
public abstract startEnvironment(environment: EnvironmentInformation): Promise<void>; | ||
public abstract getInitializeEnvironmentNumber(): number; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- How about merge this function with hasStorageService, environmentMaintenceLoopInterval? they can return a structure.
- getInitializeEnvironmentNumber is a weired name, how about prefetchedEnvironmentCount?
- refer to hasMoreEnvironments to provide a default implementation, and return 0. So that other service don't need to repeat to return 0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed to use prefetchedEnvironmentCount(). I think it is better split these three data, for different platform has different data type and value. for example, openPAI only need to set hasStorageService, while remote need to set prefetchedEnvironmentCount, split them is more flexible.
@@ -578,6 +578,15 @@ class TrialDispatcher implements TrainingService { | |||
|
|||
} | |||
} | |||
|
|||
private async initEnvironments(): Promise<void> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
prefetchEnvironments may be more accurate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated.
const number = environmentService.getInitializeEnvironmentNumber(); | ||
this.log.info(`Initialize environments total number: ${number}`); | ||
for (let index = 0; index < number; index++) { | ||
await this.requestEnvironment(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It may need to be async call, and wait together. So that it can be faster.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a look on code, remote machine starts all envs in config, so it has no effect. But others may implement with a sync way, so it should be async, and wait all here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
didn't get the point, do you mean await Promise.all()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agree to do it concurrently (Promise.all) if possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remote machine is scheduled sequentially, if request environment simultaneously, there might be potential issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if it can be initialized sequentially, it looks no benefit to initialize before running. If there are 1000 machines, it may need 10 minutes, and first runner timeout already.
|
||
constructor(remoteMachineMetaList: RemoteMachineMeta[]) { | ||
assert(remoteMachineMetaList.length > 0); | ||
this.remoteMachineMetaOccupiedMap = new Map<RemoteMachineMeta, boolean>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks only maintain a map. so it doesn't need to be a separated class. Should be in enviornment service.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
keep a extra class is more clearly, while it might cause a little complex in code. Removed in current version, could add them back if have more requirement in the future.
} | ||
} | ||
|
||
private async launchEnvironmentOnScheduledMachine(environment: RemoteMachineEnvironmentInformation): Promise<void> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the name is strange. how about prepareEnvironment or launchTrialRunner?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
prepareEnvironment is already used, this function is used to launcher an environment process, changed to use launcherEnvironment()
* If a trial is finished, release the connection resource | ||
* @param trial remote machine trial job detail | ||
*/ | ||
public releaseTrialResource(environment: EnvironmentInformation): void { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should be called like releaseEnvironment or stopRunner. Trial is confusing here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated.
private readonly machineExecutorManagerMap: Map<RemoteMachineMeta, ExecutorManager>; | ||
private readonly machineCopyExpCodeDirPromiseMap: Map<RemoteMachineMeta, Promise<void>>; | ||
private readonly environmentExecutorManagerMap: Map<string, ExecutorManager>; | ||
private readonly environmentJobsMap: Map<string, RemoteMachineEnvironmentInformation>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
environmentJobsMap may be unused, and can be removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed
deferred.resolve(false); | ||
} else { | ||
environment.rmMachineMeta = rmMachineMeta; | ||
const copyExpCodeDirPromise = this.machineCopyExpCodeDirPromiseMap.get( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it still useful?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, this promise object is used to copy data to remote machine, and initialized in setClusterMetadata() step. need to use this object to workaround, otherwise setClusterMetadata will cost much time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
couldn't find setClusterMetadata, is it replaced by sshConnectionPromises?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed
src/nni_manager/training_service/reusable/environments/remoteEnvironmentService.ts
Show resolved
Hide resolved
this.experimentRootDir = getExperimentRootDir(); | ||
this.experimentId = getExperimentId(); | ||
this.log = getLogger(); | ||
this.log.info('Construct remote machine training service.'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line of log is useless. If it's needed, it means one line log should be added at higher level.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed
return undefined; | ||
} | ||
|
||
private recycleMachineReservation(rmMeta: RemoteMachineMeta): void { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's called by one function only, and the name is confused with caller. don't need to be a seperated function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
const isAlive = await executor.isProcessAlive(jobpidPath); | ||
// if the process of jobpid is not alive any more | ||
if (!isAlive) { | ||
this.log.info(`pid in ${jobpidPath} is not alive!`); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It may need to log remote machine information, or it's hard to find which node is not alive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
if (!isAlive) { | ||
this.log.info(`pid in ${jobpidPath} is not alive!`); | ||
if (fs.existsSync(trialReturnCodeFilePath)) { | ||
const trialReturnCode: string = await executor.getRemoteFileContent(trialReturnCodeFilePath); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
change names trial
to runner
. It's confusing in the new infra.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
* If a environment is finished, release the connection resource | ||
* @param environment remote machine environment job detail | ||
*/ | ||
public releaseEnvironmentResource(environment: EnvironmentInformation): void { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
private function? If it's proactive stopping, kill the trial runner.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
} | ||
} | ||
|
||
private async launchEnvironment(environment: RemoteMachineEnvironmentInformation): Promise<void> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
launchEnvironment is similar with startEnvironment. launchRunner or other name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
const remoteEnvironment: RemoteMachineEnvironmentInformation = environment as RemoteMachineEnvironmentInformation; | ||
remoteEnvironment.status = 'WAITING'; | ||
await this.prepareEnvironment(remoteEnvironment); | ||
await this.launchEnvironment(remoteEnvironment); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's called here, and inside of prepareEnvironment also. Is it expected? The logic can be clearer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
refactored.
} | ||
|
||
private async prepareEnvironment(environment: RemoteMachineEnvironmentInformation): Promise<boolean> { | ||
const deferred: Deferred<boolean> = new Deferred<boolean>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not necessary to use deferred
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated.
} | ||
|
||
public async refreshEnvironmentsStatus(environments: EnvironmentInformation[]): Promise<void> { | ||
environments.forEach(async (environment) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it better to refresh all environment concurrently?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
await executor.killChildProcesses(jobpidPath); | ||
this.releaseEnvironmentResource(environment); | ||
} catch (error) { | ||
this.log.error(`remoteTrainingService.cancelTrialJob: ${error}`); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the error message is for cancelTrialJob
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
PR is updated to remove gpuScheduler, and start trial_runner process in all machines when initialize process.