This repository has been archived by the owner on Sep 18, 2024. It is now read-only.
Remove duplicate data under /tmp folder, and other small changes. #2484
Merged
QuanluZhang
reviewed
Jun 8, 2020
- maxTrialNum: 2
- trialConcurrency: 2
+ maxTrialNum: 1
+ trialConcurrency: 1
Can we add a download script at the beginning of each IT?
QuanluZhang
approved these changes
Jun 29, 2020
chicm-ms
approved these changes
Jun 29, 2020
squirrelsc
added a commit
that referenced
this pull request
Jun 30, 2020
Designed a new interface to support a reusable training service. It currently applies only to OpenPAI and is disabled by default.

- Replace trial_keeper.py with trial_runner.py. trial_runner holds an environment, receives commands from the NNI manager to run or stop a trial, and returns events to the NNI manager.
- Add a trial dispatcher, which inherits from the original training service interface. It is used to share as much code as possible across all training services while staying isolated from them.
- Add an EnvironmentService interface to manage environments, including starting and stopping an environment and refreshing the status of environments.
- Add a command channel on both the NNI manager and trial runner sides. It supports different ways of passing messages between them. The currently supported channels are file and web sockets. The supported commands from the NNI manager are start, kill trial, and send new parameters. The supported commands from the runner are initialized (to support channels that do not know which runner is connected), trial end, stdout (new type, including metrics as before), version check (new type), and gpu info (new type).
- Add a storage service that wraps a storage backend, such as NFS or Azure storage, in standard file operations.
- Partially support running multiple trials in parallel on the runner side. This is not yet supported on the trial dispatcher side.

Other minor changes:

- Add log_level to TS UT, so that UT can show debug-level logs.
- Expose platform in the start info.
- Add RouterTrainingService to keep the original OpenPAI training service and to support dynamic IOC binding.
- Add more GPU info for future use, including GPU memory total/free/used and GPU type.
- Make some license information consistent.
- Fix async/await problems with Array.forEach, which does not actually support async callbacks.
- Fix IT errors on data download, which were caused by my #2484.
- Speed up some run loops by reducing sleep seconds.
Because of a concurrency issue when downloading the PyTorch MNIST data, the dataset path used to include the trial ID. That, however, leaves many copies of the data under the /tmp folder. This fix changes the dataset folder to a relative path to avoid the duplicate data.
This may bring the concurrency issue back on the local platform (but not on the others), but it is already mitigated. First, if the data has already been downloaded, PyTorch verifies its MD5 and does not download it again. Second, the PyTorch examples run as a single instance, and the test cases have been changed to run as a single instance as well.
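The "verify MD5, then skip the download" behaviour described above can be sketched with the standard library alone. This is only an illustration of the idea, not the code PyTorch or this PR actually uses: `ensure_dataset`, `fake_fetch`, the `data` folder name, and the `mnist.bin` file name are all hypothetical, and the network download is replaced by an in-memory payload.

```python
import hashlib
import os
import tempfile


def md5_of(path):
    """Return the hex MD5 digest of the file at `path`."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def ensure_dataset(root, filename, expected_md5, fetch):
    """Place `filename` under `root` unless a verified copy already exists.

    `fetch` is a hypothetical callable that returns the file's bytes; it
    stands in for the real network download.
    Returns True if a download happened, False if the cached copy was reused.
    """
    os.makedirs(root, exist_ok=True)
    target = os.path.join(root, filename)
    # A shared folder only helps if an existing, intact file is reused.
    if os.path.isfile(target) and md5_of(target) == expected_md5:
        return False
    with open(target, "wb") as out:
        out.write(fetch())
    if md5_of(target) != expected_md5:
        raise IOError("checksum mismatch for " + target)
    return True


def demo():
    """Call ensure_dataset twice on one shared folder and count downloads."""
    payload = b"placeholder dataset bytes"  # assumption: stand-in for MNIST
    expected = hashlib.md5(payload).hexdigest()
    calls = []

    def fake_fetch():
        calls.append(1)
        return payload

    with tempfile.TemporaryDirectory() as workdir:
        root = os.path.join(workdir, "data")  # one folder shared by all runs
        first = ensure_dataset(root, "mnist.bin", expected, fake_fetch)
        second = ensure_dataset(root, "mnist.bin", expected, fake_fetch)
    return first, second, len(calls)
```

In this sketch the second call finds a file whose checksum matches and returns without fetching, which is why one shared folder is enough when trials run one at a time. Two trials writing the same file at the same moment would still race, which matches the PR's note that the test cases were reduced to a single instance.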
Small improvements,