Pull latest code
pull code
Fix nniManager unit test (#515)
* Support distributed job for frameworkcontroller (#612)
* Multiphase doc (#519)
* multiPhase doc
* updates
* Add time parser for 'nnictl update duration' (#632): 'nnictl update duration' currently supports only seconds; add a parser so the command accepts the units {s, m, h, d}
* fix experiment state bug (#629)
* update top README.md (#622)
* Update README.md
* update (#634)
* Integration tests refactoring (#625)
* Integration test refactoring (#21) (#616)
* Integration test refactoring (#21)
* Refactoring integration tests
* test metrics
* update azure pipeline
* updates
* update trigger
* Integration test refactoring (#618)
* updates
* update pipeline (#619)
* update pipeline
* updates
* test pipeline (#623)
* test pipeline
* updates
* Update integration test (#624)
* Update integration test
* updates
This reverts commit 62fc165.
Revert "Pull code"
Detect tuner failing (#635)
Refactoring nnimanager log (#652)
@@ -69,7 +69,7 @@ def get_next_parameter():
     params_filepath = os.path.join(_sysdir, params_file_name)
     if not os.path.isfile(params_filepath):
         request_next_parameter()
-    while not os.path.isfile(params_filepath):
+    while not (os.path.isfile(params_filepath) and os.path.getsize(params_filepath) > 0):
It looks like a race condition still exists here: even if params_file is larger than 0 bytes, that does not mean the parameter config file is intact.
I suggest a solution that fully resolves the issue, such as writing each new config file to a .tmp file and renaming the .tmp file once writing is complete.
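To illustrate the .tmp-and-rename idea, here is a minimal Python sketch. It is not code from this PR or from the training service; the function name `write_config_atomically` and the JSON payload are assumptions made only for the example. The point is that a reader polling for the final file name can never observe a partially written file.

```python
import json
import os
import tempfile


def write_config_atomically(path, params):
    """Hypothetical helper: write params to path via a .tmp file and a rename."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file next to the target so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(suffix='.tmp', dir=directory)
    try:
        with os.fdopen(fd, 'w') as tmp_file:
            json.dump(params, tmp_file)
            tmp_file.flush()
            # Ask the OS to push the content to disk before the rename.
            os.fsync(tmp_file.fileno())
        # os.replace is atomic on POSIX: readers see no file or the complete file.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

With a writer like this, the existing `os.path.isfile(params_filepath)` check on the reading side would be enough, because the final name only appears after the content is complete.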
Thanks. I will first merge this to unblock the integration test, then adopt your solution to fix the issue completely.
Approved as a mitigation to unblock the integration test.
To fix this multiphase SDK bug: #639
When the training service creates the parameter.cfg file, the file sometimes exists before its content has been flushed to disk. During this short window, the SDK detects that the file exists and tries to load its content before the training service has flushed it.
Solution:
The SDK checks that the parameter.cfg file size is > 0 before loading it.
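As a rough standalone sketch of this check (not the SDK code itself), the wait could look like the following. The helper name `wait_for_nonempty_file` and its `poll_interval` and `timeout` parameters are hypothetical and added only for illustration; the PR changes only the loop condition.

```python
import os
import time


def wait_for_nonempty_file(path, poll_interval=1.0, timeout=None):
    """Hypothetical helper: block until path exists and is non-empty.

    Returns True once the file has content, or False if timeout seconds pass first.
    """
    deadline = None if timeout is None else time.monotonic() + timeout
    # Same condition as the diff: the file must exist and have size > 0.
    while not (os.path.isfile(path) and os.path.getsize(path) > 0):
        if deadline is not None and time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)
    return True
```

A non-zero size only shows that some bytes have been written, not that the whole file has, so this remains a mitigation rather than a complete fix.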