Skip to content
This repository has been archived by the owner on Sep 18, 2024. It is now read-only.

Fix parameter file bug #667

Merged
merged 28 commits into from
Jan 28, 2019
Merged

Fix parameter file bug #667

merged 28 commits into from
Jan 28, 2019

Conversation

chicm-ms
Copy link
Contributor

@chicm-ms chicm-ms commented Jan 28, 2019

To fix this multiphase SDK bug: #639

When training service is creating parameter.cfg file, sometimes the file is created but content not flushed to disk yet, during this short time period, SDK detects that the file is created, and try to load the content before the content is flushed by training service.

Solution:
SDK checks parameter.cfg file size > 0

chicm-ms and others added 28 commits November 13, 2018 14:53
Fix nniManager unit test (#515)
* Support distributed job for frameworkcontroller (#612)

support distributed job for frameworkcontroller

* Multiphase doc (#519)

* multiPhase doc

* updates

* updates

* Add time parser for 'nnictl update duration' (#632)

Current nnictl update duration only support seconds unit, add a parser for this command to support {s, m, h, d}

* fix experiment state bug (#629)

* update top README.md (#622)

* Update README.md

* update (#634)

* Integration tests refactoring (#625)

* Integration test refactoring (#21) (#616)

* Integration test refactoring (#21)

* Refactoring integration tests

* test metrics

* update azure pipeline

* updates

* updates

* updates

* updates

* updates

* updates

* updates

* updates

* updates

* updates

* updates

* updates

* updates

* updates

* updates

* updates

* updates

* updates

* updates

* updates

* update trigger

* Integration test refactoring (#618)

* updates

* updates

* update pipeline (#619)

* update pipeline

* updates

* updates

* updates

* updates

* updates

* test pipeline (#623)

* test pipeline

* updates

* updates

* updates

* Update integration test (#624)

* Update integration test

* updates

* updates

* updates

* updates

* updates

* updates
This reverts commit 62fc165.
Refactoring nnimanager log (#652)
@@ -69,7 +69,7 @@ def get_next_parameter():
params_filepath = os.path.join(_sysdir, params_file_name)
if not os.path.isfile(params_filepath):
request_next_parameter()
while not os.path.isfile(params_filepath):
while not (os.path.isfile(params_filepath) and os.path.getsize(params_filepath) > 0):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks it still exist race condition issue here, because even if the size of params_file is larger than 0, it doesn't mean the param config file is intact.
suggest to use a solution totally resolving this issue, like create a .tmp file for each new config file, and rename this .tmp file after completing writing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I will firstly merge this to unblock integration test, then adopt your solution to solve it completely.

Copy link
Contributor

@yds05 yds05 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a mitigation solution to unblock integration test, approved.

@chicm-ms chicm-ms merged commit 26bd727 into microsoft:master Jan 28, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants