[WIP] script for distributed ssd training #541
Conversation
eric-haibin-lin
commented
Dec 29, 2018
- added a split sampler (credit to @hetong007); see the sketch below
- added an LR scheduler that can be used on a dist kvstore server
- added a non-blocking hybrid SSD MultiBox
- added an example script for SSD dist training

Commits: fix bugs, update kvstore, fix step size
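A minimal sketch of what such a split sampler could look like, assuming `mxnet.gluon.data.Sampler` as the base class; the class and argument names here are illustrative assumptions, not necessarily the PR's exact API:

```python
# Hedged sketch of a split sampler for distributed training: each worker
# draws indices only from its own shard of the dataset. Names here are
# illustrative, not the PR's actual code.
import random
from mxnet import gluon

class SplitSampler(gluon.data.Sampler):
    """Split the indices [0, length) into `num_parts` equal shards and
    sample (shuffled) only from the shard owned by `part_index`."""

    def __init__(self, length, num_parts=1, part_index=0):
        self._part_len = length // num_parts       # samples per worker
        self._start = self._part_len * part_index  # this worker's shard offset

    def __iter__(self):
        # reshuffle within the local shard every epoch
        indices = list(range(self._start, self._start + self._part_len))
        random.shuffle(indices)
        return iter(indices)

    def __len__(self):
        return self._part_len
```

In a distributed job, `num_parts` and `part_index` would typically come from the kvstore, e.g. `kv = mx.kv.create('dist_sync')` and then `SplitSampler(len(dataset), kv.num_workers, kv.rank)`.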
This actually LGTM. Do you think it's ready for merge, aside from the lint issue?
@@ -106,3 +111,109 @@ def update(self, i, epoch):
             (1 + cos(pi * (T - self.warmup_N) / (self.N - self.warmup_N))) / 2
         else:
             raise NotImplementedError
+
+class DistLRScheduler(lr_scheduler.LRScheduler):
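For context, `mxnet.lr_scheduler.LRScheduler` subclasses implement `__call__(num_update)` and return a learning rate. A hedged sketch of how a scheduler usable on a dist kvstore server might look, assuming the server's update count advances once per worker push; the wrapper and its arguments are assumptions, not the PR's actual implementation:

```python
# Hedged sketch: on a dist kvstore server the optimizer's update count
# advances roughly once per worker push, so a single-machine schedule
# has to be rescaled by the number of workers. Names are illustrative.
from mxnet import lr_scheduler

class DistLRScheduler(lr_scheduler.LRScheduler):
    def __init__(self, base_scheduler, num_workers):
        super(DistLRScheduler, self).__init__()
        self._base = base_scheduler        # e.g. a cosine or step scheduler
        self._num_workers = num_workers

    def __call__(self, num_update):
        # map server-side update counts back to per-iteration steps
        return self._base(num_update // self._num_workers)
```

Such a scheduler would then be attached to the optimizer that the server runs, e.g. `mx.optimizer.SGD(lr_scheduler=DistLRScheduler(base, kv.num_workers))`.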
Is there a way to make this PR and #353 compatible?
Besides the scripts, we need a tutorial helping users set up the environment. In my (limited) experience that is actually more frustrating than implementing the script.
@hetong007 I agree. I have a Quip doc draft for the step-by-step training setup, but I do not yet have the cycles to polish it for a wider audience.
Then maybe we can at least provide a bash file and a log for distributed training of ResNet-50 on ImageNet.
@eric-haibin-lin Will this still be up to date after the unified dist API modification (apache/mxnet#17010)?
Yes, #17010 is backward compatible.