[WIP] script for distributed ssd training #541

Closed
eric-haibin-lin wants to merge 1 commit

Conversation

eric-haibin-lin (Member)

  • added split sampler (credit to @hetong007); a minimal sketch follows this list
  • added an LR scheduler that can be used on a dist kvstore server
  • added non-blocking hybrid SSD multibox
  • added an example script for SSD distributed training
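For context, here is a minimal sketch of a split sampler in the style this PR describes, assuming the common MXNet pattern where each worker iterates over its own disjoint shard of the dataset. The class body and parameter names below are illustrative, not the PR's actual code:

```python
import random
from mxnet import gluon

class SplitSampler(gluon.data.Sampler):
    """Split `length` samples into `num_parts` shards; worker `part_index`
    only ever sees its own shard (illustrative sketch, not the PR's code)."""
    def __init__(self, length, num_parts=1, part_index=0):
        self.part_len = length // num_parts       # samples per worker
        self.start = self.part_len * part_index   # this worker's offset

    def __iter__(self):
        indices = list(range(self.start, self.start + self.part_len))
        random.shuffle(indices)                   # reshuffle the shard each epoch
        return iter(indices)

    def __len__(self):
        return self.part_len
```

In a distributed job, `num_parts` and `part_index` would typically come from `kv.num_workers` and `kv.rank`, so that each worker trains on a disjoint shard.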

fix bugs

update kvstore

fix step size
@zhreshold (Member) left a comment


This actually LGTM. Do you think it's ready for merge, apart from the lint issue?

@@ -106,3 +111,109 @@ def update(self, i, epoch):
            (1 + cos(pi * (T - self.warmup_N) / (self.N - self.warmup_N))) / 2
        else:
            raise NotImplementedError
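For readers of the hunk above, a hedged reconstruction of the warmup-plus-cosine rule it appears to implement. The fragment only shows the cosine factor, so the warmup branch and the standalone-function form below are assumptions; the names T, N, warmup_N follow the fragment:

```python
from math import cos, pi

def cosine_warmup_lr(T, N, warmup_N, base_lr, warmup_lr=0.0):
    """LR at iteration T out of N total, with warmup_N warmup iterations
    (illustrative sketch; not the PR's actual method)."""
    if T < warmup_N:
        # assumed linear warmup from warmup_lr up to base_lr
        return warmup_lr + (base_lr - warmup_lr) * T / warmup_N
    # cosine annealing from base_lr down to 0, matching the fragment above
    return base_lr * (1 + cos(pi * (T - warmup_N) / (N - warmup_N))) / 2
```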

class DistLRScheduler(lr_scheduler.LRScheduler):

Is there a way to make this PR and #353 compatible?

@hetong007 (Member)

Besides the scripts, we need a tutorial helping users set up the environment. In my (limited) experience, that is actually more frustrating than implementing the script.

@eric-haibin-lin (Member, Author)

@hetong007 I agree. I have a Quip doc draft for the step-by-step training setup, but I don't yet have the cycles to polish it for a wider audience.

@hetong007 (Member)

Then maybe we can at least provide a bash file and a training log for distributed training of ResNet-50 on ImageNet.
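For reference, a minimal sketch of the worker-side setup such a script would exercise, assuming a synchronous dist kvstore and a GluonCV model zoo ResNet-50; the contents of the eventual bash file and log are not specified here, and this snippet only runs inside a launched distributed job:

```python
import mxnet as mx
from mxnet import gluon
from gluoncv import model_zoo

kv = mx.kvstore.create('dist_sync')        # synchronous distributed kvstore
net = model_zoo.get_model('resnet50_v1', classes=1000)
net.initialize(mx.init.Xavier())
# gradients are aggregated through the kvstore servers on each step
trainer = gluon.Trainer(net.collect_params(), 'sgd',
                        {'learning_rate': 0.1}, kvstore=kv)
# each worker would read its own shard, e.g. via the SplitSampler sketched
# earlier, with num_parts=kv.num_workers and part_index=kv.rank
```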

@zhreshold (Member)

@eric-haibin-lin Will this still be up to date after the unified dist API modification? apache/mxnet#17010

@eric-haibin-lin (Member, Author)

Yes, #17010 is backward compatible.
