Bug fix to PD3O with stochastic gradient optimisers #2043
base: master
Conversation
@epapoutsellis - In SVRG/LSVRG, if we calculate a full gradient snapshot, the results are stored and reused in future approximations. Do we want to store the results of the first or second gradient calculation in a PD3O loop? I presume the second?
```
@@ -125,7 +125,19 @@ def update(self):
        self.g.proximal(self.x_old, self.gamma, out = self.x)

        # update step
        self.f.gradient(self.x, out=self.x_old)

        if isinstance(self.f, (SVRGFunction, LSVRGFunction)):
```
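To make the snapshot question above concrete, here is a minimal, self-contained sketch of the SVRG-style bookkeeping under discussion. It is illustrative only: `ToySVRGFunction`, `snapshot_interval` and the attribute names are assumptions and do not mirror the CIL classes. Every `snapshot_interval` gradient calls, a full gradient is computed and stored together with the point it was evaluated at; other calls draw one subset from the sampler and return a variance-reduced estimate that reuses the stored snapshot.

```python
class ToySVRGFunction:
    """Illustrative SVRG-style sum function; not the CIL implementation."""

    def __init__(self, functions, sampler, snapshot_interval):
        self.functions = functions             # list of smooth summands f_i
        self.sampler = sampler                 # object with a next() -> index method
        self.snapshot_interval = snapshot_interval
        self.num_calls = 0
        self.snapshot = None                   # point of the last full-gradient pass
        self.full_gradient_at_snapshot = None  # stored full gradient, reused below

    def gradient(self, x):
        if self.num_calls % self.snapshot_interval == 0:
            # full-gradient pass: store the snapshot so later calls can reuse it
            self.snapshot = x.copy()
            self.full_gradient_at_snapshot = sum(f.gradient(x) for f in self.functions)
            out = self.full_gradient_at_snapshot
        else:
            # variance-reduced estimate; the sampler is only consumed here
            i = self.sampler.next()
            n = len(self.functions)
            out = (n * (self.functions[i].gradient(x)
                        - self.functions[i].gradient(self.snapshot))
                   + self.full_gradient_at_snapshot)
        self.num_calls += 1
        return out
```

Since PD3O evaluates the gradient of `f` at more than one point per iteration, whichever call happens to land on a full-gradient pass determines the point the snapshot is stored at, which is exactly the first-versus-second question raised above.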
It is strange that we need this only for SVRG/LSVRG. All the stochastic functions implement an `approximate_gradient` method which is called by the `gradient` method of the parent class; at least this is how I implemented it. I do not think that a `gradient` method is needed in LSVRG/SVRG.
Also, these lines can be moved into the parent `gradient` method. At the end you will have an if/else statement for the `ApproximateGradientSumFunction` and `Function` classes, to compute the `approximate_gradient` and the `gradient` respectively.
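For illustration, one reading of this suggestion on the calling side might look as follows. `ApproximateGradientSumFunction` is the parent class named above, but the `approximate_gradient` signature and the use of `sampler.next()` here are assumptions, not the actual CIL code:

```python
# Sketch of the suggested dispatch inside an algorithm's update step (illustrative only).
if isinstance(self.f, ApproximateGradientSumFunction):
    # stochastic sum function: draw one subset and form the approximate gradient
    function_num = self.f.sampler.next()
    self.f.approximate_gradient(self.x, function_num, out=self.x_old)
else:
    # ordinary Function: deterministic full gradient
    self.f.gradient(self.x, out=self.x_old)
```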
Thanks Vaggelis - I think we wrote SVRG and LSVRG like that, overriding the `gradient` method of the parent class, so that the sampler is not called if a full gradient snapshot is calculated. For example, if you are taking a full gradient every 2n iterations in SVRG and you are using a sequential sampler, then the 0th subset will be used half as much as any other subset. By overriding the parent class `gradient` method we ensure that the sampler is only ever called if an approximate gradient, not a snapshot, is needed.
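A small sketch of this sampler-skew argument, assuming four subsets, a sequential sampler and a full-gradient snapshot every 2n gradient calls (all names and numbers here are illustrative):

```python
n = 4                   # number of subsets
snapshot_every = 2 * n  # full gradient every 2n gradient calls
num_calls = 1000

counts_sampler_always_advanced = [0] * n   # sampler consumed on snapshot calls too
counts_sampler_skipped = [0] * n           # sampler only consumed for approximate gradients
state_a = state_b = 0                      # sequential sampler states

for k in range(num_calls):
    snapshot_call = (k % snapshot_every == 0)
    # variant A: the sampler is advanced on every call, but the drawn subset is
    # unused on snapshot calls; with a sequential sampler the wasted index is always 0
    i = state_a % n
    state_a += 1
    if not snapshot_call:
        counts_sampler_always_advanced[i] += 1
    # variant B: the sampler is only advanced when an approximate gradient is needed
    if not snapshot_call:
        counts_sampler_skipped[state_b % n] += 1
        state_b += 1

print(counts_sampler_always_advanced)  # [125, 250, 250, 250]: subset 0 used half as often
print(counts_sampler_skipped)          # [219, 219, 219, 218]: close to uniform
```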
I added a comment about the methods in SVRG/LSVRG. There is no need to have both `gradient` and `approximate_gradient` methods in these functions. For PD3O and stochastic functions it is absolutely fine.
Signed-off-by: Margaret Duff <[email protected]>
Changes
… `f`, in PD3O.
Testing you performed
Related issues/links
Closes #2021
Checklist
Contribution Notes
Please read and adhere to the developer guide and local patterns and conventions.