
Updating 1.0.3 -> 1.0.4 breaks pytorch module hooks #5131

Closed · ludwigwinkler opened this issue Dec 14, 2020 · 14 comments · Fixed by #6305
Labels: bug (Something isn't working) · priority: 0 (High priority task) · waiting on author (Waiting on user action, correction, or update)

Comments


ludwigwinkler commented Dec 14, 2020

1.0.3 -> 1.0.4 breaks pytorch-hooks

What is your question?

I implemented a custom optimizer that stores the input and the incoming gradients for each layer and recomputes the gradients in a per-sample fashion (which I do because I want to use K-FAC for my optimization).

I'm using PyTorch hooks on modules, i.e. module123.register_forward_pre_hook(_save_input_function), which stores the inputs in the optimizer's state dictionary, and the corresponding module123.register_backward_hook(_save_gradient_function) for the backward gradients. The stored quantities are then used when calling optimizer.step().
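
For reference, a minimal sketch of that kind of hook setup (the model layout and attribute names below are placeholders, not the actual K-FAC code):

```python
import torch.nn as nn

def _save_input_function(module, inputs):
    # forward pre-hook: stash the layer input for the per-sample computation
    module._saved_input = inputs[0].detach()

def _save_gradient_function(module, grad_input, grad_output):
    # backward hook: stash the gradient w.r.t. the layer output
    module._saved_grad_output = grad_output[0].detach()

model = nn.Sequential(nn.Linear(10, 10), nn.ReLU(), nn.Linear(10, 2))
handles = []
for module in model.modules():
    if isinstance(module, nn.Linear):
        handles.append(module.register_forward_pre_hook(_save_input_function))
        handles.append(module.register_backward_hook(_save_gradient_function))
```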

When I updated from 1.0.1 to 1.1.0, it broke the hooks that the KFAC optimizer requires.
I checked different versions, and somewhere between 1.0.3 and 1.0.4 something changed such that the Trainer ignores/skips the hooks in the forward and backward passes.

I checked the release notes but couldn't identify anything essential that would warrant these breaking changes.

Additionally, there were issues popping up with manually modifying gradients in other sampling-based optimizers.

These issues do not exist for the standard optimizers straight from PyTorch's optim module.

Code

It would be a huge codebase, as the implementation of KFAC is a bit sophisticated.
But if nothing else helps, I could write a minimal working example from scratch.

ludwigwinkler added the question (Further information is requested) label Dec 14, 2020
@williamFalcon (Contributor)

@tchaton

@williamFalcon (Contributor)

@ludwigwinkler I suspect using the boring model with an optimizer that overrides a hook will replicate this problem.


ludwigwinkler commented Dec 14, 2020

Hi,

This is a boring model with an optimizer that uses forward and backward hooks to do stuff with the parameters:
https://colab.research.google.com/drive/1z69hmWUybxW58iGMR9EQ3wlT5jwxJxzt?usp=sharing

Let me know if something is unclear/needs more comments or needs polishing.

Thanks for all the effort you put into this project. =)

stale bot added the won't fix (This will not be worked on) label Jan 14, 2021
@ludwigwinkler (Author)

@tchaton Did you have time to look at this?

stale bot removed the won't fix (This will not be worked on) label Jan 14, 2021

lizhitwo commented Jan 20, 2021

I had the same bug, but it turns out to be: when it is destroyed, pytorch_lightning.core.optimizer.LightningOptimizer calls its __del__, which in turn calls KFAC's __del__, which deletes all the hooks.

From a user's standpoint, it is very horrifying that pytorch lightning automatically calls my optimizer's __del__ even though I am still using my optimizer somewhere.
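
For illustration, the failure mode is roughly the following (a hypothetical sketch with a made-up KFACLike class, not the actual LightningOptimizer or K-FAC code): the custom optimizer removes its hooks in __del__, so anything that triggers that __del__ silently strips the hooks while the optimizer is still in use.

```python
import torch
import torch.nn as nn

class KFACLike(torch.optim.Optimizer):
    """Hypothetical optimizer that owns module hooks (illustration only)."""

    def __init__(self, model, lr=1e-2):
        super().__init__(model.parameters(), dict(lr=lr))
        # keep the hook handles around so they can be removed later
        self.handles = [
            m.register_forward_pre_hook(self._save_input)
            for m in model.modules() if isinstance(m, nn.Linear)
        ]

    def _save_input(self, module, inputs):
        self.state[module]["input"] = inputs[0].detach()

    def __del__(self):
        # The problematic part: if a wrapper object ends up calling this
        # (e.g. when it is garbage-collected), the hooks are removed even
        # though the user still holds a reference to this optimizer.
        for handle in self.handles:
            handle.remove()

    def step(self, closure=None):
        return closure() if closure is not None else None
```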

@ludwigwinkler (Author)

@lizhitwo thank you for your insight!
Where can I find the deletion call in the source code?
Or should I override my custom optimizer's __del__ function?


lizhitwo commented Jan 22, 2021 via email

Borda added the bug (Something isn't working) label Feb 4, 2021
edenlightning added the priority: 0 (High priority task) label Feb 9, 2021
Borda removed the question (Further information is requested) label Mar 1, 2021
Lightning-AI deleted a comment from stale bot Mar 1, 2021

tchaton commented Mar 2, 2021

Dear @ludwigwinkler @lizhitwo,

Great find!
We attached most functions from your optimizer to the LightningOptimizer, so it exposes the same attributes and properties.

We didn't think about the __del__ function being triggered and accidentally deleting your hooks.

I couldn't reproduce the bug you shared. The notebook contained an error: The provided closure wasn't being called, so the hooks weren't triggered.

I have made a PR: #6305. Please check that it covers your use case.

Sorry for the inconvenience.

Best,
T.C

tchaton added the waiting on author (Waiting on user action, correction, or update) label Mar 2, 2021

ludwigwinkler commented Apr 26, 2021

Sorry for the long delay.

So I revisited my old code, but the deletion of the hooks still occurs in version 1.2.6. :(

You can check the colab boring model here:
https://colab.research.google.com/drive/1z69hmWUybxW58iGMR9EQ3wlT5jwxJxzt?usp=sharing

The optimizer's internal state won't be able to reference the results of the hooks, which are stored in the internal dictionary.

@lizhitwo

Hi ludwigwinkler, from the stack it seems that your code runs the optimizer step before the closure (the closure in def step(self, closure=None)). For some time now, pl has required us users to either pass the closure to the parent class to handle, or run the closure ourselves; the closure is what actually calls training_step, etc. Otherwise no forward pass is even evaluated, which leads to no hooks being invoked.

To pl devs: Is it also a bug that in this tutorial the closure is only evaluated in some iterations, but just ignored (no forward passes) in others?

@ludwigwinkler (Author)

Hi lizhitwo,
I'm only familiar with the concept of closures in optimizers like L-BFGS, which recompute multiple values and/or states to determine the right step size in optim.step(closure=closure).
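
(For context, the standard PyTorch pattern being referred to looks roughly like this; model, x, y and loss_fn are placeholder names:)

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
x, y = torch.randn(32, 10), torch.randn(32, 1)
loss_fn = nn.MSELoss()
optimizer = torch.optim.LBFGS(model.parameters())

def closure():
    # L-BFGS may re-evaluate the loss several times per step,
    # so the forward/backward pass lives inside the closure
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    return loss

optimizer.step(closure)
```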

Do you mean that the forward pass should be encapsulated in a closure and passed to the optimizer, such that the forward pass itself is executed within the optimizer?

Thanks for your effort to help me and everybody else out on this,
Ludwig


lizhitwo commented May 3, 2021

I'm only familiar with the concept of closures in optimizers like L-BFGS

Yeah, outside pytorch lightning the closure is only useful for L-BFGS. In pl, everything uses a closure, whose purpose I think is to support all optimizers at once, including L-BFGS.

Do you mean that the forward pass should be encapsulated in a closure

That's already done automatically in pl. Inside your optimizer's step, you can try printing the closure argument, which will be a function instead of None. You can add print statements to your training step and your optimizer step to observe this behavior. You should see that the training step is not run if you don't call the closure, but if you call it (closure()) you should see the training step being called.
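
In code, the fix boils down to running the closure inside the custom step() (a minimal sketch; the class name and the plain-SGD update below are placeholders for the actual K-FAC logic):

```python
import torch

class ClosureAwareOptimizer(torch.optim.Optimizer):
    def __init__(self, params, lr=1e-2):
        super().__init__(params, dict(lr=lr))

    def step(self, closure=None):
        loss = None
        if closure is not None:
            # In Lightning, the closure wraps training_step + backward, so
            # skipping it means no forward pass and no hooks being fired.
            loss = closure()
        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is not None:
                    p.data.add_(p.grad, alpha=-group["lr"])
        return loss
```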


ludwigwinkler commented May 3, 2021

Wow ... I never would have thought that simply adding closure() inside my optimizer.step() would solve the entire problem. :)
I wasn't aware of that optimizer functionality in pl. My bad ...

Wonderful! Thank you very much @lizhitwo for your help!


lizhitwo commented May 4, 2021

Glad to help someone who fell into the same pit as myself. Frankly, I think pytorch lightning should throw a warning if closure() is not called in optimizer_step...
