
Updating 1.0.3 -> 1.0.4 breaks pytorch module hooks #5131

Closed · ludwigwinkler opened this issue Dec 14, 2020 · 14 comments · Fixed by #6305
Labels: bug (Something isn't working) · priority: 0 (High priority task) · waiting on author (Waiting on user action, correction, or update)

Comments


ludwigwinkler commented Dec 14, 2020

1.0.3 -> 1.0.4 breaks pytorch-hooks

What is your question?

I implemented a custom optimizer that stores the input and the incoming gradients for each layer and recomputes the gradients in a per-sample fashion (which I do because I want to use K-FAC for my optimization).

I'm using PyTorch hooks on modules, i.e. module123.register_forward_pre_hook(_save_input_function), which stores the inputs in the optimizer's state dictionary, and the corresponding module123.register_backward_hook(_save_gradient_function) for the backward gradients. The stored quantities are then used when calling optimizer.step().
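
For reference, a minimal sketch of that kind of hook setup (the model layout and attribute names below are placeholders, not the actual K-FAC code):

```python
import torch.nn as nn

def _save_input_function(module, inputs):
    # forward pre-hook: stash the layer input for the per-sample computation
    module._saved_input = inputs[0].detach()

def _save_gradient_function(module, grad_input, grad_output):
    # backward hook: stash the gradient w.r.t. the layer output
    module._saved_grad_output = grad_output[0].detach()

model = nn.Sequential(nn.Linear(10, 10), nn.ReLU(), nn.Linear(10, 2))
handles = []
for module in model.modules():
    if isinstance(module, nn.Linear):
        handles.append(module.register_forward_pre_hook(_save_input_function))
        handles.append(module.register_backward_hook(_save_gradient_function))
```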

When I updated from 1.0.1 to 1.1.0, it broke the hooks that the KFAC optimizer requires.
I checked different versions, and somewhere between 1.0.3 and 1.0.4 something changed such that the Trainer ignores/skips the hooks in the forward and backward passes.

I checked the release notes but couldn't identify anything essential that would warrant these breaking changes.

Additionally, there were issues popping up with manually modifying gradients in other sampling-based optimizers.

These issues do not exist for the standard optimizers straight from PyTorch's optim module.

Code

It would be a huge codebase, as the implementation of KFAC is a bit sophisticated.
But if nothing else helps, I could write a minimal working example from scratch.

ludwigwinkler added the question (Further information is requested) label Dec 14, 2020
@williamFalcon (Contributor)

@tchaton

@williamFalcon (Contributor)

@ludwigwinkler I suspect using the boring model with an optimizer that overrides a hook will replicate this problem.


ludwigwinkler commented Dec 14, 2020

Hi,

This is a boring model with an optimizer that uses forward and backward hooks to do stuff with the parameters:
https://colab.research.google.com/drive/1z69hmWUybxW58iGMR9EQ3wlT5jwxJxzt?usp=sharing

Let me know if something is unclear/needs more comments or needs polishing.

Thanks for all the effort you put into this project. =)

stale bot added the won't fix (This will not be worked on) label Jan 14, 2021
@ludwigwinkler (Author)

@tchaton Did you have time to look at this?

stale bot removed the won't fix (This will not be worked on) label Jan 14, 2021

lizhitwo commented Jan 20, 2021

I had the same bug, but it turns out to be: when it is destroyed, pytorch_lightning.core.optimizer.LightningOptimizer calls its __del__, which in turn calls KFAC's __del__, which deletes all the hooks.

From a user's standpoint, it is very horrifying that pytorch lightning automatically calls my optimizer's __del__ even though I am still using my optimizer somewhere.
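
For illustration, the failure mode is roughly the following (a hypothetical sketch with a made-up KFACLike class, not the actual LightningOptimizer or K-FAC code): the custom optimizer removes its hooks in __del__, so anything that triggers that __del__ silently strips the hooks while the optimizer is still in use.

```python
import torch
import torch.nn as nn

class KFACLike(torch.optim.Optimizer):
    """Hypothetical optimizer that owns module hooks (illustration only)."""

    def __init__(self, model, lr=1e-2):
        super().__init__(model.parameters(), dict(lr=lr))
        # keep the hook handles around so they can be removed later
        self.handles = [
            m.register_forward_pre_hook(self._save_input)
            for m in model.modules() if isinstance(m, nn.Linear)
        ]

    def _save_input(self, module, inputs):
        self.state[module]["input"] = inputs[0].detach()

    def __del__(self):
        # The problematic part: if a wrapper object ends up calling this
        # (e.g. when it is garbage-collected), the hooks are removed even
        # though the user still holds a reference to this optimizer.
        for handle in self.handles:
            handle.remove()

    def step(self, closure=None):
        return closure() if closure is not None else None
```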

@ludwigwinkler (Author)

@lizhitwo thank you for your insight!
Where can I find the deletion call in the source code?
Or should I override my custom optimizer's __del__ function?


lizhitwo commented Jan 22, 2021 via email

Borda added the bug (Something isn't working) label Feb 4, 2021
edenlightning added the priority: 0 (High priority task) label Feb 9, 2021
Borda removed the question (Further information is requested) label Mar 1, 2021
Lightning-AI deleted a comment from stale bot Mar 1, 2021

tchaton commented Mar 2, 2021

Dear @ludwigwinkler @lizhitwo,

Great find!
We attached most functions from your optimizer to the LightningOptimizer, so it exposes the same attributes and properties.

We didn't think about the __del__ function being triggered and accidentally deleting your hooks.

I couldn't reproduce the bug you shared. The notebook contained an error: The provided closure wasn't being called, so the hooks weren't triggered.

I have made a PR: #6305. Please check that it covers your use case.

Sorry for the inconvenience.

Best,
T.C

tchaton added the waiting on author (Waiting on user action, correction, or update) label Mar 2, 2021

ludwigwinkler commented Apr 26, 2021

Sorry for the long delay.

So I revisited my old code, but the deletion of the hooks still occurs in version 1.2.6. :(

You can check the colab boring model here:
https://colab.research.google.com/drive/1z69hmWUybxW58iGMR9EQ3wlT5jwxJxzt?usp=sharing

The optimizer's internal state won't be able to reference the results of the hooks, which are stored in the internal dictionary.

@lizhitwo

Hi ludwigwinkler, from the stack it seems that your code runs the optimizer step before the closure (the closure in def step(self, closure=None)). For some time now, pl has required us users to either pass the closure to the parent class to handle, or run the closure ourselves; the closure is what actually calls training_step, etc. Otherwise no forward pass is even evaluated, which leads to no hooks being invoked.

To pl devs: Is it also a bug that in this tutorial the closure is only evaluated in some iterations, but just ignored (no forward passes) in others?

@ludwigwinkler (Author)

Hi lizhitwo,
I'm only familiar with the concept of closures in optimizers like L-BFGS, which recompute multiple values and/or states to determine the right step size in optim.step(closure=closure).
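
(For context, the standard PyTorch pattern being referred to looks roughly like this; model, x, y and loss_fn are placeholder names:)

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
x, y = torch.randn(32, 10), torch.randn(32, 1)
loss_fn = nn.MSELoss()
optimizer = torch.optim.LBFGS(model.parameters())

def closure():
    # L-BFGS may re-evaluate the loss several times per step,
    # so the forward/backward pass lives inside the closure
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    return loss

optimizer.step(closure)
```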

Do you mean that the forward pass should be encapsulated in a closure and passed to the optimizer, such that the forward pass itself is executed within the optimizer?

Thanks for your effort to help me and everybody else out on this,
Ludwig


lizhitwo commented May 3, 2021

I'm only familiar with the concept of closures in optimizers like L-BFGS

Yeah, outside pytorch lightning the closure is only useful for L-BFGS. In pl, everything uses a closure, whose purpose I think is to support all optimizers at once, including L-BFGS.

Do you mean that the forward pass should be encapsulated in a closure

That's already done automatically in pl. Inside your optimizer's step, you can try printing the closure argument, which will be a function instead of None. You can add print statements to your training step and your optimizer step to observe this behavior. You should see that the training step is not run if you don't call the closure, but if you call it (closure()) you should see the training step being called.
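
In code, the fix boils down to running the closure inside the custom step() (a minimal sketch; the class name and the plain-SGD update below are placeholders for the actual K-FAC logic):

```python
import torch

class ClosureAwareOptimizer(torch.optim.Optimizer):
    def __init__(self, params, lr=1e-2):
        super().__init__(params, dict(lr=lr))

    def step(self, closure=None):
        loss = None
        if closure is not None:
            # In Lightning, the closure wraps training_step + backward, so
            # skipping it means no forward pass and no hooks being fired.
            loss = closure()
        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is not None:
                    p.data.add_(p.grad, alpha=-group["lr"])
        return loss
```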


ludwigwinkler commented May 3, 2021

Wow ... I never would have thought that simply adding closure() inside my optimizer.step() would solve the entire problem. :)
I wasn't aware of that optimizer functionality in pl. My bad ...

Wonderful! Thank you very much @lizhitwo for your help!


lizhitwo commented May 4, 2021

Glad to help someone who fell into the same pit as myself. Frankly, I think pytorch lightning should throw a warning if closure() is not called in optimizer_step...
