PyTorch weight_decay ambiguity with shared params #1319
Comments
I don't understand. Why is it not possible to fix it? Why not make the params unique? You would just keep a list of visited params, and if some param is already in there, skip it.
Can you please give a short example in which different modules share common parameters? I didn't know that this was possible (although it would make sense) and a brief demo would help me grasp the concept. In principle we can do what @albertz mentioned, keeping a set of visited parameters.
Any nested modules will cause this problem, because the parameter access is recursive. Making it not recursive might already solve the problem, but then we cannot use strings anymore as is currently done, because you would get name repetitions.
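For illustration only (my own example, not from the thread): a minimal sketch of a parameter shared between two modules via weight tying, and of how iterating over all named modules while recursing into their parameters yields the same parameter multiple times with different module references.

```python
import torch

class TiedModel(torch.nn.Module):
    """Toy model where the output projection shares its weight with the embedding."""
    def __init__(self):
        super().__init__()
        self.embed = torch.nn.Embedding(10, 4)
        self.out = torch.nn.Linear(4, 10, bias=False)
        self.out.weight = self.embed.weight  # weight tying -> one shared Parameter

model = TiedModel()

# Iterating over all named modules and then recursively over their parameters
# visits the same Parameter object several times, under different module references.
for mn, m in model.named_modules():
    for pn, p in m.named_parameters():
        print(f"module={mn!r} param={pn!r} id={id(p)}")
# The shared weight shows up for the root module, for "embed" and for "out".
```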
It does not solve this issue, because we are testing for the module name in the blacklist. Thus in my case the same parameter was added to both parameter groups and I got a crash. For sure there are solutions, but since I did not come up with one after a few minutes, I wanted to post this.
Sure, it will solve it. You will never visit the same param twice. Thus the problem is solved.
See the code in `_get_optimizer_param_groups`.
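A minimal sketch of the deduplication idea described above (hypothetical code, not the actual RETURNN implementation), assuming `model` is some `torch.nn.Module` such as the one in the earlier example: track visited parameters by object identity and skip any parameter that has already been assigned to a group.

```python
# Hypothetical sketch of the "skip already visited params" idea.
visited = set()          # ids of parameters we have already assigned
decay, no_decay = [], []
for mn, m in model.named_modules():
    for pn, p in m.named_parameters():
        if id(p) in visited:
            continue     # shared/duplicate parameter, already handled
        visited.add(id(p))
        # Placeholder heuristic: apply weight decay to weights, not to biases.
        (no_decay if pn.endswith("bias") else decay).append(p)

param_groups = [
    {"params": decay, "weight_decay": 0.01},
    {"params": no_decay, "weight_decay": 0.0},
]
```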
I think there might be a solution if we check only for collisions with respect to the …
Yes sure, just like in …
@JackTemaki please check #1320, hopefully it will fix the issue 🙂 @albertz I used similar functionality as …
As per pytorch/pytorch#2569, the hashing function of a …
Yes the use of …
@albertz got it. Both functions are obtained from …
>>> t1 = torch.tensor([1])
>>> t2 = torch.tensor([1])
>>> torch.nn.Parameter(t1, requires_grad=False) == torch.nn.Parameter(t2, requires_grad=False)
tensor([True])
I'll use …
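For context, a small demo of the `Tensor`/`Parameter` semantics being discussed (my own example, assuming standard PyTorch behaviour): `==` compares values element-wise and returns a tensor, while hashing falls back to object identity, so a set deduplicates by identity rather than by value.

```python
import torch

p1 = torch.nn.Parameter(torch.tensor([1.0]), requires_grad=False)
p2 = torch.nn.Parameter(torch.tensor([1.0]), requires_grad=False)

print(p1 == p2)              # tensor([True])  -> element-wise value comparison
print(hash(p1) == hash(p2))  # False           -> hash is based on object identity
print(p1 is p2)              # False           -> two distinct Parameter objects

# A set therefore keeps both equal-valued parameters, but collapses repeated
# references to the same object -- which is what we want for shared params.
print(len({p1, p2, p1}))     # 2
```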
It seems my initial posting was not clear. The error is NOT only that we are visiting the same parameter twice, but that the module we use for the blacklist check is different/wrong. If you put in your fix now, the blacklist check effectively does nothing for my case.
Why is it wrong? It's anyway a heuristic. There is no "right way" here. As long as it is not crashing, it is fine I would say. Or do you have a suggestion for a better heuristic?
Let's say it is not "wrong", but a no-op then. If we anyway do not use the blacklist, the whole code can be removed, or keep the "bias" part only.
I don't understand what you mean. We still have that blacklist? It's working fine for all the normal cases? Why would we remove it? It is just ambiguous for the case you are describing here, i.e. sharing params. As long as the heuristic is deterministic for such (rare) ambiguous cases, which it is, it is fine, I would say. I don't see a reason to completely remove the heuristic now.
I see, I think I understand better now. The solution would then involve something like letting the user specify certain groups of modules they wish to blacklist, as I understand it?
Fix #1319 Co-authored-by: Albert Zeyer <[email protected]>
This could be another solution. But I would say this should be up to the user. For most users, I think the current heuristic should be fine anyway.
(The issue title was changed from "weight_decay code includes parameters twice" to "weight_decay ambiguity with shared params".)
I think @JackTemaki's point is that this isn't customizable as of now. Maybe we can work on customizing it? A straight-forward solution could involve adding a key for this to the config.
So I will leave this open, and I understand this as a feature request now, not as a bug anymore (it was a bug that it was crashing, but that was fixed via #1320). I don't think you can say that the current code is wrong now, as it is just a heuristic. You could just say the heuristic is bad. So, I understand this as a feature request to also have other potential heuristics, or maybe the requirement that the user specifies everything explicitly, or so. This should all be configurable then, as soon as we have such other options.
First of all, it should be possible for the user to specify an own custom function to determine the optimizer group for each param, something like a user-provided callback in the config. I'm a bit hesitant to add options for the heuristic itself.
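A sketch of what such a user-specified function could look like in the config. The option name `param_group_func` and its signature are made up here for illustration; this is not an existing RETURNN option.

```python
# Hypothetical config sketch -- option name and signature are invented.
def my_param_group_func(module_name: str, param_name: str, param) -> dict:
    """Return per-parameter optimizer options, e.g. the weight decay to use."""
    if param_name.endswith("bias") or "norm" in module_name.lower():
        return {"weight_decay": 0.0}
    return {"weight_decay": 0.01}

# Loosely following the RETURNN-style optimizer dict in the config:
optimizer = {
    "class": "adamw",
    "param_group_func": my_param_group_func,  # hypothetical, not implemented
}
```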
I see. Then we can simply let the user specify their own blacklist. This would work in the PyTorch backend, but I'm wondering if this would have any effect in the TensorFlow net-dict backend (I'm currently thinking of issues when defining a RETURNN frontend config), as we don't have any custom weight decay heuristics set there, do we? But I guess this would be a separate issue.
Yes. But I think it should be a function, because that is usually the more reasonable way to specify it. But before starting to implement this now, I would wait until someone says that he/she really wants to have it.
Why would it? This option is purely for the PT engine, or not?
No heuristic, but the user just specifies it per layer. Every layer has the `L2` option.
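For reference, per-layer weight decay in the TF net-dict backend looks roughly like this, if I recall the convention correctly (layer names and values are purely illustrative):

```python
# Illustrative TF net-dict snippet: weight decay is set per layer via "L2".
network = {
    "hidden": {"class": "linear", "from": "data", "n_out": 256,
               "activation": "relu", "L2": 1e-4},
    "output": {"class": "softmax", "from": "hidden", "loss": "ce"},  # no L2 here
}
```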
No, no need for that. I just do not understand where this issue ended up, this was at no point about shared params.
So, how do you solve this problem then?
I don't understand. In your initial post, you write that iterating over all named modules will yield the same parameter multiple times with a different module reference.
So this means shared params, or not?
What problem? I have no need for a custom blacklist.
No. As I clearly wrote in my second post: the error is not only that we are visiting the same parameter twice, but that the module we use for the blacklist check is different/wrong.
You said the current (or original) logic of `_get_optimizer_param_groups` can not run in its current form.
Oh, I think I understand now. From this, originally I thought you mean shared parameters. But the fix is easy, or not? Just this line:
for pn, p in m.named_parameters():
needs to be changed to:
for pn, p in m.named_parameters(recurse=False):
That's all, or not?
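A small sketch of this suggested change (my own illustration, assuming `model` is some `torch.nn.Module`): with `recurse=False`, each module yields only the parameters it owns directly, so every parameter is paired with exactly one owning module.

```python
for mn, m in model.named_modules():
    for pn, p in m.named_parameters(recurse=False):
        full_name = f"{mn}.{pn}" if mn else pn
        # Each parameter now appears once per owning module (so still twice if it
        # is literally shared between two modules, as discussed above).
        print(full_name, "->", type(m).__name__)
```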
Now maybe yes, but in the beginning this would not have solved the issue because of how the dicts with string parameter names are handled; I am not sure about the current code.
I don't understand how our current code changes anything about this. We just skip duplicate parameters, that's all. Even without this change, using …
Original issue description:
The code logic of the `_get_optimizer_param_groups` branch with activated weight decay can not run in its current form. Iterating over all named modules while at the same time iterating recursively over parameters will yield the same parameter multiple times with a different module reference. In my case this happened for `speaker_embedding.weight`. This means we need a completely new logic if we want to exclude some modules.