
Add CustomEngine parameter #1306

Closed
wants to merge 1 commit into from

Conversation

JackTemaki
Collaborator

Allows defining a custom engine in the config, e.g. one that derives from the existing engines but overrides some functions.

This would be helpful for defining very custom training functions and gradient handling, e.g. turn-based generator/discriminator setups or experimenting directly with mixed-precision training.
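A rough sketch of how this could look from the config side; the option name custom_engine and the import path are assumptions for illustration, not necessarily what this PR implements:

# RETURNN config excerpt (hypothetical): point the proposed option at a
# user-defined engine subclass. The option name "custom_engine" is an assumption.
from my_setup.engine import CustomEngine  # hypothetical user module defining the engine subclass

custom_engine = CustomEngine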

@JackTemaki JackTemaki requested review from a team and albertz as code owners April 4, 2023 17:48
@albertz
Member

albertz commented Apr 4, 2023

I'm not sure this is the best way to do this.

In general, I don't like deriving from classes and overwriting parts of them too much. Such code is hard to follow, especially if the base class is already complex. It will also break easily, because the API between the user and RETURNN is not well defined: the current Engine implementation effectively becomes the API. Any change in any part of that internal implementation can then potentially break user code, so on the RETURNN side we cannot really guarantee that we won't break some user code. This is maybe ok for a few hacks, when the user never intends to update RETURNN. In practice, though, those small hacks end up being used for many years, and the people relying on them can no longer update RETURNN. This is what we want to avoid.

Btw, as part of the discussion in #1120, it was often said that if the user wants to implement a very custom training loop or other very custom scripts, we recommend that the user just writes their own custom script and calls it directly. The custom script would use from RETURNN whatever is needed. RETURNN should be modular, so that you can take individual pieces, and if some part of RETURNN is difficult to use this way, we should work on improving it. Most other frameworks (Keras, PyTorch itself, PyTorch Lightning, etc.) work this way and provide mostly the same functionality as RETURNN, so I think it is definitely possible to design things this way.

PyTorch Lightning is maybe one example which is a bit similar to deriving Engine here. But there, you instead derive LightningModule, and there are a few well-defined functions which you can override to make it more custom. See for example the GAN example.
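As a rough illustration of that pattern (a minimal sketch, not taken from the linked GAN example; the models, losses and learning rates are placeholders): the training loop stays inside Lightning, and the user only overrides a few well-defined hooks such as training_step and configure_optimizers, here with automatic_optimization = False so the hook also handles backward and the optimizer steps.

import torch
import pytorch_lightning as pl


class GANModule(pl.LightningModule):
    def __init__(self, generator: torch.nn.Module, discriminator: torch.nn.Module):
        super().__init__()
        self.generator = generator
        self.discriminator = discriminator
        # take over the backward/optimizer calls ourselves
        self.automatic_optimization = False

    def configure_optimizers(self):
        opt_g = torch.optim.Adam(self.generator.parameters(), lr=2e-4)
        opt_d = torch.optim.Adam(self.discriminator.parameters(), lr=2e-4)
        return opt_g, opt_d

    def training_step(self, batch, batch_idx):
        opt_g, opt_d = self.optimizers()
        # placeholder generator loss; step the generator optimizer manually
        g_loss = self.generator(batch).sum()
        opt_g.zero_grad()
        self.manual_backward(g_loss)
        opt_g.step()
        # placeholder discriminator loss; step the discriminator optimizer manually
        d_loss = self.discriminator(batch).sum()
        opt_d.zero_grad()
        self.manual_backward(d_loss)
        opt_d.step()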

@JackTemaki
Collaborator Author

> I'm not sure this is the best way to do this.
>
> In general, I don't like deriving from classes and overwriting parts of them too much. Such code is hard to follow, especially if the base class is already complex. It will also break easily, because the API between the user and RETURNN is not well defined.

Hmm okay, so then I guess it is best if I work with a local version for now.

> The current Engine implementation effectively becomes the API. Any change in any part of that internal implementation can then potentially break user code, so on the RETURNN side we cannot really guarantee that we won't break some user code. This is maybe ok for a few hacks, when the user never intends to update RETURNN. In practice, though, those small hacks end up being used for many years, and the people relying on them can no longer update RETURNN. This is what we want to avoid.

I am still in favor of versioning and linking experiments to a version. So far we have not managed to get people to keep RETURNN updated, and I am not sure we can change that in the future.

> Btw, as part of the discussion in #1120, it was often said that if the user wants to implement a very custom training loop or other very custom scripts, we recommend that the user just writes their own custom script and calls it directly. The custom script would use from RETURNN whatever is needed. RETURNN should be modular, so that you can take individual pieces, and if some part of RETURNN is difficult to use this way, we should work on improving it. Most other frameworks (Keras, PyTorch itself, PyTorch Lightning, etc.) work this way and provide mostly the same functionality as RETURNN, so I think it is definitely possible to design things this way.

> PyTorch Lightning is maybe one example which is a bit similar to deriving Engine here. But there, you instead derive LightningModule, and there are a few well-defined functions which you can override to make it more custom. See for example the GAN example.

This sounds like I should rather consider using, e.g., PyTorch Lightning instead of RETURNN and just importing the datasets from RETURNN? I do not think that is the right way. But still, we should keep track of what benefit RETURNN actually provides us.

Anyway, this again drifts away from this PR, so I guess I will close it for now and figure something out myself.

@JackTemaki JackTemaki closed this Apr 5, 2023
@albertz
Member

albertz commented Apr 5, 2023

> I am still in favor of versioning and linking experiments to a version. So far we have not managed to get people to keep RETURNN updated, and I am not sure we can change that in the future.

You can still do that, but we should nevertheless try to make the transition to the newest RETURNN version easy for people and not create friction there; otherwise people cannot really share common code if it requires some custom set of versions. That means that, if possible, we should define APIs in a way that lets us keep compatibility easily. I.e., designing APIs in a way where they are likely to break in the future is a bad idea.

> PyTorch Lightning is maybe one example which is a bit similar to deriving Engine here ... GAN example.

> This sounds like I should rather consider using, e.g., PyTorch Lightning instead of RETURNN and just importing the datasets from RETURNN? I do not think that is the right way.

No, I meant that if we want an API in RETURNN which allows the user to define custom training loops, or to customize other things, we can look at frameworks like PyTorch Lightning as inspiration for how to design such an API. I am just saying that we should have a well-defined API for this, and simply using EngineBase or the other engines as base classes is not a good API. The API should:

  1. Be simple and straightforward for the user to use.
  2. Be well defined and documented.
  3. Be designed to stay forward-compatible, i.e. setups using the API do not break with future RETURNN versions.

What I wrote before, though, is another alternative: the user can really write their own custom script, and use parts of RETURNN in a modular way, e.g. LR scheduling, model checkpoint cleanup logic, model loading/restoring logic, dataset loaders, but otherwise write their own custom training loop. We should try to design things in RETURNN such that this is also easy to do. I think that would improve the internal structure of RETURNN anyway, if things are more modular and decoupled.

It's then up to the user whether they write a custom script with a custom training loop, or still use RETURNN and such a well-defined API to define a custom training loop while otherwise keeping everything from RETURNN. I don't really know which way is better; both are probably valid in certain situations. I think we should make both ways possible.

> But still, we should keep track of what benefit RETURNN actually provides us.

Yes, we always should think about this.

When you think about writing your own custom script, RETURNN ideally provides all the building blocks that help you, like LR scheduling, model checkpoint cleanup logic, model loading/restoring logic, dataset loaders, and whatever else, so that you can easily put things together in just a few lines of code.

PyTorch Lightning, just like many other frameworks, has a fit(...) function which you call in the end. We could have something similar in RETURNN, for users who want to write their own custom script using building blocks from RETURNN.
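To make the "own custom script" idea a bit more concrete, here is a minimal sketch of the shape such a script could take. The RETURNN building blocks are only indicated as placeholder comments, since their exact modular entry points are not defined yet; the model, dataset and loss are stand-ins.

import torch


def main():
    # placeholders: here one would construct the RETURNN dataset loader,
    # LR scheduling, checkpoint cleanup and model loading/restoring helpers
    model = torch.nn.Linear(80, 100)  # stand-in model
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    dataset = [torch.randn(8, 80) for _ in range(100)]  # stand-in for a RETURNN dataset

    for epoch in range(10):
        for batch in dataset:
            optimizer.zero_grad()
            loss = model(batch).pow(2).mean()  # stand-in loss
            loss.backward()
            optimizer.step()
        # LR scheduling and checkpoint cleanup would hook in here, once per epoch


if __name__ == "__main__":
    main()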

When you actually prefer to use the RETURNN entry point and a config to define things, the question is still valid, i.e. what benefit RETURNN provides then, and what parts it should handle to make things easier.

So, coming back to the topic here: when you want to use the RETURNN entry point and want a way to define a custom training loop, we should think about a well-defined API for this. We can use PyTorch Lightning as inspiration for such an API.

E.g. in the GAN example, or in Own your loop (advanced), the training loop itself stays unmodified, but they customize the train step by overriding training_step, and I think by setting automatic_optimization = False (I did not really check the details, maybe I got something wrong). We could have the same, or something similar. Actually, similar to the API from #1120, we could have:

def custom_training_step(*, model: Any, extern_data: TensorDict, training_state: Any):
  ...

But the difference from train_step in #1120 is that this custom_training_step would also cover the actual loss.backward() call, the optimizer logic, etc. (I think this is what automatic_optimization = False does in PT Lightning.) So you really would do everything here. In training_state, you could store the optimizer state, or whatever you want.
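As a rough sketch of how a user could fill in such a custom_training_step under this suggestion; everything below is an assumption for illustration (the dict-like access on extern_data and training_state, the data keys, and the concrete optimizer handling), not an existing RETURNN API.

from typing import Any

import torch


def custom_training_step(*, model: Any, extern_data: Any, training_state: Any):
    # create the optimizer once and keep it in the training state (assumed dict-like)
    if "optimizer" not in training_state:
        training_state["optimizer"] = torch.optim.Adam(model.parameters(), lr=1e-3)
    opt = training_state["optimizer"]

    # assumed data keys; the exact TensorDict access pattern is an assumption
    inputs = extern_data["data"]
    targets = extern_data["classes"]

    logits = model(inputs)
    loss = torch.nn.functional.cross_entropy(logits, targets)

    # unlike train_step from #1120, this hook also owns backward and the optimizer step
    opt.zero_grad()
    loss.backward()
    opt.step()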

This is just a very initial suggestion. Maybe you have a better idea. But many other people seem to like the way you can customize things in PT Lightning, so if we want to do things differently, we should have good reasons. Or, if we do not want to think too much about it, we could just follow PT Lightning more or less.

@albertz albertz deleted the nick-custom-engine branch April 5, 2023 12:10
@JackTemaki
Collaborator Author

> This is just a very initial suggestion. Maybe you have a better idea. But many other people seem to like the way you can customize things in PT Lightning, so if we want to do things differently, we should have good reasons.

At first glance, I do not really see how this is different from providing a custom engine. That would also only be:

from returnn.torch.engine import Engine


class CustomEngine(Engine):

    def train_step(self, *args, **kwargs):
        # custom train step stuff
        ...

What parts can be modified is then of course an Engine-API question, but I do not see how the logic of custom hooks is really different from a custom engine.

@albertz
Member

albertz commented Apr 5, 2023

The difference is the API.

In one case, you really only allow defining custom_training_step, or other well-defined functions. And the state is all well defined (training_state in the example). The user cannot really mess around with other internal parts of the RETURNN engine; the API would not allow this. You can easily support the same API in a future RETURNN version, even when refactoring everything internally.

In the other case, the user can very easily touch anything in the Engine, and it is very easy for the user's code to end up depending on some internal implementation detail of Engine. So when some internal refactoring is done on the RETURNN side, it is very likely to break some user's code. It is not really possible to guarantee that the API stays compatible in future RETURNN versions, because all the internal details are now part of the API.

In terms of functionality for the user, there is no real difference. But in terms of maintainability and future compatibility, it is a really big difference. I would prefer to have a well-defined API.
