Don't decorate acquisition function call methods with tf.function #722

Closed
uri-granta opened this issue Mar 31, 2023 · 1 comment · Fixed by #743 or #718
Labels
bug Something isn't working

Comments

@uri-granta
Collaborator

Trieste's current approach to avoiding retracing is to decorate the __call__ method of acquisition function classes with tf.function. However, this has two negative side effects:

  1. It results in early compilation, which restricts the use of lazily initialised variables.
  2. It keeps compiled graphs in a global cache that accumulates over time (e.g. during test runs).

To fix this we should move the use of tf.function further out. Note that this will probably be a significant amount of work.
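
To make the trade-off concrete, here is a minimal sketch of the two approaches. The class names, the stub model and its `predict` signature are illustrative only, not Trieste's actual API:

```python
import tensorflow as tf

class StubModel:
    """Illustrative stand-in for a probabilistic model."""

    def predict(self, x):
        mean = tf.reduce_sum(x, axis=-1, keepdims=True)
        var = tf.ones_like(mean)
        return mean, var

class NegativeMean:
    """Current approach: the decision to compile is baked into the class."""

    def __init__(self, model):
        self._model = model

    @tf.function  # traced on first call; graphs live in this Function's cache
    def __call__(self, x):
        mean, _ = self._model.predict(x)
        return -mean

class NegativeMeanEager:
    """Proposed approach: leave __call__ eager and compile further out."""

    def __init__(self, model):
        self._model = model

    def __call__(self, x):
        mean, _ = self._model.predict(x)
        return -mean

model = StubModel()
# the caller (e.g. the acquisition optimiser) decides what to compile and
# controls the lifetime of the resulting graphs
acquisition = tf.function(NegativeMeanEager(model))
print(acquisition(tf.random.uniform([5, 2])))
```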

Following this, we could also convert some variables from dynamically shaped to lazily initialised (e.g. self._eps).
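
For the lazy initialisation point, the standard tf.function pattern is to create the variable on the first trace only, when input shapes are known. The Sampler class below is illustrative, not Trieste's sampler:

```python
import tensorflow as tf

class Sampler:
    """Illustrative sampler with a lazily initialised epsilon variable."""

    def __init__(self):
        self._eps = None  # created on the first trace, when shapes are known

    @tf.function
    def __call__(self, x):
        if self._eps is None:
            # tf.function only allows variable creation on the first trace,
            # which is exactly when this branch runs
            self._eps = tf.Variable(
                tf.random.normal([x.shape[-1]]), trainable=False
            )
        return x + self._eps

sampler = Sampler()
print(sampler(tf.zeros([3, 2])))  # first call: traces and creates self._eps
print(sampler(tf.ones([3, 2])))   # later calls reuse the same variable
```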

uri-granta added the bug label on Mar 31, 2023
@uri-granta
Collaborator Author

So it turns out that both of the assumptions in this ticket were incorrect!

  1. It is possible to use lazily initialised variables even with the current approach (see Make IndependentReparametrizationSampler more XLA friendly #718, though also Fix Batch Reparam Sampler support for varying batch sizes #748).
  2. The graphs are stored on the instances, not on the class method, so they are ostensibly freed when the instance is; a rough check of this is sketched after this list. However, it also turns out that tensorflow leaks some memory regardless of where the graphs are stored, meaning that runs with many compilations (such as test runs) inevitably leak memory. A workaround is to split the test job into multiple runs (see Split test run into three #743 and Parallelise CI tests #749).
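
A rough way to check the per-instance storage claim, using weakref (behaviour observed with recent tensorflow versions; the Acq class is illustrative only):

```python
import gc
import weakref
import tensorflow as tf

class Acq:
    @tf.function
    def __call__(self, x):
        return x + 1.0

a = Acq()
a(tf.constant(1.0))   # traces a graph associated with this instance
ref = weakref.ref(a)
del a
gc.collect()
# the instance, and with it its per-instance traced graphs, is collectable,
# though tensorflow may still retain some memory internally
print(ref() is None)  # expected: True
```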

Given that the official tensorflow documentation includes examples like the ones in trieste, I recommend closing this issue.
