
Classification Model as Output Constraint #725

Open
jduerholt opened this issue Mar 1, 2021 · 27 comments
Labels
enhancement New feature or request

Comments

@jduerholt
Contributor

Dear botorch developers,

I have a question regarding output constraints. So far they are used and implemented in the following way:

  • There is a property which should be larger than a user-provided threshold.
  • A GP regression model is built to model the property. The output of the regression model is used as input to a sigmoid function that is chosen so that it is 1 above the threshold.
  • The actual optimization target is then multiplied by the output of the sigmoid function, and this product forms the actual objective function.
  • The combination of regression model and sigmoid can be interpreted as a classification model --> related to logistic regression.
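The mechanism in these bullets can be sketched in a few lines of pure Python (the `eta` sharpness parameter, threshold, and example values are illustrative, not botorch's actual API):

```python
import math

def sigmoid(z: float) -> float:
    """Numerically stable logistic sigmoid."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    return math.exp(z) / (1.0 + math.exp(z))

def soft_feasibility(g_value: float, threshold: float, eta: float = 0.1) -> float:
    """Approaches 1 when the modeled property g(x) exceeds the threshold
    and 0 below it; eta controls the sharpness of the transition."""
    return sigmoid((g_value - threshold) / eta)

def constrained_objective(f_value: float, g_value: float, threshold: float) -> float:
    # The raw objective is multiplied by the soft feasibility indicator.
    return f_value * soft_feasibility(g_value, threshold)
```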

Now comes my question:

How does one set up an optimization in which a "real" classification model is needed? Imagine a chemical experiment in which only certain input combinations lead to "successful" experiments. Successful in this context means that it is possible to measure the properties that should be optimized. For the failed experiments, no outputs are available, and within the BO loop one wants to propose only experiments that lead to successful experiments.

I found one paper in this context. They are using an in-the-loop classifier to judge whether an experiment will be successful or not. I think there should be better solutions using the output-constraint technique described above. One could, for example, directly train a classification model using gpytorch and multiply its output with the actual objective.

What are your thoughts on this and have you ever tried something related within botorch?

Best,

Johannes

@Balandat
Contributor

Balandat commented Mar 1, 2021

Hi @jduerholt, great question.

Using a probit classification model with a Bernoulli likelihood for actual classification should work in this context. You're right that since such a model maps the latent function directly to a probability of success, one should be able to use the same approach as with the existing constrained formulation, but using this success probability directly (here it is: https://github.com/cornellius-gp/gpytorch/blob/master/gpytorch/likelihoods/bernoulli_likelihood.py#L26).

We could probably just use the vanilla GP model from https://github.com/cornellius-gp/gpytorch/blob/master/examples/04_Variational_and_Approximate_GPs/Non_Gaussian_Likelihoods.ipynb to prototype something (note that this uses variational inference). The main reason we haven't tried this is that we so far haven't run into any concrete use cases in our work, but it would be interesting to explore.
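The probit link underlying the Bernoulli likelihood mentioned above maps the latent GP value $f(x)$ to a success probability via the standard normal CDF. A minimal pure-Python sketch of that mapping (not the gpytorch implementation itself):

```python
import math

def probit(f_latent: float) -> float:
    """P(success) = Phi(f(x)): the standard normal CDF of the latent value,
    i.e. the mean mapping of a Bernoulli likelihood with a probit link."""
    return 0.5 * (1.0 + math.erf(f_latent / math.sqrt(2.0)))
```

A latent value of 0 corresponds to a 50% success probability; large positive or negative latent values saturate toward 1 or 0.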

I recall reading a paper that did this within the last year or two, but I can't find it right now.

An interesting question is whether one would want to use a multi-task model for modeling the latent constraint function and the objective(s), the rationale being that performance and feasibility may be quite correlated. But that's a refinement and not necessary as a first step.

@jduerholt
Contributor Author

Hi @Balandat,

thanks for the quick response. A prototype based on a single-task model would be sufficient for a start; if this works properly, one could extend it to a multi-task model, which could boost performance further.

Would you do the prototyping, or is it too far from the typical Facebook use cases?

Best,

Johannes

@Balandat
Contributor

Balandat commented Mar 4, 2021

I can throw something together in a notebook on some toy problem, kind of curious myself how this works. Might take a bit to get this cleaned up and make it conveniently usable though.

@Balandat
Contributor

Balandat commented Mar 4, 2021

Made some initial strides. Making a BoTorch model based on the simple classification GP is easy; hooking things up into the model API will require a bit of thinking, since it wasn't initially designed for classification models. We can probably (i) implement a Posterior that generically represents probability values, and then (ii) make some abstraction that makes it easy to run BO for some outcome model (with standard posterior) subject to one or more constraint models (with the new probability posterior). This should at least cover the case of modeling the outcome and the feasibility independently.

@jduerholt
Contributor Author

Hi @Balandat,

this sounds like a plan. Modeling outcome and feasibility independently looks like the key to me.

Another possibility for certain cases could be to take the X values in the objective and run them through a simple classifier (like logistic regression) at that point. But this somewhat violates the way botorch is designed.

Thanks for your efforts. If we get it running properly, this could from my perspective also be something publishable, since I have not found much on this topic.

Best,

Johannes

@Balandat
Contributor

Balandat commented Mar 4, 2021

Wondering if @bletham or @mshvartsman have any thoughts here, as they were working on similar/related things.

@bletham
Contributor

bletham commented Mar 5, 2021

Yeah, ultimately in constrained EI the constraint shows up in the acquisition function as Probability(x is feasible). In the usual case, where constraints are continuous-valued and observed, this probability can be computed from a GP, but it could just as easily be a probability from a logistic regression. The challenge is just differentiability and being able to backprop through that Prob(feasible).

My only comment would be that we've done something along these lines with a GP classifier, in which a GP is used to model the latent probability and then it gets a Bernoulli likelihood on top. And that doesn't work well in a lot of practical settings, because the GP will assume smoothness in the latent fail/success probabilities, and often in real problems there are hard edges where a small change in parameters can take it from failing every time to succeeding every time. So something like an SVM or random forest classifier tends to work better for those types of problems. But then you can't backprop inside BoTorch.
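The constrained-EI weighting described above can be sketched analytically (a maximization setting; `p_feasible` can come from any probabilistic classifier, which is why only its differentiability matters):

```python
import math

def normal_pdf(z: float) -> float:
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def normal_cdf(z: float) -> float:
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def expected_improvement(mu: float, sigma: float, best_f: float) -> float:
    """Analytic EI for a Gaussian posterior N(mu, sigma^2)."""
    if sigma <= 0.0:
        return max(mu - best_f, 0.0)
    z = (mu - best_f) / sigma
    return (mu - best_f) * normal_cdf(z) + sigma * normal_pdf(z)

def constrained_ei(mu: float, sigma: float, best_f: float, p_feasible: float) -> float:
    # Constrained EI: EI(x) weighted by Prob(x is feasible),
    # however that probability is modeled (GP, logistic regression, ...).
    return expected_improvement(mu, sigma, best_f) * p_feasible
```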

@Balandat
Contributor

Balandat commented Mar 5, 2021

@jduerholt, curious if you have any intuition as to whether in your case there is reason to expect / not expect smoothness in the probability of feasibility.

@jduerholt
Contributor Author

Thanks for the comments @bletham. @Balandat, I will take some time next week to investigate which type of classifier fits the problem best and whether one can expect smoothness. When I have the answer, I will come back to you and we can then discuss the best solution.

Have a nice weekend!

Best,

Johannes

@mshvartsman
Contributor

I probably don't have as much experience as @bletham with constrained BO of this flavor, so take all of this with a grain of salt, but:

  1. You don't have to use an RBF kernel, right? Something like Matern might be more reasonable if you're not expecting as much smoothness.
  2. I think that while formally you may have smoothness in f(x), with some creativity with the outputscale prior you may be able to get pretty sharp cutpoints numerically on the probability phi(f(x)). In the limit where the scale on f(x) is very large, phi(f(x)) will always be 0 or 1. You could also try other transformations before the probit to really jam the probability towards 0 and 1 faster if you think that's necessary.

I'd probably try getting the differentiable GP classifier to work first even if you have doubts about smoothness, before going down the harder path of non-differentiable constraint modeling.
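Point 2 above can be checked numerically (assuming a probit link $\Phi$; the latent value 0.5 and the scales are arbitrary illustrative choices):

```python
import math

def phi(z: float) -> float:
    """Standard normal CDF (the probit link)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# A smooth latent f still produces a near-hard cutpoint once the
# outputscale is large: phi(scale * f) saturates toward 0 or 1.
moderate = phi(1.0 * 0.5)    # soft probability
extreme = phi(100.0 * 0.5)   # effectively 1
```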

@bletham
Contributor

bletham commented Mar 8, 2021

The issue we ran into when trying to do something very similar to this was that a sharp transition from p=0 to p=1 (always fails to always succeeds) forced the GP to fit a very short lengthscale, which then meant it couldn't generalize. But I agree that there are GP formulations (perhaps with a non-stationary kernel) that probably could have worked.

@mshvartsman
Contributor

Good point! The setting where we've seen reasonable behavior wasn't stationary, strictly speaking. Non-stationary kernel or input warping seems like it might be useful if one were to try this.

@Balandat
Contributor

Balandat commented Dec 8, 2021

Adding @dme65 and @jtwilson who have been making strides towards this recently.

@JoseValenR

Hi all, I was also looking for handling experiment failures and came across this issue.
Is there any plan for implementing it?

@Balandat
Contributor

Yeah, we're working on implementing this with a new probabilistic classification model. I don't have a timeline for open-sourcing this yet, but hopefully we'll have something preliminary ready in a couple of months or so.

If you need something quicker it may make sense to try to hook something like this up: https://github.com/cornellius-gp/gpytorch/blob/master/examples/04_Variational_and_Approximate_GPs/PolyaGamma_Binary_Classification.ipynb

@JoseValenR

Looking forward to it... in the meantime I'll look into the example provided.

@jduerholt
Contributor Author

Same here, looking forward to it. This will be very useful for some use cases ;)

@esantorella esantorella added the enhancement New feature or request label Jan 30, 2023
@jduerholt
Contributor Author

Coming back to this issue after a long time.

Let's assume that we have an arbitrary differentiable classification model $g$, which returns, as a function of $x$, a vector of class probabilities (which sum to 1). Furthermore, we have a bit vector of the same length which encodes the desirable classes for the constrained optimization.

An example: $g(x) = [0.8, 0.1, 0.1]$, and the desirability vector is given by $d = [1, 0, 0]$, which means that the only desirable outcome is class 1. Then the overall desirable probability is given by the dot product of the two vectors, $p = g(x) d^T$. We want to use this probability directly as an outcome constraint in an optimization. To do so, one has to somehow invert the sigmoid operation; this can be achieved by pushing $p$ through $\log(1/p - 1)$. When the sigmoid is applied to this term, it results in an identity operation for $0 < p < 1$. For the edge cases of $p$ approaching 0 or 1, one needs to introduce some numeric offset, which is neglected here.

From my perspective, this is a lot of overhead for an identity operation, as we directly want to use the probability $p$ for the output constraint. So my question is: do you have an idea how to do this more elegantly, or how best to implement this identity constraint handling directly in botorch?
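A pure-Python sketch of the construction above (`eps` plays the role of the numeric offset mentioned for the edge cases; the sign convention assumes the downstream weight is `sigmoid(-c)`, so the round trip recovers $p$):

```python
import math

def desired_probability(class_probs, desirability):
    """p = g(x) . d: probability mass on the desirable classes."""
    return sum(g * d for g, d in zip(class_probs, desirability))

def inverse_sigmoid_constraint(p: float, eps: float = 1e-6) -> float:
    """Map a probability p to a constraint value c = log(1/p - 1),
    clipping p away from 0 and 1 for numerical stability."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(1.0 / p - 1.0)

def sigmoid(z: float) -> float:
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    return math.exp(z) / (1.0 + math.exp(z))

# Round trip: sigmoid(-c) == p, i.e. an identity operation overall.
p = desired_probability([0.8, 0.1, 0.1], [1, 0, 0])
recovered = sigmoid(-inverse_sigmoid_constraint(p))
```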

cc @gmancino

@jduerholt
Contributor Author

@Balandat: any thoughts on this?

@Balandat
Contributor

Sorry for the delay, busy times...

We want to use this probability directly as an outcome constraint in an optimization. To do so, one has to somehow invert the sigmoid operation

Hmm I am not sure I understand this part. Why do you need to invert the sigmoid operation?

@jduerholt
Contributor Author

Hmm, perhaps this was unclear. Let me try again ;)

The desired probability $p$ should be used directly as the output constraint: when it is 0, the constraint is not fulfilled, and when it is 1, it is fulfilled. However, botorch always puts a sigmoid operation on top of an output constraint. That is totally fine for regression, as the sigmoid transforms the regression problem into a kind of classification problem (as in logistic regression) which separates the regions where the constraint is fulfilled from those where it is not.

In our case, I already have this probability and do not need any sigmoid operation. For this reason, one can transform the desired probability $p$ as shown above before applying the sigmoid, which then results in an identity operation, but with extra mathematical operations and at the risk of numerical instabilities.

Hope, this makes it clearer ...

@Balandat
Contributor

I think so? I guess my first idea here would be to just extract the latent (un-sigmoided) model prediction and pass that in as the outcome together with the transformed bound. One way of doing that could be to just wrap the model. I'm not sure exactly what classification model you're using; if you have a pointer, that would be helpful.

@FrankWanger
Contributor

Hey all, I am quite interested in this topic as I am working on wet-lab experimentation. Following your idea, I've managed to draft something using GPyTorch's approximate GP, extracting the latent prediction and converting it back to a probability via the sigmoid transformation. After that I just feed this probability to the constraint argument of the acqf.

I wanted to contribute to the BoTorch community, but this is a first for me. I was wondering if I can send a pull request to add a tutorial on this topic, following the style of the existing ones? (Or should I email the moderators first?) Thank you :)

@saitcakmak
Contributor

Hi @FrankWanger. Thanks for your interest. Feel free to submit a PR adding this under notebooks_community.

@jduerholt
Contributor Author

@FrankWanger: maybe this paper from ICML24 could also be of interest for you: https://arxiv.org/abs/2402.07692.

@FrankWanger
Contributor

@FrankWanger: maybe this paper from ICML24 could be also of interest for you; https://arxiv.org/abs/2402.07692.

Fab! Will add ref to it. Thanks ;)

@jduerholt
Contributor Author

We also implemented the following solution to be able to directly use the probabilities of the individual classes (after a softmax) in a sigmoid-based output constraint: we implement a function which, when the sigmoid transform is applied to it, yields the probabilities again. As a result, the probabilities are effectively multiplied by an identity operation within the botorch output constraint. Have a look here: https://github.com/experimental-design/bofire/blob/4d6d6c6fb3b24b040dda5f4884b3f1e84b5cf76c/bofire/utils/torch_tools.py#L302 and here https://github.com/experimental-design/bofire/blob/4d6d6c6fb3b24b040dda5f4884b3f1e84b5cf76c/tests/bofire/utils/test_torch_tools.py#L1084

facebook-github-bot pushed a commit that referenced this issue Jan 28, 2025
…model (#2700)

Summary:
## Motivation

There is currently no tutorial on using the probability obtained from a classification model as a constraint in acquisition functions, and such an application holds strong interest for BO-guided laboratory experimentation. A prior discussion (#725) covers this.

### Have you read the [Contributing Guidelines on pull requests](https://github.com/pytorch/botorch/blob/main/CONTRIBUTING.md#pull-requests)?

Yes

Pull Request resolved: #2700

Test Plan:
In the present tutorial we show how to deal with feasibility constraints that are observed alongside the optimization process (referred to as 'outcome constraints' in the BoTorch documentation, or sometimes as 'black-box constraints'). More specifically, the feasibility is modeled by a classification model, and the learned probability is fed to the acquisition function through the `constraint` argument of `SampleReducingMCAcquisitionFunction`. Namely, this is achieved by re-weighting the acquisition function as $\alpha_{\text{acqf-con}}=\mathbb{P}(\text{Constraint satisfied}) \cdot \alpha_{\text{acqf}}$. To fit the API (which treats negative values as feasible), the predicted probability from the classification model is passed through an inverse sigmoid and negated.

A 2D synthetic problem, the [Townsend function](https://www.chebfun.org/examples/opt/ConstrainedOptimization.html), was used. For the classification model, we implemented an [approximate GP](https://docs.gpytorch.ai/en/v1.13/examples/04_Variational_and_Approximate_GPs/Non_Gaussian_Likelihoods.html) with a Bernoulli likelihood. `qLogExpectedImprovement` was selected as the acquisition function.

Below are the plots of the problem landscape, acquisition function value, constraint probability, and the EI value (before weighting) at different iterations:

At iter=1:
![image](https://github.com/user-attachments/assets/de4118dd-9142-438c-acf8-6afcd5459543)

At iter=10:
![image](https://github.com/user-attachments/assets/b7abb75d-945f-47e9-8364-4064c8236e4b)

At iter=50:
![image](https://github.com/user-attachments/assets/5a147ded-cedf-47aa-9e17-4444c22a0165)

The log regret after 50 iterations is plotted against random (Sobol) sampling.
![image](https://github.com/user-attachments/assets/73e6435a-d19d-48f8-852c-69c9f5e1daa3)

All images can be reproduced by the notebook.

## Related PRs

not related to any change of functionality

Reviewed By: saitcakmak

Differential Revision: D68752722

Pulled By: Balandat

fbshipit-source-id: 233d5d09ee4b3fe200fe416c90999090bde8e5c0