Support transformed RVs & track expressions #94

Closed
trappmartin opened this issue May 2, 2020 · 20 comments · Fixed by #594
Labels
enhancement New feature or request

Comments

@trappmartin
Member

This is a summary issue of something floating around for quite some time.

We would like to be able to do this:

@model test(x) = begin
    m1 ~ Normal()
    @track m2 = f(m1) # some differentiable transformation
    x ~ Normal(m2, 1.0)
end

The resulting chain should then contain states for both m1 and m2.

Comments by @mohamed82008 on how to approach this:

  • Define a Deterministic distribution (a sketch follows this list).
  • Make sure getspace doesn't return any variable symbol whose dists field isa Vector{<:Deterministic}. This is the trickiest part.
  • Define logpdf for Deterministic to return 0, rand to return the wrapped value, and bijector to return Identity.
  • Define assume for Deterministic such that it writes the value wrapped on the RHS to vi[vn], calling ForwardDiff.value, ReverseDiff.value, or Tracker.data on the wrapped value before writing it. We should probably define DistributionsAD.value to unify these.
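
For concreteness, a minimal sketch of what such a Deterministic type could look like, assuming the Distributions.jl and Bijectors.jl interfaces; the type, the restriction to real values, and the exact identity-bijector object are all illustrative, not a settled design:

using Distributions, Bijectors
using Random: AbstractRNG

# Hypothetical wrapper for an already-computed transformed value.
struct Deterministic{T<:Real} <: ContinuousUnivariateDistribution
    value::T
end

# Sampling simply returns the wrapped value.
Base.rand(rng::AbstractRNG, d::Deterministic) = d.value

# A deterministic node contributes nothing to the joint log density.
Distributions.logpdf(d::Deterministic, x::Real) = zero(x)

# No change of variables is needed in unconstrained space; the exact
# identity-bijector object depends on the Bijectors.jl version.
Bijectors.bijector(d::Deterministic) = identity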
@trappmartin trappmartin added the enhancement New feature or request label May 2, 2020
@devmotion
Member

Even if that might be a way to solve the problem, I'm not sure that's how we should actually do it (sorry 😛). Adding a deterministic distribution and an identity bijector for a transformation seems like a bit of a hack to me (and the AD part seems to be focused on HMC?).

It seems the main purpose of @track would be to ensure that m2 would be part of the chain? Or is there anything else that we have to take care of?

If it's (mainly) about the interesting variables ending up in the chain, maybe

@model function test(x)
    m1 ~ Normal()
    m2 = f(m1) # some differentiable transformation
    x ~ Normal(m2, 1.0)
    return m1, m2
end

and respecting the return statement would be the more natural approach.

@trappmartin
Member Author

Cool idea to have the return statement determine what is stored "additionally" in the chain.

@trappmartin
Member Author

trappmartin commented May 2, 2020

That said, I find the use of an @track macro more "user-friendly". But that's rather subjective. Also, @track could be used to inform a sampler that performs automatic marginalisation that a certain RV should not be marginalised out because we want to track it.

@trappmartin
Member Author

What do you guys think about using the return statement (as described by @devmotion) in general to define the set of variables that should be tracked? This idea is kinda growing on me. :)

@model function gmm(x, K)
    m ~ filldist(Normal(), K) 
    v ~ filldist(InverseGamma(2, 3), K)
    s = sqrt.(v)
    w ~ Dirichlet(K, 1.0)
    z ~ filldist(Categorical(w), length(x))
    x .~ Normal.(m[z], s[z])
    return m, s, w
end

Then the resulting chain only has states for m, s, and w but not for z or v, and we can safely marginalise out z during inference.

@torfjelde
Member

So just a heads up: I've already implemented a generated_quantities function for "extracting" return values from a chain by re-running the model on the values in the chain: https://github.com/cambridge-mlg/Covid19/blob/4d967d58eeada3f3399116fc815cb17fd4163c87/src/utils.jl#L79-L89.

In this case you could then just do:

@model function gmm(x, K)
    m ~ filldist(Normal(), K) 
    v ~ filldist(InverseGamma(2, 3), K)
    s = sqrt.(v)
    w ~ Dirichlet(K, 1.0)
    z ~ filldist(Categorical(w), length(x))
    x .~ Normal.(m[z], s[z])
    return m, s, w
end

m = gmm(x, K)
chain = sample(m, sampler, n)

return_values = generated_quantities(m, chain)

This is a fairly good temporary fix for this issue, and it has additional use cases:

  1. Prediction: model_train is instantiated on training data, but you want to generate the return values for model_test, which is instantiated on test data; you can execute model_test using the chain (posterior) obtained from model_train.
  2. If you for some reason wanted to compute the "generated values" for model1 using a chain obtained from model2, you could do that, assuming they have the same latent parameters.

Not suggesting this as a solution to the problem, but in many cases it can be a good fix, and it provides some additional functionality.

For completeness:

function generated_quantities(m::Turing.Model, c::MCMCChains.Chains)
    # If `c` contains multiple chains, pool them into a single chain first.
    chain = length(MCMCChains.chains(c)) == 1 ? c : MCMCChains.pool_chain(c)

    varinfo = Turing.DynamicPPL.VarInfo(m)

    # Re-execute the model once per sample, with the sampled values written
    # into the VarInfo, and collect the model's return values.
    return map(1:length(chain)) do i
        Turing.DynamicPPL._setval!(varinfo, chain[i])
        m(varinfo)
    end
end

@devmotion
Member

Nice!

IMO the two approaches address different problems (both of which we should solve in some way). The return-based approach (or some other similar model-based approach) tries to avoid saving auxiliary variables completely; this structure would allow samplers to marginalize out variables, which is otherwise not possible if all variables always have to be returned. Your implementation addresses the problem of rerunning the model with different (or the same) variables, which is interesting in itself and provides the tracked expressions if they are returned in the model definition (so you need the same or a similar model definition).

Additionally, I think we should just have some convenience function for running models with a NamedTuple instead of a VarInfo, instead of letting users rewrite the same hacks over and over again. IMO it would be good to use NamedTuples in generated_quantities as well, since otherwise reordering of the columns will break the function, which would not be clear to a user.
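
Such a convenience function might look roughly like this (a sketch only: evaluate_with and the setval! signature used here are assumptions, not an existing DynamicPPL API):

# Hypothetical helper: run a model with parameter values supplied as a
# NamedTuple instead of a raw VarInfo. All names are illustrative.
function evaluate_with(m::Turing.Model, params::NamedTuple)
    varinfo = Turing.DynamicPPL.VarInfo(m)
    # Assumed to write the named values into the VarInfo by symbol.
    Turing.DynamicPPL.setval!(varinfo, values(params), keys(params))
    return m(varinfo)
end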

@torfjelde
Member

Additionally, I think we should just have some convenience function for running models with NamedTuple instead of VarInfo, instead of letting users rewrite the same hacks over and over again.

👍

IMO it would be good to use NamedTuples in generated_quantities as well since otherwise reordering of the columns will break the function, which would not be clear to a user.

Hooold up; what do you mean when you say "reordering of the columns"? Do you mean that generated_quantities will break if the ordering of the resulting vector in the model is wrong? If that's what you mean, _setval! internally makes use of the symbols, so everything should work just fine even then, right?
With that being said, I 100% support allowing the use of a NamedTuple instead of having to deal with VarInfo 👍

@devmotion
Member

You're absolutely right - I assumed _setval! just operates on vectors, I guess due to the new indexing behaviour in MCMCChains#dev (IIRC chains[i] just returns an array (with labelled axes)).

@ElOceanografo

+1 for using the return statement to define which variables to track. I think it's clean and intuitive, and the ability to marginalize out variables without storing them in a chain first would be really great.

@cpfiffer
Member

Something I've always wondered about this is why we don't just track these generated quantities in VarInfo? It seems like it would be reasonably easy to build a VarName from the returned quantities and just stick them in some field somewhere.

@trappmartin
Member Author

I think using the return statement is indeed the best idea. I'm not sure we need to store transformed variables in VarInfo at all, tbh. I'd rather have it only in the resulting trace and keep VarInfo clean.

@torfjelde
Member

Ref: TuringLang/Turing.jl#1335

Thanks to the most recent updates to DPPL (in particular @devmotion's PR with the new implementation of setval!), we have a good approach to extracting "generated quantities" in a post-processing step, using (an updated version of) the generated_quantities method I mentioned above. One issue: at the moment it relies on MCMCChains.pool_chains to handle multiple chains. If we remove that, we could just add generated_quantities to DPPL. I honestly think it's super useful, and it would partially solve a lot of the use cases for "deterministic tracking" that people are requesting. In most cases executing the model is cheap, so re-running the model on a chain as a post-processing step won't be an issue. It also comes with a couple of nice advantages:

  1. No extra memory usage when sampling.
  2. Sample from model1 and then execute some other model, say model2 (with the only restriction being that it contains the same parameters in its parameter space), to get the generated quantities of model2 under the posterior of model1. (See the usage sketch after this list.)
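
For example (a usage sketch; model1 and model2 are hypothetical models sharing the same latent parameters):

chain = sample(model1, sampler, n)
# Return values of `model2` evaluated under the posterior samples of `model1`.
quantities = generated_quantities(model2, chain)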

Thoughts?

@trappmartin
Member Author

trappmartin commented Aug 28, 2020

This is essentially a post-processing step, right? I think this is a nice approach, but a more integrated solution that uses the return statement during inference would be great too, as it could help in figuring out possible ways to collapse out parameters.

Maybe we can have your solution implemented for now and add the functionality I proposed later on. This way one can make use of both: 1) possibly collapsed sampling, and 2) obtaining transformed variables after sampling.

What do you think?

@yebai
Member

yebai commented Aug 28, 2020

In most cases executing the model is cheap → re-running the model on a chain as a post-processing step won't be an issue.

Re-running the model on a big dataset could be expensive.

I think what @torfjelde proposes can be considered a separate feature. It is helpful to have a return statement that can track arbitrary user-provided variables during inference. On top of that, we can provide a generated_quantities function that takes a chain and a model (which can be a different model from the one the chain was obtained from) and produces some post-inference analysis.

@cpfiffer
Member

I wonder if we can just do away with generated_quantities entirely. I'd prefer to handle everything at sampling time rather than ex post, which seems really clumsy to me. For example, couldn't the model compiler convert something like

@model function f()
    m ~ Normal(0, 1)
    z = m + 1
    return z
end

to something where the return value is a NamedTuple as in

@model function f()
    m ~ Normal(0, 1)
    z = m + 1
    return (z=z,)
end

Then, when we run the model through Turing, we just bundle up the NamedTuple we got with the Transition. We already have the machinery to do this here. All we need is a reasonable way to return the generated quantities with their symbols to the model caller in a type-stable way that doesn't distort the model internals that much.
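
As a rough illustration of the bundling idea (the struct and its fields are hypothetical, not Turing's actual Transition internals):

# Hypothetical transition type carrying the model's returned NamedTuple
# alongside the sampled parameters; illustrative only.
struct TransitionWithQuantities{P,Q<:NamedTuple}
    params::P      # sampled parameter values
    lp::Float64    # log joint density at `params`
    quantities::Q  # the model's return value, e.g. (z = 1.5,)
end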

@trappmartin
Member Author

I also thought about Tor’s approach as additional functionality. It would also be nice to have a similar function to obtain the log-prob values of each datum at each iteration after inference.

@torfjelde
Member

torfjelde commented Aug 29, 2020

I 100% agree that it would be nice to have a more integrated approach where we keep track of the quantities during sampling, but I also 100% disagree that this means generated_quantities is not useful.

  1. Memory constraints: the "tracked" quantities might require so much memory that it's infeasible to keep track of them during sampling. The user can also add a kwarg full::Bool = false to the model definition if they want to exclude certain computations until they run the post-processing step (sketched after this list).
  2. Different models under the posterior: say you've obtained a chain from model1, giving you an estimate of the posterior. You might then want to compute the generated quantities (i.e. the stuff you've put in the return statement) for some other model using the chain you just obtained. This other model, model2, could
    1. have the same model-def but different inputs,
    2. be a completely different model but with the same latent parameters, or
    3. be a new definition of the same model, e.g. the user realized they also wanted to look at some other values under the posterior, so they re-define the model-def of model1 and want to execute the new instantiation on the chain.
  3. Speed: executing the model is likely not the most expensive part (in the case where one uses AD-based samplers), so all the cases listed in (2) are going to be muuuch faster to do from an existing chain than running the entire sampling process again.
  4. Development cost: generated_quantities exists now, and this issue is something we keep getting asked about. And, because of the points listed above, it will still have a use once we get a "better" solution for tracking variables during sampling.
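
For point 1, such a kwarg might look like this (a sketch; expensive_summary is a hypothetical placeholder for computation deferred to the post-processing run):

@model function mymodel(x; full::Bool = false)
    m ~ Normal()
    x ~ Normal(m, 1.0)
    # Only compute the expensive tracked quantity when re-running the model.
    extras = full ? expensive_summary(m, x) : missing
    return (m = m, extras = extras)
end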

Now, if anyone still disagrees; fight me! I'll be near the fountain behind the gray building at 6pm. No brass knuckles allowed.

(I'm kidding; I've just always wanted to say that. Please continue the discussion in a civil way.)

EDIT: Also, in my original comment I should have emphasized that I did not mean my proposal as a replacement for tracking during sampling! My bad. Rather, it's an addition, and it provides a temporary "solution" to the problem in most basic use cases, which I think is important too, given the number of times people have brought this up in our Slack channel.

EDIT 2: The above is mainly a response to the:

I wonder if we can just do away entirely with generated_quantities.

@trappmartin
Member Author

Well, I think we all agree that it’s very useful to have, and I think we should provide it as functionality. We should simply not have it as the only solution. I think the best would be both, as pointed out earlier.

@cpfiffer
Member

I think those are all great points, particularly the idea about using different models.

The memory-constraints thing is not that big an issue, though, since I was also thinking of a keyword argument to exclude/include generated quantities during sampling. Also, there's not really that much data being thrown around in most common models, so we probably shouldn't run into memory issues. We could by default not store any return values, but give that option to people who want it.

Maybe the short-term solution here is to give sample a keyword generate_quantities=true which automatically does a pass to generate the quantities and sticks them in the returned chain at the end. This way the user doesn't have to explicitly call generated_quantities for the most basic use case. It'd add additional sampling time, but I think it would be a very nice feature to have.
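
In the meantime, that behaviour could be approximated with a small wrapper (a sketch; sample_with_quantities and the kwarg are hypothetical, not an existing Turing API):

# Hypothetical wrapper mimicking the proposed `generate_quantities` kwarg.
function sample_with_quantities(model, spl, n; generate_quantities = true)
    chain = sample(model, spl, n)
    generate_quantities || return (chain = chain, quantities = nothing)
    # One extra pass over the chain, re-executing the model per sample.
    return (chain = chain, quantities = generated_quantities(model, chain))
end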

@yebai
Member

yebai commented Feb 2, 2023

I like @devmotion's proposal above: a clean syntax for specifying tracked variables would be via the return statement:

@model test(x) = begin
    m1 ~ Normal()
    m2 = f(m1) # some differentiable transformation
    x ~ Normal(m2, 1.0)
    return m2 # store returned values in the `AbstractMCMC.bundle_samples` function 
end
