Support transformed RVs & track expressions #94

Closed
trappmartin opened this issue May 2, 2020 · 20 comments · Fixed by #594
Labels
enhancement New feature or request

Comments

@trappmartin
Member

This is a summary issue of something floating around for quite some time.

We would like to be able to do this:

@model test(x) = begin
    m1 ~ Normal()
    @track m2 = f(m1) # some differentiable transformation
    x ~ Normal(m2, 1.0)
end

The resulting chain should then contain states for both m1 and m2.

Comments by @mohamed82008 on how to approach this:

  • Define a Deterministic distribution (a sketch follows this list).
  • Make sure getspace doesn't return any variable symbol whose dists field isa Vector{<:Deterministic}. This is the trickiest part.
  • Define logpdf for Deterministic to return 0, rand to return the wrapped value, and bijector to return Identity.
  • Define assume for Deterministic such that it writes the value wrapped on the RHS to vi[vn], calling ForwardDiff.value, ReverseDiff.value, or Tracker.data on the wrapped value before writing it. We should probably define DistributionsAD.value to unify these.
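
For concreteness, a minimal sketch of what such a Deterministic type could look like, assuming the Distributions.jl and Bijectors.jl interfaces; the type, the restriction to real values, and the exact identity-bijector object are all illustrative, not a settled design:

using Distributions, Bijectors
using Random: AbstractRNG

# Hypothetical wrapper for an already-computed transformed value.
struct Deterministic{T<:Real} <: ContinuousUnivariateDistribution
    value::T
end

# Sampling simply returns the wrapped value.
Base.rand(rng::AbstractRNG, d::Deterministic) = d.value

# A deterministic node contributes nothing to the joint log density.
Distributions.logpdf(d::Deterministic, x::Real) = zero(x)

# No change of variables is needed in unconstrained space; the exact
# identity-bijector object depends on the Bijectors.jl version.
Bijectors.bijector(d::Deterministic) = identity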
@trappmartin trappmartin added the enhancement New feature or request label May 2, 2020
@devmotion
Member

Even if that might be a way to solve the problem, I'm not sure that's how we should actually do it (sorry 😛). Adding a deterministic distribution and an identity bijector for a transformation seems like a bit of a hack to me (and the AD part seems to be focused on HMC?).

It seems the main purpose of @track would be to ensure that m2 would be part of the chain? Or is there anything else that we have to take care of?

If it's (mainly) about the interesting variables ending up in the chain, maybe

@model function test(x)
    m1 ~ Normal()
    m2 = f(m1) # some differentiable transformation
    x ~ Normal(m2, 1.0)
    return m1, m2
end

and respecting the return statement would be the more natural approach.

@trappmartin
Member Author

Cool idea to have the return statement determine what is stored "additionally" in the chain.

@trappmartin
Member Author

trappmartin commented May 2, 2020

That said, I find the use of an @track macro more "user-friendly". But that's rather subjective. Also, @track could be used to inform a sampler that performs automatic marginalisation that a certain RV should not be marginalised out because we want to track it.

@trappmartin
Member Author

What do you guys think about using the return statement (as described by @devmotion) in general to define the set of variables that should be tracked? This idea is kinda growing on me. :)

@model function gmm(x, K)
    m ~ filldist(Normal(), K) 
    v ~ filldist(InverseGamma(2, 3), K)
    s = sqrt.(v)
    w ~ Dirichlet(K, 1.0)
    z ~ filldist(Categorical(w), length(x))
    x .~ Normal.(m[z], s[z])
    return m, s, w
end

Then the resulting chain only has states for m, s, and w but not for z or v, and we can safely marginalise out z during inference.

@torfjelde
Member

So just a heads up: I've already implemented a generated_quantities function for "extracting" return values from a chain by re-running the model on the values in the chain: https://github.com/cambridge-mlg/Covid19/blob/4d967d58eeada3f3399116fc815cb17fd4163c87/src/utils.jl#L79-L89.

In this case you could then just do:

@model function gmm(x, K)
    m ~ filldist(Normal(), K) 
    v ~ filldist(InverseGamma(2, 3), K)
    s = sqrt.(v)
    w ~ Dirichlet(K, 1.0)
    z ~ filldist(Categorical(w), length(x))
    x .~ Normal.(m[z], s[z])
    return m, s, w
end

m = gmm(x, K)
chain = sample(m, sampler, n)

return_values = generated_quantities(m, chain)

This is a fairly good temporary fix for this issue, and it has additional use cases:

  1. Prediction: model_train is instantiated on training data, but you want to generate the return values for model_test, which is instantiated on test data; you can execute model_test using the chain (posterior) obtained from model_train.
  2. If you for some reason wanted to compute the "generated values" for model1 using a chain obtained from model2, you could do that, assuming they have the same latent parameters.

Not suggesting this as a solution to the problem, but in many cases it can be a good fix, and it provides some additional functionality.

For completeness:

function generated_quantities(m::Turing.Model, c::MCMCChains.Chains)
    # If `c` contains multiple chains, pool them into a single chain first.
    chain = length(MCMCChains.chains(c)) == 1 ? c : MCMCChains.pool_chain(c)

    varinfo = Turing.DynamicPPL.VarInfo(m)

    # Re-execute the model once per sample, with the sampled values written
    # into the VarInfo, and collect the model's return values.
    return map(1:length(chain)) do i
        Turing.DynamicPPL._setval!(varinfo, chain[i])
        m(varinfo)
    end
end

@devmotion
Member

Nice!

IMO the two approaches address different problems (both of which we should solve in some way). The return-based approach (or some other similar model-based approach) tries to avoid saving auxiliary variables completely; this structure would allow samplers to marginalize out variables, which is otherwise not possible if all variables always have to be returned. Your implementation addresses the problem of rerunning the model with different (or the same) variables, which is interesting in itself and provides the tracked expressions if they are returned in the model definition (so you need the same or a similar model definition).

Additionally, I think we should just have some convenience function for running models with a NamedTuple instead of a VarInfo, instead of letting users rewrite the same hacks over and over again. IMO it would be good to use NamedTuples in generated_quantities as well, since otherwise reordering of the columns will break the function, which would not be clear to a user.
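
Such a convenience function might look roughly like this (a sketch only: evaluate_with and the setval! signature used here are assumptions, not an existing DynamicPPL API):

# Hypothetical helper: run a model with parameter values supplied as a
# NamedTuple instead of a raw VarInfo. All names are illustrative.
function evaluate_with(m::Turing.Model, params::NamedTuple)
    varinfo = Turing.DynamicPPL.VarInfo(m)
    # Assumed to write the named values into the VarInfo by symbol.
    Turing.DynamicPPL.setval!(varinfo, values(params), keys(params))
    return m(varinfo)
end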

@torfjelde
Member

Additionally, I think we should just have some convenience function for running models with NamedTuple instead of VarInfo, instead of letting users rewrite the same hacks over and over again.

👍

IMO it would be good to use NamedTuples in generated_quantities as well since otherwise reordering of the columns will break the function, which would not be clear to a user.

Hooold up; what do you mean when you say "reordering of the columns"? Do you mean that generated_quantities will break if the ordering of the resulting vector in the model is wrong? If that's what you mean, _setval! internally makes use of the symbols, so everything should work just fine even then, right?
With that being said, I 100% support allowing the use of a NamedTuple instead of having to deal with VarInfo 👍

@devmotion
Member

You're absolutely right - I assumed _setval! just operates on vectors, I guess due to the new indexing behaviour in MCMCChains#dev (IIRC chains[i] just returns an array (with labelled axes)).

@ElOceanografo

+1 for using the return statement to define which variables to track. I think it's clean and intuitive, and the ability to marginalize out variables without storing them in a chain first would be really great.

@cpfiffer
Member

Something I've always wondered about this is why we don't just track these generated quantities in VarInfo? It seems like it would be reasonably easy to build a VarName from the returned quantities and just stick them in some field somewhere.

@trappmartin
Member Author

I think using the return statement is indeed the best idea. I'm not sure we need to store transformed variables in VarInfo at all, tbh. I'd rather have it only in the resulting trace and keep VarInfo clean.

@torfjelde
Member

Ref: TuringLang/Turing.jl#1335

Thanks to the most recent updates to DPPL (in particular @devmotion's PR with the new implementation of setval!), we have a good approach to extracting "generated quantities" in a post-processing step, using (an updated version of) the generated_quantities method I mentioned above. One issue: at the moment it relies on MCMCChains.pool_chains to handle multiple chains. If we remove that, we could just add generated_quantities to DPPL. I honestly think it's super useful, and it would partially solve a lot of the use cases for "deterministic tracking" that people are requesting. In most cases executing the model is cheap, so re-running the model on a chain as a post-processing step won't be an issue. It also comes with a couple of nice advantages:

  1. No extra memory usage when sampling.
  2. Sample from model1 and then execute some other model, say model2 (with the only restriction being that it contains the same parameters in its parameter space), to get the generated quantities of model2 under the posterior of model1. (See the usage sketch after this list.)
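
For example (a usage sketch; model1 and model2 are hypothetical models sharing the same latent parameters):

chain = sample(model1, sampler, n)
# Return values of `model2` evaluated under the posterior samples of `model1`.
quantities = generated_quantities(model2, chain)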

Thoughts?

@trappmartin
Member Author

trappmartin commented Aug 28, 2020

This is essentially a post-processing step, right? I think this is a nice approach, but a more integrated solution that uses the return statement during inference would be great too, as it could help in figuring out possible ways to collapse out parameters.

Maybe we can have your solution implemented for now and add the functionality I proposed later on. This way one can make use of both: 1) possibly collapsed sampling, and 2) obtaining transformed variables after sampling.

What do you think?

@yebai
Member

yebai commented Aug 28, 2020

In most cases executing the model is cheap → re-running the model on a chain as a post-processing step won't be an issue.

Re-running the model on a big dataset could be expensive.

I think what @torfjelde proposes can be considered a separate feature. It is helpful to have a return statement that can track arbitrary user-provided variables during inference. On top of that, we can provide a generated_quantities function that takes a chain and a model (which can be a different model from the one the chain was obtained from) and produces some post-inference analysis.

@cpfiffer
Member

I wonder if we can just do away with generated_quantities entirely. I'd prefer to handle everything at sampling time rather than ex post, which seems really clumsy to me. For example, couldn't the model compiler convert something like

@model function f()
    m ~ Normal(0, 1)
    z = m + 1
    return z
end

to something where the return value is a NamedTuple as in

@model function f()
    m ~ Normal(0, 1)
    z = m + 1
    return (z=z,)
end

Then, when we run the model through Turing, we just bundle up the NamedTuple we got with the Transition. We already have the machinery to do this here. All we need is a reasonable way to return the generated quantities with their symbols to the model caller in a type-stable way that doesn't distort the model internals that much.
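
As a rough illustration of the bundling idea (the struct and its fields are hypothetical, not Turing's actual Transition internals):

# Hypothetical transition type carrying the model's returned NamedTuple
# alongside the sampled parameters; illustrative only.
struct TransitionWithQuantities{P,Q<:NamedTuple}
    params::P      # sampled parameter values
    lp::Float64    # log joint density at `params`
    quantities::Q  # the model's return value, e.g. (z = 1.5,)
end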

@trappmartin
Member Author

I also thought about Tor’s approach as additional functionality. It would also be nice to have a similar function to obtain the log-prob values of each datum at each iteration after inference.

@torfjelde
Member

torfjelde commented Aug 29, 2020

I 100% agree that it would be nice to have a more integrated approach where we keep track of the quantities during sampling, but I also 100% disagree that this means generated_quantities is not useful.

  1. Memory constraints: the "tracked" quantities might require so much memory that it's infeasible to keep track of them during sampling. The user can also add a kwarg full::Bool = false to the model definition if they want to exclude certain computations until they run the post-processing step (sketched after this list).
  2. Different models under the posterior: say you've obtained a chain from model1, giving you an estimate of the posterior. You might then want to compute the generated quantities (i.e. the stuff you've put in the return statement) for some other model using the chain you just obtained. This other model, model2, could
    1. have the same model-def but different inputs,
    2. be a completely different model but with the same latent parameters, or
    3. be a new definition of the same model, e.g. the user realized they also wanted to look at some other values under the posterior, so they re-define the model-def of model1 and want to execute the new instantiation on the chain.
  3. Speed: executing the model is likely not the most expensive part (in the case where one uses AD-based samplers), so all the cases listed in (2) are going to be muuuch faster to do from an existing chain than running the entire sampling process again.
  4. Development cost: generated_quantities exists now, and this issue is something we keep getting asked about. And, because of the points listed above, it will still have a use once we get a "better" solution for tracking variables during sampling.
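
For point 1, such a kwarg might look like this (a sketch; expensive_summary is a hypothetical placeholder for computation deferred to the post-processing run):

@model function mymodel(x; full::Bool = false)
    m ~ Normal()
    x ~ Normal(m, 1.0)
    # Only compute the expensive tracked quantity when re-running the model.
    extras = full ? expensive_summary(m, x) : missing
    return (m = m, extras = extras)
end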

Now, if anyone still disagrees; fight me! I'll be near the fountain behind the gray building at 6pm. No brass knuckles allowed.

(I'm kidding; I've just always wanted to say that. Please continue the discussion in a civil way.)

EDIT: Also, in my original comment I should have emphasized that I did not mean my proposal as a replacement for tracking during sampling! My bad. Rather, it's an addition, and it provides a temporary "solution" to the problem in most basic use cases, which I think is important too, given the number of times people have brought this up in our Slack channel.

EDIT 2: The above is mainly a response to the:

I wonder if we can just do away entirely with generated_quantities.

@trappmartin
Member Author

Well, I think we all agree that it’s very useful to have, and I think we should provide it as functionality. We should simply not have it as the only solution. I think the best would be both, as pointed out earlier.

@cpfiffer
Member

I think those are all great points, particularly the idea about using different models.

The memory-constraints thing is not that big an issue, though, since I was also thinking of a keyword argument to exclude/include generated quantities during sampling. Also, there's not really that much data being thrown around in most common models, so we probably shouldn't run into memory issues. We could by default not store any return values, but give that option to people who want it.

Maybe the short-term solution here is to give sample a keyword generate_quantities=true which automatically does a pass to generate the quantities and sticks them in the returned chain at the end. This way the user doesn't have to explicitly call generated_quantities for the most basic use case. It'd add additional sampling time, but I think it would be a very nice feature to have.
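
In the meantime, that behaviour could be approximated with a small wrapper (a sketch; sample_with_quantities and the kwarg are hypothetical, not an existing Turing API):

# Hypothetical wrapper mimicking the proposed `generate_quantities` kwarg.
function sample_with_quantities(model, spl, n; generate_quantities = true)
    chain = sample(model, spl, n)
    generate_quantities || return (chain = chain, quantities = nothing)
    # One extra pass over the chain, re-executing the model per sample.
    return (chain = chain, quantities = generated_quantities(model, chain))
end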

@yebai
Member

yebai commented Feb 2, 2023

I like @devmotion's proposal above: a clean syntax for specifying tracked variables would be via the return statement:

@model test(x) = begin
    m1 ~ Normal()
    m2 = f(m1) # some differentiable transformation
    x ~ Normal(m2, 1.0)
    return m2 # store returned values in the `AbstractMCMC.bundle_samples` function 
end
