
sBG model notebook for discrete/contractual setting #32

Closed · wants to merge 19 commits

Conversation

@drbenvincent (Contributor) commented May 6, 2022

Addressing #25, here is a notebook to demonstrate the sBG model for the discrete/contractual setting.
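
For context, a brief sketch of the generative model the notebook targets (the shifted-beta-geometric of Fader & Hardie; notation may differ slightly from the notebook):

\theta_i \sim \mathrm{Beta}(\alpha, \beta), \qquad P(T_i = t \mid \theta_i) = \theta_i (1 - \theta_i)^{t - 1}, \quad t = 1, 2, \ldots

That is, each customer i has a constant per-period churn probability \theta_i drawn from a Beta population distribution, and T_i is the period in which they churn.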

As well as a general review, here are a few questions and things to think about:

  • feedback on the maths and implementation would be appreciated. We seem to get estimation bias.
  • I'm happy to develop a pm.TruncatedGeometric distribution as part of this pull request, but equally it could be a separate thing. (EDIT: I think this can be done after the notebook is at least semi-finished and we have faith that it's working as intended.)

On my TODO list:

  • Add a summary, particularly highlighting the (possibly wrong) interpretation of a posterior over theta means.
  • Add code + plot to actually calculate lifetime value distribution.
  • Better function to generate synthetic data, with actual randomness
  • Add assertion that we have >1 observation
  • Resolve the estimation bias for multiple cohorts
  • Add model: theta ~ cohort
  • Add model: theta ~ year
  • Add model: theta ~ cohort + year
  • Explain why adding year as a predictor would totally change the model

@review-notebook-app (bot)

Check out this pull request on ReviewNB to see visual diffs and provide feedback on Jupyter Notebooks.

Comment on lines 1 to 2
import pymc as pm
import numpy as np
Collaborator

As we are using isort for our code style, the imports have to be in alphabetical order:

import numpy as np
import pymc as pm

If you run make lint inside the conda environment this is also done automatically. I would suggest using the pre-commit hooks to automate these checks ;)

Contributor Author

At the moment I just have black set up in VS Code. I'll try to find time to look into this, but it could be more efficient to spend 5 minutes on it in a call at some point.

import matplotlib.pyplot as plt


def plot_xY(x, Y, ax=None):
Collaborator
@juanitorduz commented May 6, 2022

Maybe we should stick with the lowercase convention for function names?

Contributor Author

Have renamed it to plot_hdi_func.

@juanitorduz (Collaborator)
I have not looked into the notebook yet (I will of course). I left some comments on code style. In addition, I would suggest we add plot_utils.py and custom_distributions into the pymmmc module (as they are not notebooks).

Comment on lines 6 to 9
"""Plot the posterior mean and 95% and 50% CI's from a given set of x and Y values.
x: is a vector
Y: is an xarray of size (chain, draw, dim)
"""
Collaborator

I suggest we keep using the NumPy docstring guide: https://numpydoc.readthedocs.io/en/latest/format.html

Contributor Author

I've attempted to improve these

To make it easier to translate into a custom distribution when the time comes
@drbenvincent (Contributor Author)
I have not looked into the notebook yet (I will of course). I left some comments on code style. In addition, I would suggest we add plot_utils.py and custom_distributions into the pymmmc module (as they are not notebooks).

Agreed. And thanks for the comments so far

@drbenvincent (Contributor Author)
After having installed the pre-commit hooks, I can't commit anything until I can figure this out:

[WARNING] Unstaged files detected.
[INFO] Stashing unstaged files to /Users/benjamv/.cache/pre-commit/patch1651872664-67063.
black....................................................................Passed
flake8...................................................................Failed
- hook id: flake8
- exit code: 1

Executable `flake8` not found

isort....................................................................Failed
- hook id: isort
- exit code: 1

Executable `isort` not found

Debug Statements (Python)................................................Passed
Trim Trailing Whitespace.................................................Passed
Fix End of Files.........................................................Passed
Check Yaml...........................................(no files to check)Skipped
Check for added large files..............................................Passed
[INFO] Restored changes from /Users/benjamv/.cache/pre-commit/patch1651872664-67063.

I pip-installed both flake8 and isort, ran pre-commit run --all, and everything passes. Might need some advice @juanitorduz

@juanitorduz (Collaborator) commented May 6, 2022

For now you can try git commit -m"my message" -n to skip the checks (see https://stackoverflow.com/questions/7230820/skip-git-commit-hooks)

I'll check the error message tomorrow.

@drbenvincent (Contributor Author)
NOTE TO SELF: remove the hierarchical inference on the multiple cohort model. There is no point when there is just a single theta?

@review-notebook-app (bot) commented May 8, 2022

larryshamalama commented on 2022-05-08T17:36:01Z
----------------------------------------------------------------

In your n array, are the 21 individuals explicitly not yet churned? I am going through the 2x2 grid and my understanding is that the subtle difference between the discrete contractual and non-contractual settings is how we characterize the last 21 individuals and how they contribute to the likelihood expression.


drbenvincent commented on 2022-05-09T07:54:33Z
----------------------------------------------------------------

So in the last time period we know that we had 21 subscribers active, but we do not know how many will churn in the current time period.


churned_in_period_t * (at.log(theta) + ((t_vec - 1) * at.log(1 - theta)))
)
# well this doesn't work either
# logp += at.sum(pm.logp(pm.Geometric.dist(p=theta), churned_in_period_t).eval())
Contributor

You were evaling just for debugging, I assume?

Contributor Author

I may not have been in my right mind when trying this.

@ricardoV94 (Contributor)
Your explanation suggests to me that this is actually a censoring case. In truncated cases we don't know how many observations we have missed, whereas in censoring we always do. When we arrive at the end of the observation period for a cohort, we know exactly how many users we haven't yet observed churning for, which seems like the censoring setting.
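
(For concreteness, a sketch of the distinction in likelihood terms, assuming lifetimes k = 1, 2, ... with per-period churn probability \theta and an observation window of T periods: a customer seen to churn in period k contributes the geometric pmf

P(k \mid \theta) = \theta (1 - \theta)^{k - 1},

while a customer still under contract at the end of the window, i.e. censored, contributes only the survival probability

P(\text{lifetime} \ge T \mid \theta) = (1 - \theta)^{T - 1}.

Under truncation, by contrast, the still-active customers would be absent from the data altogether and the churners' pmf would have to be renormalised by the probability of churning inside the window. The exact exponent on the survival term depends on how the window is indexed; see the suggested change further down this thread.)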

@ricardoV94 (Contributor)
Those posterior biases scream mis-specified log-likelihood. I'll try to do some math tomorrow morning :D

@larryshamalama (Contributor) left a comment

I am pretty sure that the source of bias in the simulations is due to this line. There should be a nice mathematical justification for this. However, there can be some discussion about this.

  • There is surely a nice mathematical justification to this. I can write something preliminary in a bit.
  • With respect to the data-generating mechanism, nT here refers to people who are explicitly censored at time T. In other words, having lifetime = T - 1 implies being churned and lifetime = T implies being censored (may or may not be churned).
  • Point 2 was confusing for me for a bit... My understanding is that the difference between contractual and non-contractual (in the discrete setting at least) is very subtle. I think that this will depend on how we define censoring at time T and a contractual model. (Correct me here) In other words, censoring at time T onwards and inclusively could be contractual with end of study time T - 1 or non-contractual with end of study time T.

Points 2 and 3 had me going in circles. The only thing I'm relatively confident in is that this change solves the bug in the simulation study.

# logp += at.sum(pm.logp(pm.Geometric.dist(p=theta), churned_in_period_t).eval())

# likelihood for final time step
logp += nT * T * at.log(1 - theta)
Contributor

Suggested change
logp += nT * T * at.log(1 - theta)
logp += nT * (T - 1) * at.log(1 - theta)
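
One way to read this suggestion (a sketch, assuming lifetimes start at 1 and nT counts the customers still active, i.e. censored, at the end of the window): each of those customers is only known to have survived the first T - 1 periods, so each contributes

P(\text{lifetime} \ge T \mid \theta) = (1 - \theta)^{T - 1},

giving nT * (T - 1) * log(1 - theta) in total. Using T in the exponent would additionally claim they survived period T itself, which has not been observed, and so overstates the evidence for survival.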

@ricardoV94 (Contributor)
ricardoV94 commented May 10, 2022

What about using a Beta (location) and Exponential (concentration) hyperprior for the hierarchical model? I think it's easier to reason about than the two Gamma hyperpriors, which mix the location and concentration information.

COORDS = {'cohorts': [f"cohort{n}" for n in range(len(data))]}

with pm.Model(coords=COORDS) as sBG_theta_per_cohort:
    loc = pm.Beta('loc', alpha=1, beta=1)
    concentration = pm.Exponential('concentration', lam=1)
    θ = pm.Beta('θ', loc * concentration, (1-loc) * concentration, dims='cohorts')
    for i, cohort_data in enumerate(data):
        truncated_geometric(f"cohort{i}", cohort_data, θ[i])

@ricardoV94 (Contributor)
I wrote a gist with what I am pretty confident is the sBG model, without the marginalization over theta: https://gist.github.com/ricardoV94/1eba51d051743773eec2e126deda3a74

It's pretty silly to have the number of variables equal to the number of customers, so I understand the point of marginalizing. One nice output of this model is that we can estimate the retention rate of customers that survive up to time T.
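
For reference, integrating each customer's theta out against the Beta(alpha, beta) prior gives the usual shifted-beta-geometric quantities (a sketch, assuming the standard Fader & Hardie parameterisation), which is what the marginalised model works with:

P(T = t \mid \alpha, \beta) = \frac{B(\alpha + 1, \beta + t - 1)}{B(\alpha, \beta)}, \qquad
P(T > t \mid \alpha, \beta) = \frac{B(\alpha, \beta + t)}{B(\alpha, \beta)},

where B is the Beta function; the survivor probability is what each censored customer contributes to the likelihood.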

@ricardoV94 (Contributor)
@drbenvincent I know this is not high-priority for you right now, but do you think we could get this to the merge point?

@drbenvincent (Contributor Author)
I have a bit more capacity now. Will try to progress it.

@cluhmann (Contributor)
I'm clearly late to the party, but @tomicapretto and I are working on what I believe to be a discrete/contractual scenario and thus hopped on this PR. To be honest, the synthetic data generation routine included here (i.e., simulate_cohort_data()) sent me off in the wrong direction for way too long. Only once I took a look at it did I realize that it was throwing out data rather than fixing all "long" lifetimes to T (at which point @ricardoV94's comment about censoring vs. truncation suddenly made sense). So if I am all caught up (which I may not be), the contractual setting actually corresponds to this:

import numpy as np
from scipy.stats import geom

# geometric lifetimes
lifetimes = geom.rvs(true_churn_p, size=initial_customers)
# censor observations at time T (longer lifetimes are only known to be >= T)
lifetimes_censor = np.where(lifetimes < T, lifetimes, T)

The reason it is "censored" rather than "truncated" (I dislike these terms because they are so easily confused) is that the lifetime of any customer currently under contract is not yet known, but we can set a lower bound on it. Yes?

If so, then this can be modeled as follows (which is a simplified, fully-pooled analogue of @ricardoV94's gist):

with pm.Model() as cens_geom:
    churn_p = pm.Beta("churn_p", 1, 1)
    obs_latent = pm.Geometric.dist(p=churn_p)
    obs = pm.Censored(
            "obs",
            obs_latent,
            lower=None,
            upper=T,
            observed=lifetimes_censor,
    )

The major advantage here is that this doesn't require any custom components (e.g., use of pm.Potential()) and should thus naturally permit things like posterior predictive sampling, etc.
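
For example (a minimal sketch, assuming the cens_geom model and lifetimes_censor array above; pm.sample and pm.sample_posterior_predictive are standard PyMC calls):

with cens_geom:
    # draw posterior samples of churn_p
    idata = pm.sample()
    # because the likelihood is a built-in (Censored) distribution,
    # posterior predictive draws work without any custom logic
    idata.extend(pm.sample_posterior_predictive(idata))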

If that's all correct, then I am wondering what it means for the status of this PR.

@ricardoV94 (Contributor)
Closing this in favor of #133

@ricardoV94 closed this Jan 25, 2023
@twiecki deleted the sBG-notebook branch September 11, 2024 07:12