sBG model notebook for discrete/contractual setting #32
Conversation
Check out this pull request on ReviewNB to see visual diffs & provide feedback on Jupyter Notebooks.
notebooks/custom_distributions.py (Outdated)

import pymc as pm
import numpy as np
As we are using isort for our code style, the imports have to be in alphabetical order:

import numpy as np
import pymc as pm

If you run make lint inside the conda environment, this is also done automatically. I would suggest using the pre-commit hooks to automate these checks ;)
At the moment I just have black set up in VS Code. Will try to find time to look into this, but could be efficient to spend 5 mins on this in a call at some point.
notebooks/plot_utils.py (Outdated)

import matplotlib.pyplot as plt


def plot_xY(x, Y, ax=None):
Maybe we stick with the lower case convention for functions?
Have renamed to plot_hdi_func.
I have not looked into the notebook yet (I will, of course). I left some comments on code style. In addition, I would suggest we add the …
notebooks/plot_utils.py (Outdated)

"""Plot the posterior mean and 95% and 50% CI's from a given set of x and Y values.

x: is a vector
Y: is an xarray of size (chain, draw, dim)
"""
I suggest we keep using the numpy docstring guide: https://numpydoc.readthedocs.io/en/latest/format.html
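For illustration, the plot_xY docstring rewritten in numpydoc style might look something like this (a sketch of mine; the parameter descriptions are guesses based on the comments above, not the actual notebook code):

```python
def plot_xY(x, Y, ax=None):
    """Plot the posterior mean and 95% and 50% CI's of Y against x.

    Parameters
    ----------
    x : array-like
        Values for the horizontal axis.
    Y : xarray.DataArray
        Posterior samples of shape (chain, draw, dim).
    ax : matplotlib.axes.Axes, optional
        Axes to draw into. If None, new axes are created.

    Returns
    -------
    matplotlib.axes.Axes
        The axes the plot was drawn on.
    """
    ...  # plotting body omitted in this sketch
```

The section headings (Parameters, Returns) and the underline-free layout above follow the numpydoc format linked in the comment.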
I've attempted to improve these
To make it easier to translate into a custom distribution when the time comes
Agreed. And thanks for the comments so far.
After having installed the pre-commit hooks I can't commit anything until I can figure this out
I pip installed both …

For now you can try git commit -m "my message" -n to skip the checks (see https://stackoverflow.com/questions/7230820/skip-git-commit-hooks). I'll check the error message tomorrow.
NOTE TO SELF: remove the hierarchical inference on the multiple cohort model. There is no point when there is just a single theta?
View / edit / reply to this conversation on ReviewNB

larryshamalama commented on 2022-05-08T17:36:01Z: In your …

drbenvincent commented on 2022-05-09T07:54:33Z: So in the last time period we know that we had 21 subscribers active, but we do not know how many will churn in the current time period.
    churned_in_period_t * (at.log(theta) + ((t_vec - 1) * at.log(1 - theta)))
)
# well this doesn't work either
# logp += at.sum(pm.logp(pm.Geometric.dist(p=theta), churned_in_period_t).eval())
You were eval-ing just for debugging, I assume?
I may not have been in my right mind when trying this.
Your explanation suggests to me that this is actually a censoring case. In truncated cases, we don't know how many observations we have missed, whereas in censoring we always do. When we arrive at the end of the observation period for a cohort, we know exactly how many users we haven't yet observed churning, which seems like the censoring setting.
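The censoring-vs-truncation distinction can be checked numerically with scipy (a sketch of mine, not code from this PR): under censoring at T, a customer still active at T contributes the survival mass P(X >= T), while under truncation the lifetimes beyond the window are never observed at all, so the pmf over the observed support must be renormalized.

```python
import numpy as np
from scipy.stats import geom

p, T = 0.3, 5

# Censoring: a customer still active at time T contributes
# P(X >= T) = P(X > T - 1) = (1 - p)**(T - 1)
censored_contrib = geom.sf(T - 1, p)

# Truncation: lifetimes >= T are never observed, so the pmf over
# {1, ..., T - 1} is renormalized by the probability of being observed
k = np.arange(1, T)
truncated_pmf = geom.pmf(k, p) / geom.cdf(T - 1, p)

print(censored_contrib)     # equals (1 - 0.3)**4
print(truncated_pmf.sum())  # renormalized pmf sums to 1
```

The key practical difference: with censoring the T-period observations still enter the likelihood (as survival mass), whereas with truncation they are absent and the normalizing constant changes instead.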
Those posterior biases scream mis-specified log-likelihood. I'll try to do some math tomorrow morning :D
I am pretty sure that the source of bias in the simulations is due to this line. There should be a nice mathematical justification for this. However, there can be some discussion about this.

1. There is surely a nice mathematical justification for this. I can write something preliminary in a bit.
2. With respect to the data-generating mechanism, nT here refers to people who are explicitly censored at time T. In other words, having lifetime = T - 1 implies being churned, and lifetime = T implies being censored (may or may not be churned).
3. Point 2 was confusing for me for a bit... My understanding is that the difference between contractual and non-contractual (in the discrete setting at least) is very subtle. I think that this will depend on how we define censoring at time T and a contractual model. (Correct me here.) In other words, censoring at time T onwards and inclusively could be contractual with end-of-study time T - 1, or non-contractual with end-of-study time T.

Points 2 and 3 had me going in circles. The only thing I'm relatively confident in is that this change solves the bug in the simulation study.
# logp += at.sum(pm.logp(pm.Geometric.dist(p=theta), churned_in_period_t).eval())

# likelihood for final time step
logp += nT * T * at.log(1 - theta)
Suggested change:

- logp += nT * T * at.log(1 - theta)
+ logp += nT * (T - 1) * at.log(1 - theta)
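As a numerical sanity check on this T - 1 exponent (a scipy-based sketch of mine, not code from the PR): simulate geometric lifetimes right-censored at T, then maximize the log-likelihood with each exponent. The T - 1 version recovers the true churn probability, while the T version pulls the estimate low, because each censored customer gets one extra spurious factor of (1 - theta).

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import geom

rng = np.random.default_rng(0)
true_p, T, N = 0.3, 8, 50_000

# simulate geometric lifetimes and right-censor them at T
lifetimes = geom.rvs(true_p, size=N, random_state=rng)
censored = np.minimum(lifetimes, T)
churned = censored[censored < T]   # churn observed in periods 1..T-1
nT = int(np.sum(censored == T))    # still active at T (censored)

def neg_logp(theta, censor_exponent):
    # a churn at time t contributes log[theta * (1 - theta)**(t - 1)]
    ll = np.sum(np.log(theta) + (churned - 1) * np.log1p(-theta))
    # each censored customer contributes censor_exponent * log(1 - theta)
    ll += nT * censor_exponent * np.log1p(-theta)
    return -ll

biased = minimize_scalar(neg_logp, args=(T,),
                         bounds=(1e-6, 1 - 1e-6), method="bounded").x
fixed = minimize_scalar(neg_logp, args=(T - 1,),
                        bounds=(1e-6, 1 - 1e-6), method="bounded").x
print(f"true={true_p}, with T-1: {fixed:.4f}, with T: {biased:.4f}")
```

The correct censored contribution is P(X >= T) = (1 - theta)**(T - 1), which is exactly what the suggested change encodes.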
What about using a coords dict?

COORDS = {'cohorts': [f"cohort{n}" for n in range(len(data))]}

with pm.Model(coords=COORDS) as sBG_theta_per_cohort:
    loc = pm.Beta('loc', alpha=1, beta=1)
    concentration = pm.Exponential('concentration', lam=1)
    θ = pm.Beta('θ', loc * concentration, (1 - loc) * concentration, dims='cohorts')
    for i, cohort_data in enumerate(data):
        truncated_geometric(f"cohort{i}", cohort_data, θ[i])
I wrote a gist with what I am pretty confident is the sBG model, without the marginalization over theta: https://gist.github.com/ricardoV94/1eba51d051743773eec2e126deda3a74 It's pretty silly to have the number of variables equal the number of customers, so I understand the point of marginalizing. One nice output of this model is that we can estimate the retention rate of customers that survive up to time T.
@drbenvincent I know this is not high-priority for you right now, but do you think we could get this to the merge point?
I have a bit more capacity now. Will try to progress it |
I'm clearly late to the party, but @tomicapretto and I are working on what I believe to be a discrete/contractual scenario and thus hopped on this PR. To be honest, the synthetic data generation routine included here is, i.e.

# geometric lifetimes
lifetimes = geom.rvs(true_churn_p, size=initial_customers)
# censor observations at time T
lifetimes_censor = np.where(lifetimes < T, lifetimes, T)

The reason it is "censored" rather than "truncated" (I dislike these terms because they are so easily confused) is because lifetimes_censor records the value T for every customer still active at time T; we know exactly how many observations are censored, and when.

If so, then this can be modeled as follows (which is a simplified, fully-pooled analogue of @ricardoV94's gist):

with pm.Model() as cens_geom:
    churn_p = pm.Beta("churn_p", 1, 1)
    obs_latent = pm.Geometric.dist(p=churn_p)
    obs = pm.Censored(
        "obs",
        obs_latent,
        lower=None,
        upper=T,
        observed=lifetimes_censor,
    )

The major advantage here is that this doesn't require any custom components (e.g., use of …). If that's all correct, then I am wondering what it means for the status of this PR.
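One way to convince ourselves that the two parameterizations agree (a scipy sketch under the convention discussed in this thread, where an observation equal to T means "still active"): the censored-geometric log-likelihood should equal the manual sBG building blocks, i.e. geometric pmf terms for churners plus nT * (T - 1) * log(1 - theta) for the censored customers.

```python
import numpy as np
from scipy.stats import geom

p, T = 0.25, 6
lifetimes_censor = np.array([1, 3, 6, 6, 2])  # toy data; value == T means censored
is_cens = lifetimes_censor == T

# Censored-geometric log-likelihood:
# pmf mass for observed churners, survival mass P(X >= T) for censored customers
logp = np.where(is_cens,
                geom.logsf(T - 1, p),
                geom.logpmf(lifetimes_censor, p)).sum()

# Manual form from this thread: churn terms plus nT * (T - 1) * log(1 - theta)
churned = lifetimes_censor[~is_cens]
manual = (np.sum(np.log(p) + (churned - 1) * np.log1p(-p))
          + is_cens.sum() * (T - 1) * np.log1p(-p))

assert np.isclose(logp, manual)
```

This is only a numerical identity check in scipy; whether a given censoring implementation treats an observation at the bound as P(X >= T) or P(X > T) in the discrete case is exactly the subtlety raised in points 2 and 3 above, so it is worth verifying against the library's log-probability directly.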
Closing this in favor of #133 |
Addressing #25, here is a notebook to demonstrate the sBG model for the discrete/contractual setting.
As well as a general review, here are a few questions or things to think about
I'm happy to develop a pm.TruncatedGeometric distribution as part of this pull request, but equally it could be a separate thing. (EDIT: I think this can be done after the notebook is at least semi-finished and we have faith that it's working as intended.)

On my TODO list is:

- Add model: theta ~ cohort
- Add model: theta ~ year
- Add model: theta ~ cohort + year