
sBG model notebook for discrete/contractual setting #32

Closed · wants to merge 19 commits

Conversation

@drbenvincent (Contributor) commented May 6, 2022

Addressing #25, here is a notebook to demonstrate the sBG model for the discrete/contractual setting.
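
For context, a brief sketch of the generative model the notebook targets (the shifted-beta-geometric of Fader & Hardie; notation may differ slightly from the notebook):

\theta_i \sim \mathrm{Beta}(\alpha, \beta), \qquad P(T_i = t \mid \theta_i) = \theta_i (1 - \theta_i)^{t - 1}, \quad t = 1, 2, \ldots

That is, each customer i has a constant per-period churn probability \theta_i drawn from a Beta population distribution, and T_i is the period in which they churn.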

As well as a general review, here are a few questions and things to think about:

  • feedback on the maths and implementation would be appreciated. We seem to get estimation bias.
  • I'm happy to develop a pm.TruncatedGeometric distribution as part of this pull request, but equally it could be a separate thing. (EDIT: I think this can be done after the notebook is at least semi-finished and we have faith that it's working as intended.)

On my TODO list:

  • Add a summary, particularly highlighting the (possibly wrong) interpretation of a posterior over theta means.
  • Add code + plot to actually calculate lifetime value distribution.
  • Better function to generate synthetic data, with actual randomness
  • Add assertion that we have >1 observation
  • Resolve the estimation bias for multiple cohorts
  • Add model: theta ~ cohort
  • Add model: theta ~ year
  • Add model: theta ~ cohort + year
  • Explain why adding year as a predictor would totally change the model

@review-notebook-app (bot)

Check out this pull request on ReviewNB to see visual diffs and provide feedback on Jupyter Notebooks.

Comment on lines 1 to 2
import pymc as pm
import numpy as np
Collaborator

As we are using isort for our code style, the imports have to be in alphabetical order:

import numpy as np
import pymc as pm

If you run make lint inside the conda environment this is also done automatically. I would suggest using the pre-commit hooks to automate these checks ;)

Contributor Author

At the moment I just have black set up in VS Code. I'll try to find time to look into this, but it could be more efficient to spend 5 minutes on it in a call at some point.

import matplotlib.pyplot as plt


def plot_xY(x, Y, ax=None):
Collaborator
@juanitorduz commented May 6, 2022

Maybe we should stick with the lowercase convention for function names?

Contributor Author

Have renamed it to plot_hdi_func.

@juanitorduz (Collaborator)
I have not looked into the notebook yet (I will of course). I left some comments on code style. In addition, I would suggest we add plot_utils.py and custom_distributions into the pymmmc module (as they are not notebooks).

Comment on lines 6 to 9
"""Plot the posterior mean and 95% and 50% CI's from a given set of x and Y values.
x: is a vector
Y: is an xarray of size (chain, draw, dim)
"""
Collaborator

I suggest we keep using the NumPy docstring guide: https://numpydoc.readthedocs.io/en/latest/format.html

Contributor Author

I've attempted to improve these

To make it easier to translate into a custom distribution when the time comes
@drbenvincent (Contributor Author)
I have not looked into the notebook yet (I will of course). I left some comments on code style. In addition, I would suggest we add plot_utils.py and custom_distributions into the pymmmc module (as they are not notebooks).

Agreed. And thanks for the comments so far

@drbenvincent (Contributor Author)
After having installed the pre-commit hooks, I can't commit anything until I can figure this out:

[WARNING] Unstaged files detected.
[INFO] Stashing unstaged files to /Users/benjamv/.cache/pre-commit/patch1651872664-67063.
black....................................................................Passed
flake8...................................................................Failed
- hook id: flake8
- exit code: 1

Executable `flake8` not found

isort....................................................................Failed
- hook id: isort
- exit code: 1

Executable `isort` not found

Debug Statements (Python)................................................Passed
Trim Trailing Whitespace.................................................Passed
Fix End of Files.........................................................Passed
Check Yaml...........................................(no files to check)Skipped
Check for added large files..............................................Passed
[INFO] Restored changes from /Users/benjamv/.cache/pre-commit/patch1651872664-67063.

I pip-installed both flake8 and isort, ran pre-commit run --all, and everything passes. Might need some advice @juanitorduz

@juanitorduz (Collaborator) commented May 6, 2022

For now you can try git commit -m"my message" -n to skip the checks (see https://stackoverflow.com/questions/7230820/skip-git-commit-hooks)

I'll check the error message tomorrow.

@drbenvincent (Contributor Author)
NOTE TO SELF: remove the hierarchical inference on the multiple cohort model. There is no point when there is just a single theta?

@review-notebook-app (bot) commented May 8, 2022

larryshamalama commented on 2022-05-08T17:36:01Z
----------------------------------------------------------------

In your n array, are the 21 individuals explicitly not yet churned? I am going through the 2x2 grid and my understanding is that the subtle difference between the discrete contractual and non-contractual settings is how we characterize the last 21 individuals and how they contribute to the likelihood expression.


drbenvincent commented on 2022-05-09T07:54:33Z
----------------------------------------------------------------

So in the last time period we know that we had 21 subscribers active, but we do not know how many will churn in the current time period.


churned_in_period_t * (at.log(theta) + ((t_vec - 1) * at.log(1 - theta)))
)
# well this doesn't work either
# logp += at.sum(pm.logp(pm.Geometric.dist(p=theta), churned_in_period_t).eval())
Contributor

You were evaling just for debugging, I assume?

Contributor Author

I may not have been in my right mind when trying this.

@ricardoV94 (Contributor)
Your explanation suggests to me that this is actually a censoring case. In truncated cases we don't know how many observations we have missed, whereas in censoring we always do. When we arrive at the end of the observation period for a cohort, we know exactly how many users we haven't yet observed churning for, which seems like the censoring setting.
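
(For concreteness, a sketch of the distinction in likelihood terms, assuming lifetimes k = 1, 2, ... with per-period churn probability \theta and an observation window of T periods: a customer seen to churn in period k contributes the geometric pmf

P(k \mid \theta) = \theta (1 - \theta)^{k - 1},

while a customer still under contract at the end of the window, i.e. censored, contributes only the survival probability

P(\text{lifetime} \ge T \mid \theta) = (1 - \theta)^{T - 1}.

Under truncation, by contrast, the still-active customers would be absent from the data altogether and the churners' pmf would have to be renormalised by the probability of churning inside the window. The exact exponent on the survival term depends on how the window is indexed; see the suggested change further down this thread.)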

@ricardoV94 (Contributor)
Those posterior biases scream mis-specified log-likelihood. I'll try to do some math tomorrow morning :D

@larryshamalama (Contributor) left a comment

I am pretty sure that the source of bias in the simulations is due to this line. There should be a nice mathematical justification for this. However, there can be some discussion about this.

  • There is surely a nice mathematical justification to this. I can write something preliminary in a bit.
  • With respect to the data-generating mechanism, nT here refers to people who are explicitly censored at time T. In other words, having lifetime = T - 1 implies being churned and lifetime = T implies being censored (may or may not be churned).
  • Point 2 was confusing for me for a bit... My understanding is that the difference between contractual and non-contractual (in the discrete setting at least) is very subtle. I think that this will depend on how we define censoring at time T and a contractual model. (Correct me here) In other words, censoring at time T onwards and inclusively could be contractual with end of study time T - 1 or non-contractual with end of study time T.

Points 2 and 3 had me going in circles. The only thing I'm relatively confident in is that this change solves the bug in the simulation study.

# logp += at.sum(pm.logp(pm.Geometric.dist(p=theta), churned_in_period_t).eval())

# likelihood for final time step
logp += nT * T * at.log(1 - theta)
Contributor

Suggested change
logp += nT * T * at.log(1 - theta)
logp += nT * (T - 1) * at.log(1 - theta)
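
One way to read this suggestion (a sketch, assuming lifetimes start at 1 and nT counts the customers still active, i.e. censored, at the end of the window): each of those customers is only known to have survived the first T - 1 periods, so each contributes

P(\text{lifetime} \ge T \mid \theta) = (1 - \theta)^{T - 1},

giving nT * (T - 1) * log(1 - theta) in total. Using T in the exponent would additionally claim they survived period T itself, which has not been observed, and so overstates the evidence for survival.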

@ricardoV94 (Contributor)
ricardoV94 commented May 10, 2022

What about using a Beta (location) and Exponential (concentration) hyperprior for the hierarchical model? I think it's easier to reason about than the two Gamma hyperpriors, which mix the location and concentration information.

COORDS = {'cohorts': [f"cohort{n}" for n in range(len(data))]}

with pm.Model(coords=COORDS) as sBG_theta_per_cohort:
    loc = pm.Beta('loc', alpha=1, beta=1)
    concentration = pm.Exponential('concentration', lam=1)
    θ = pm.Beta('θ', loc * concentration, (1-loc) * concentration, dims='cohorts')
    for i, cohort_data in enumerate(data):
        truncated_geometric(f"cohort{i}", cohort_data, θ[i])

@ricardoV94 (Contributor)
I wrote a gist with what I am pretty confident is the sBG model, without the marginalization over theta: https://gist.github.com/ricardoV94/1eba51d051743773eec2e126deda3a74

It's pretty silly to have the number of variables equal to the number of customers, so I understand the point of marginalizing. One nice output of this model is that we can estimate the retention rate of customers that survive up to time T.
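
For reference, integrating each customer's theta out against the Beta(alpha, beta) prior gives the usual shifted-beta-geometric quantities (a sketch, assuming the standard Fader & Hardie parameterisation), which is what the marginalised model works with:

P(T = t \mid \alpha, \beta) = \frac{B(\alpha + 1, \beta + t - 1)}{B(\alpha, \beta)}, \qquad
P(T > t \mid \alpha, \beta) = \frac{B(\alpha, \beta + t)}{B(\alpha, \beta)},

where B is the Beta function; the survivor probability is what each censored customer contributes to the likelihood.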

@ricardoV94 (Contributor)
@drbenvincent I know this is not high-priority for you right now, but do you think we could get this to the merge point?

@drbenvincent (Contributor Author)
I have a bit more capacity now. Will try to progress it.

@cluhmann (Contributor)
I'm clearly late to the party, but @tomicapretto and I are working on what I believe to be a discrete/contractual scenario and thus hopped on this PR. To be honest, the synthetic data generation routine included here (i.e., simulate_cohort_data()) sent me off in the wrong direction for way too long. Only once I took a look at it did I realize that it was throwing out data rather than fixing all "long" lifetimes to T (at which point @ricardoV94's comment about censoring vs. truncation suddenly made sense). So if I am all caught up (which I may not be), the contractual setting actually corresponds to this:

import numpy as np
from scipy.stats import geom

# geometric lifetimes
lifetimes = geom.rvs(true_churn_p, size=initial_customers)
# censor observations at time T (longer lifetimes are only known to be >= T)
lifetimes_censor = np.where(lifetimes < T, lifetimes, T)

The reason it is "censored" rather than "truncated" (I dislike these terms because they are so easily confused) is that the lifetime of any customer currently under contract is not yet known, but we can set a lower bound on it. Yes?

If so, then this can be modeled as follows (which is a simplified, fully-pooled analogue of @ricardoV94's gist):

with pm.Model() as cens_geom:
    churn_p = pm.Beta("churn_p", 1, 1)
    obs_latent = pm.Geometric.dist(p=churn_p)
    obs = pm.Censored(
            "obs",
            obs_latent,
            lower=None,
            upper=T,
            observed=lifetimes_censor,
    )

The major advantage here is that this doesn't require any custom components (e.g., use of pm.Potential()) and should thus naturally permit things like posterior predictive sampling, etc.
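
For example (a minimal sketch, assuming the cens_geom model and lifetimes_censor array above; pm.sample and pm.sample_posterior_predictive are standard PyMC calls):

with cens_geom:
    # draw posterior samples of churn_p
    idata = pm.sample()
    # because the likelihood is a built-in (Censored) distribution,
    # posterior predictive draws work without any custom logic
    idata.extend(pm.sample_posterior_predictive(idata))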

If that's all correct, then I am wondering what it means for the status of this PR.

@ricardoV94 (Contributor)
Closing this in favor of #133

@ricardoV94 closed this Jan 25, 2023
@twiecki deleted the sBG-notebook branch September 11, 2024 07:12