-
Notifications
You must be signed in to change notification settings - Fork 239
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow static covariates in BGNBDModel #1390
base: main
Are you sure you want to change the base?
Conversation
…taGeoModel. Add some tests
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #1390 +/- ##
==========================================
- Coverage 92.58% 92.48% -0.11%
==========================================
Files 52 52
Lines 6043 6095 +52
==========================================
+ Hits 5595 5637 +42
- Misses 448 458 +10 ☔ View full report in Codecov by Sentry. |
Merging #1375 has created a merge conflict in |
The |
…thCovariates.test_extract_predictive_covariates
I agree it should be left to a separate PR, but before creating an issue let me give it some more thought. This would require refactoring to pass arguments into the internal method for the specific model param names, and probably conditionals for the child model class instance itself. I'm not sure if an overly complex, hard to maintain base class method is worth eliminating boilerplate in the child classes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly minor comments around things like renaming variables, but the poor convergence in distribution_new_customers
needs more investigation and plotting of visuals in a notebook.
On that note, a covariate example will need to be added to the notebooks for BGNBD and/or the Quickstart. You can work off the example in the PNBD notebook.
I have an idea on why distribution_new_customers
may not be converging well. I'll create an issue for it.
@@ -140,6 +144,9 @@ class BetaGeoModel(CLVModel): | |||
Error Problem." http://brucehardie.com/notes/027/bgnbd_num_error.pdf. | |||
.. [4] Fader, P. S. & Hardie, B. G. (2019) "A Step-by-Step Derivation of the BG/NBD | |||
Model." https://www.brucehardie.com/notes/039/bgnbd_derivation__2019-11-06.pdf | |||
.. [5] Fader, Peter & G. S. Hardie, Bruce (2007). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add an in-line citation for this reference in the top-level of the docstring.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
purchase_coefficient_gamma1 = self.model_config[ | ||
"purchase_coefficient_prior" | ||
].create_variable("purchase_coefficient_gamma1") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is the gamma1
suffix being used here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Possible to use model coordinates instead?
dropout_coefficient_gamma2 = self.model_config[ | ||
"dropout_coefficient_prior" | ||
].create_variable("dropout_coefficient_gamma2") | ||
dropout_coefficient_gamma3 = self.model_config[ | ||
"dropout_coefficient_prior" | ||
].create_variable("dropout_coefficient_gamma3") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we change these _gamma%
suffixes to _alpha
and _beta
? Gamma is a confusing term because it pertains to the purchasing process in the research.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was trying to follow the convention here
. There is no
beta
, but a
and b
(I can call them coefficient_a, and coefficient_b if you like)
- We can rename as
purchase_coefficient
,dropout_coefficient
to follow the implementation inParetoNBDModel
. But then thegamma2
andgamma3
coefficients must be equal. This is in fact how the implementation in R's CLVTools is done btw. - I tried that implementation, and is not helping with the convergence
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh my mistake - I meant _a
and _b
.
Are you saying CLVTools fixes these coefficients to be equal to each other? They share the same data, but this doesn't seem to align with the research note.
Also, which implementation is not helping with convergence?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you saying CLVTools fixes these coefficients to be equal to each other? They share the same data, but this doesn't seem to align with the research note.
That is indeed the case. See here and here
Also, which implementation is not helping with convergence?
Using the same implementation as CLVTools, fixing gamma2=gamma3. It did not help with the test issues.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My best guess as to why the CLVTools developers did this was for easier interpretability and/or to speed up model fits. What's weird is why they still went with it if convergence is negatively impacted. This could be a good selling point for pymc-marketing
compared to other open-source tools.
Explaining covariate impacts on overall dropout in terms of separate a
and b
coefficients will be tricky, but not impossible. Generally if a >b
, then p
increases, and vice-versa. The greater both values are, the narrower the distribution.
dropout_coefficient_gamma2 = self.model_config[ | ||
"dropout_coefficient_prior" | ||
].create_variable("dropout_coefficient_gamma2") | ||
dropout_coefficient_gamma3 = self.model_config[ | ||
"dropout_coefficient_prior" | ||
].create_variable("dropout_coefficient_gamma3") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See previous comment.
This hierarchical pooling conditional block may be worth its own internal method because it appears in several models, but we can leave that as a separate PR.
tests/clv/models/test_beta_geo.py
Outdated
# NOTE: These are the problematic tests due to poor convergence | ||
# We would need to test eg: | ||
# assert (res_zero["dropout"] < res_high["dropout"]).all() | ||
# Instead we test "less than" within tolerance | ||
assert ( | ||
( | ||
res_zero["recency_frequency"].sel(obs_var="recency") | ||
- res_high["recency_frequency"].sel(obs_var="recency") | ||
) | ||
< 0.35 | ||
).all() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We may need to break this into multiple tests, because there are 9 separate assert
statements happening here.
We should also investigate this particular assert
in a notebook because the corresponding test in ParetoNBDModel
doesn't have this problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Related to the notebook. I am drafting something in #1430. That PR is more focused on:
- Fixing plotting to allow covariates
- Introducing a new dataset
ApparelTrans
exported fromCLVTools
's covariates examples
I wanted to keep this in a separate PR, so I will be adding here only a small gist with convergence plots
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I replaced these final tests all together taking onto account the properties of the Beta distribution.
It should make sense now.
Created an issue that may be related to the poor convergence: #1431 |
Check out this pull request on See visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB |
OK, the plot kind of thickened.
I added
This was the whole thing!. I was not taking into account the properties of the Beta dist, and following the ParetoNBD implementation blindly. Working on the notebook, I thought I was going crazy. E["dropout"]~0.5 always, regardless of the covariates I was using. Note that in the test setup we have equal coefficients: Note however some findings. We are using |
hey! are we missing something here? any blockers :) ? |
Just haven't found time to look at it yet 🤔 Reviewing now. |
It is good to ship on my side, and would allow to work on a clean branch on the addition of covariates to the @ColtAllen requested changes a couple of weeks ago related to some tests. I believe I've addressed all the worries he expressed (see dev notebook), but would be good to have the green light on his side. |
|
|
||
a = pm.Deterministic("a", phi_dropout * kappa_dropout) | ||
b = pm.Deterministic("b", (1.0 - phi_dropout) * kappa_dropout) | ||
if self.dropout_covariate_cols: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you tested the phi/kappa hierarchical pooling for a_scale
and b_scale
with covariates? Any major differences in results and/or fit times compared to specifying a prior for these params?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can play with it in #1430 docs/source/notebooks/clv/dev/bg_nbd_covariates.ipynb
. By removing the custom prior, and using default_model_config
as in:
bgnbd = clv.BetaGeoModel(
rfm_data,
)
It works. The model fits, but we get quite biased estimates.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So specifying a_prior
and b_prior
in the model config is recommended when using covariates? We should probably mention this somewhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
||
def test_expectation_method(self): | ||
"""Test that predictive methods work with covariates""" | ||
# Higher covariates with positive coefficients -> higher change of death and vice-versa |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do your experiments in the dev notebook confirm this? This seems copy/pasted from the equivalent test for ParetoNBD model.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code looks good! Just some clarifying questions regarding notebook experiments and request to add an additional test condition, and I think this will be good to merge!
rtol=0.6, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add an additional test to check for the nested prior config when a_prior
and b_prior
aren't specified?
Also, is this the smallest rtol you could get passing results with? The same test has an rtol of only 0.2 for ParetoNBDModel
.
Description
Allows static covariates in
BetaGeoModel
NOTE: It seems there are convergence issues with the dropout-covariates-related paramsa|b
. Related to similar observations by @juanitorduz here. As a consequence the last two assertions intest_distribution_method
are a dubious hack.Related Issue
Checklist
Modules affected
Type of change
📚 Documentation preview 📚: https://pymc-marketing--1390.org.readthedocs.build/en/1390/