LightGBM + categorical features broken inside ShapRFECV #138

Closed
timlod opened this issue Apr 16, 2021 · 4 comments
Labels
bug Something isn't working

Comments

@timlod
Contributor

timlod commented Apr 16, 2021

Describe the bug

When using LightGBM on a dataset that includes pd.Categorical features, the shap Explainer fails and advises you to use feature_perturbation="tree_path_dependent". However, since we're using LightGBM, the algorithm already chooses this by default. The real issue is that background data is passed, which isn't supported together with feature_perturbation="tree_path_dependent". The background data is passed as mask inside shap_calc().

Environment (please complete the following information):

  • probatus 1.8
  • python 3.8

To Reproduce

Use LightGBM in ShapRFECV on a dataset with categorical features.

import lightgbm as lgbm
from probatus.feature_elimination import ShapRFECV

model = lgbm.LGBMClassifier()
shap_elimination = ShapRFECV(
    clf=model,
    step=0.2,
    cv=5
)
# X is dataframe with pd.Categorical, y is binary response
report = shap_elimination.fit_compute(X, y)

Error traceback
Can't provide right now as I've already fixed this on my branch, but the error will be:

                raise Exception("Currently TreeExplainer can only handle models with categorical splits when " \
                                "feature_perturbation=\"tree_path_dependent\" and no background data is passed. Please try again using " \
                                "shap.TreeExplainer(model, feature_perturbation=\"tree_path_dependent\").")

raised from inside shap's _tree.py, as called inside shap_calc().

Expected behavior
It runs without issue, as there is support for trees with categorical features in shap.

Proposed fix
Check model type and X features inside shap_calc, and avoid passing mask if there are categorical features and the model is tree-based.
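
A minimal sketch of the check described above, not the actual probatus implementation. The names `select_mask`, `has_categorical_columns`, and `TREE_MODELS_WITH_CATEGORICAL_SPLITS` are hypothetical, and detecting the model by class name is an assumption made only to keep the example free of third-party imports. The stand-in classes below replace real estimators and DataFrames for the same reason.

```python
from types import SimpleNamespace

# Hypothetical list of estimator class names that split natively on categoricals.
TREE_MODELS_WITH_CATEGORICAL_SPLITS = ("LGBMClassifier", "LGBMRegressor")


def has_categorical_columns(X):
    """Return True if any column of a pandas-like DataFrame has the "category" dtype."""
    return any(str(dtype) == "category" for dtype in X.dtypes)


def select_mask(model, X, mask):
    """Return the background data to hand to the explainer, or None to omit it."""
    is_tree_model = type(model).__name__ in TREE_MODELS_WITH_CATEGORICAL_SPLITS
    if is_tree_model and has_categorical_columns(X):
        # Background data plus categorical splits triggers the shap exception.
        return None
    return mask


# Stand-ins for illustration only (assumptions, not real library objects).
class LGBMClassifier:
    pass


class LogisticRegression:
    pass


X_with_category = SimpleNamespace(dtypes=["float64", "category"])
X_numeric_only = SimpleNamespace(dtypes=["float64", "int64"])
background = "background sample"

mask_for_lgbm_cat = select_mask(LGBMClassifier(), X_with_category, background)
mask_for_lgbm_num = select_mask(LGBMClassifier(), X_numeric_only, background)
mask_for_linear = select_mask(LogisticRegression(), X_with_category, background)
```

With a real pandas DataFrame, `X.dtypes` yields dtype objects whose string form is "category" for categorical columns, so the same check should apply unchanged.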

@timlod timlod added the bug Something isn't working label Apr 16, 2021
@Matgrb
Contributor

Matgrb commented Apr 16, 2021

Great finding, I think this would be the way to go.

probatus by default converts the dataset to a DataFrame, and all categorical features get the "category" dtype. Indeed, not passing the mask when there are categorical features would be nice.

I am curious whether this applies only to tree-based models. Are there any linear ones that support categorical features?

Feel free to pick this issue up! 👍

@timlod
Contributor Author

timlod commented Apr 16, 2021

I'll submit a PR later today implementing both this and #106!
I believe it applies mainly to LGBM (not sure which, if any, other models handle categorical features natively). Linear models wouldn't be able to support categorical features - perhaps on the surface, but they'd have to one-hot encode internally, as you can't fit coefficients to non-numerical data.
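
To illustrate the one-hot point, here is a small pure-Python sketch. The `one_hot` helper and the coefficient values are made up for the example and do not come from any library or fitted model.

```python
def one_hot(values):
    """Encode a list of category labels as rows of 0/1 indicator columns."""
    categories = sorted(set(values))
    rows = [[1.0 if value == category else 0.0 for category in categories]
            for value in values]
    return categories, rows


colors = ["red", "green", "red", "blue"]
categories, encoded = one_hot(colors)

# A linear model needs one numeric coefficient per indicator column.
# These values are invented purely for illustration.
coefficients = {"blue": 0.5, "green": -1.0, "red": 2.0}

predictions = [
    sum(x * coefficients[category] for x, category in zip(row, categories))
    for row in encoded
]
```

Each prediction is simply the coefficient of the row's category, which is why a raw label such as "red" cannot be used directly: it has to become a numeric indicator first.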

@operte
Contributor

operte commented Apr 16, 2021

Good one! I think this will also solve the problems we were having with testing dummy data with a categorical column, won't it @Matgrb?

@Matgrb
Contributor

Matgrb commented Apr 18, 2021

Covered in #139

3 participants