
Avoid generating NaNs in UnwoundPathSum #3943

Merged: 2 commits into dmlc:master on Jan 3, 2019
Conversation

@Kodiologist (Contributor) commented Nov 27, 2018:

This is my attempt to fix the bug that creates NaN SHAP values (#3333). I have an example R script (below) that produces NaN SHAP values without this patch and none with it, but I don't know whether my change is statistically correct.

The script is:

library(xgboost)

d = data.frame(
    x1 = c(-2.3, 1.4, 5.9, 2, 2.5, 0.3, -3.6, -0.2, 0.5, -2.8, -4.6, 3.3, -1.2, -1.1, -2.3, 0.4, -1.5, -0.2, -1, 3.7),
    x2 = c(291.179171, 269.198331, 289.942097, 283.191669, 269.673332, 294.158346, 287.255835, 291.530838, 285.899586, 269.290833, 268.649586, 291.530841, 280.074593, 269.484168, 293.94042, 294.327506, 296.20709, 295.441669, 283.16792, 270.227085),
    y = c(9, 15, 5.7, 9.2, 22.4, 5, 9, 3.2, 7.2, 13.1, 7.8, 16.9, 6.5, 22.1, 5.3, 10.4, 11.1, 13.9, 11, 20.5),
    fold = c(2, 2, 2, 1, 2, 2, 1, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2))

ivs = c("x1", "x2")

set.seed(50)

fit = xgboost(
    verbose = 0,
    params = list(
        objective = "reg:linear",
        eval_metric = "rmse"),
    data = as.matrix(subset(d, fold == 2)[, ivs]),
    label = subset(d, fold == 2)$y,
    nthread = 1,
    nrounds = 3)

shaps = as.data.frame(predict(fit,
    newdata = as.matrix(subset(d, fold == 1)[, ivs]),
    predcontrib = T))
result = cbind(shaps, sum = rowSums(shaps), pred = predict(fit,
    newdata = as.matrix(subset(d, fold == 1)[, ivs])))
print(result)

message(if (identical(TRUE, all.equal(result$sum, result$pred, tol = 1e-6)))
        "Looks good"
    else
        "Oh no")

Its output without the change is:

          x1  x2     BIAS sum      pred
1  0.2951230 NaN 6.967499 NaN  5.478699
2 -0.6753536 NaN 6.967499 NaN  5.478699
3  0.2951230 NaN 6.967499 NaN  5.478699
4  0.8563597 NaN 6.967499 NaN 11.253153
Oh no

Its output with the change is:

          x1         x2     BIAS       sum      pred
1  0.2951230 -1.7839233 6.967499  5.478699  5.478699
2 -0.6753536 -0.8134466 6.967499  5.478699  5.478699
3  0.2951230 -1.7839233 6.967499  5.478699  5.478699
4  0.8563597  3.4292932 6.967499 11.253152 11.253153
Looks good

CC @slundberg, @allanjust, @liuyanguu

Fixes #3333.

@slundberg (Contributor):
Thanks @Kodiologist! I think your change retains the correctness of the method. If this resolves the NaNs, then the issue must have been that the proportion of training samples and/or the depth weighting caused an underflow, in which case the terms would fail to cancel out. Unless anyone else has comments, I think this should be merged.
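The underflow hypothesis can be illustrated with a small Python sketch. The `unwound_term` helper below is purely illustrative (it is not XGBoost's actual code); it emulates C++ floating-point division, where 0.0/0.0 yields NaN rather than raising an error:

```python
import math

def unwound_term(numerator, weight):
    """Illustrative stand-in for the division done while unwinding a path.
    C/C++ float division returns NaN for 0.0/0.0 instead of raising an
    exception, so we emulate that behavior here."""
    if weight == 0.0:
        return math.nan if numerator == 0.0 else math.copysign(math.inf, numerator)
    return numerator / weight

# A long product of small per-node proportions can underflow to exactly 0.0
# in double precision (the smallest subnormal double is ~5e-324) ...
w = 1e-200 * 1e-200
print(w == 0.0)               # True: the product underflowed

# ... and then the intended cancellation becomes 0.0 / 0.0, which is NaN
# instead of the finite ratio the algorithm expects.
print(unwound_term(0.0, w))   # nan
```

This is consistent with the symptom in #3333: the NaN appears only for some inputs, since only some paths accumulate a weight small enough to underflow.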

@Kodiologist (Contributor, Author):
I remembered that the SHAP values for an observation always sum to the prediction for that observation, so we can use that as a check. It turns out that it works! Hooray! I've incorporated it into the test script above, which should probably be put in XGBoost's test suite in some form.
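That additivity (local accuracy) check can be restated on its own in plain Python, using the per-row numbers from the corrected output above; this is just a sanity-check sketch, not XGBoost code:

```python
# Rows of (x1, x2, BIAS, pred) taken from the patched output in this PR.
# The SHAP contributions plus the bias term should reproduce each raw prediction.
rows = [
    ( 0.2951230, -1.7839233, 6.967499,  5.478699),
    (-0.6753536, -0.8134466, 6.967499,  5.478699),
    ( 0.2951230, -1.7839233, 6.967499,  5.478699),
    ( 0.8563597,  3.4292932, 6.967499, 11.253153),
]
for x1, x2, bias, pred in rows:
    # Allow a small tolerance, mirroring all.equal(..., tol = 1e-6) in the R script
    assert abs((x1 + x2 + bias) - pred) < 1e-5
print("Looks good")
```

Before the fix, the x2 column was NaN, so every row sum was NaN and this check failed immediately.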

By the way, I think the Makefile for this project may not be tracking all dependencies correctly: I found that I had to run make clean before each make for the R package to pick up changes to the C++ code outside its own interface code.

@hcho3 (Collaborator) commented Nov 29, 2018:
@Kodiologist Thanks a lot for your contribution. Before merging this, let me add your sample snippet as a unit test so that the bug doesn't creep back in. As for dependency management, yes, I think there is some issue with the Makefile. I usually use CMake instead:

mkdir build
cd build
cmake .. -DR_LIB=ON
make -j10
make install

Then, after any change to the C++ code, make install will install a fresh binary into the R package directory.

@Kodiologist (Contributor, Author):
Oh, okay. You might want to add the CMake invocation to https://xgboost.readthedocs.io/en/latest/build.html#building-on-ubuntu-debian.

@slundberg (Contributor):
@hcho3 did you ever end up adding that sample snippet?

@slundberg (Contributor):
Also, it looks like the relevant code was moved to tree_model.cc since the initial PR.

@hcho3 (Collaborator) commented Dec 31, 2018:

@slundberg Sorry, I've been away for vacation. I'll rebase this PR against the latest master now.

Commit: Avoid generating NaNs in UnwoundPathSum. Kudos to Jakub Zakrzewski for tracking down the bug.
@hcho3 (Collaborator) commented Dec 31, 2018:

@Kodiologist @slundberg Okay, it looks like the rebase was successful. I'll add the test snippet soon.

@codecov-io commented Dec 31, 2018:

Codecov Report

Merging #3943 into master will decrease coverage by 0.01%.
The diff coverage is 0%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #3943      +/-   ##
==========================================
- Coverage   60.79%   60.77%   -0.02%     
==========================================
  Files         130      130              
  Lines       11699    11702       +3     
==========================================
  Hits         7112     7112              
- Misses       4587     4590       +3
Impacted Files                     Coverage Δ
src/tree/tree_model.cc             17.91% <0%> (-0.23%) ⬇️
tests/cpp/linear/test_linear.cc    100% <0%> (ø) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f368d0d...1928ffc.

@hcho3 (Collaborator) commented Jan 3, 2019:

@Kodiologist @slundberg I added the test script. It should be good to merge once all checks pass.

hcho3 merged commit 6a569b8 into dmlc:master on Jan 3, 2019
@hcho3 (Collaborator) commented Jan 3, 2019:

@Kodiologist Merged. Thanks for your contribution.

CodingCat pushed a commit to CodingCat/xgboost that referenced this pull request Jan 6, 2019
* Avoid generating NaNs in UnwoundPathSum.

Kudos to Jakub Zakrzewski for tracking down the bug.

* Add a test
@astrixg commented Feb 22, 2019:

I think this issue still persists. I upgraded to the latest pip library, v0.27.0, but the shap.TreeExplainer.shap_values() function is still generating NaN values for some features.

@hcho3 (Collaborator) commented Feb 22, 2019:

@astrixg The latest version is 0.81. Can you upgrade? If the issue persists, you should create a new issue.

@astrixg commented Feb 22, 2019:

https://pypi.org/project/shap/
https://anaconda.org/conda-forge/shap

@hcho3 It's 0.28 on pip and 0.27 on conda. Am I installing from the wrong source?

@Kodiologist (Contributor, Author):
@astrixg hcho3 is talking about the xgboost package, not shap. However, upgrading to xgboost 0.81 won't help, because that release predates this PR. To get the fix now, you need to build and install xgboost from master.

The lock bot locked this conversation as resolved and limited it to collaborators on May 23, 2019.

Successfully merging this pull request may close these issues.

NaN SHAP values from Booster.predict with pred_contribs=True
5 participants