-
Notifications
You must be signed in to change notification settings - Fork 62
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ML] Return total SHAP per feature as a new result type #1387
Conversation
For v7.9.0 only if we manage to have a Java parser implemented timely. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made a couple of style suggestions. My main comments are I think:
- We should normalise by the document count. (You should be able to switch the map's value type to a
CBasicStatistics::SSampleMean<TVector>::TAccumulator
to do this. Although note you have to initialise this with a zero vector of the correct size.) - It would be nice to compute these quantities accurately, i.e. not just summing over the top importances for each document.
We agreed to defer this from 7.9 to 7.10, so I altered the labels. |
writer.Double(item.second(0)); | ||
} else { | ||
for (int j = 0; j < item.second.size() && j < numberClasses; ++j) { | ||
writer.Key(classValues[j]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will not work for storage in ES. This index stores the information for all trained models and indexing the class names for the feature importances will not scale.
I propose this format:
{
"feature_name": "c4",
"importance": 0.4810469375580312,
"class_importance": [
{
"class_name": "foo",
"importance": 0.24052346877901588
},
{
"class_name": "baz",
"importance": 0.19615020783390645
},
{
"class_name": "bar",
"importance": 0.04437326094510882
}
]
}
class_importance
will be a nested
data type that allows aggregations and searches for specific models and classnames.
Java side parsing: elastic/elasticsearch#59725 We will probably have to mute integration tests, merge C++ side, then unmute and merge the parsing java side. |
@valeriy42 the new format has |
@benwtrent I cannot think of anything more. |
@valeriy42 it would be nice if the per class importance looked like the regression importance. That way we have consistent JSON objects.
|
@benwtrent you are right! I overlooked at in my past commit. I'll fix it accordingly. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only a couple of minor comments and a suggestion for one additional bit of testing, which I think is worthwhile. Otherwise, LGTM.
This PR add computation of the total feature importance values.
This PR add computation of the total feature importance values. Backport of #1387.
This updates the feature_importance mapping change from elastic/ml-cpp#1387
This updates the feature_importance mapping change from elastic/ml-cpp#1387
This updates the feature_importance mapping change from elastic/ml-cpp#1387
Activate the output of the model metadata and the corresponding unit tests for total feature importance. The implementation itself was introduced in #1387 however, I need to fix the documentation, it was originally attributed to v7.10. Hence, I mark this PR as enhancement to rectify the docs.
Activate the output of the model metadata and the corresponding unit tests for total feature importance. The implementation itself was introduced in #1387 however, I need to fix the documentation, it was originally attributed to v7.10. Hence, I mark this PR as enhancement to rectify the docs. Co-authored-by: Valeriy Khakhutskyy <[email protected]>
Activate the output of the model metadata and the corresponding unit tests for total feature importance. The implementation itself was introduced in elastic#1387 however, I need to fix the documentation, it was originally attributed to v7.10. Hence, I mark this PR as enhancement to rectify the docs.
This PR add computation of the total feature importance values and outputs it as a new result type.
Example outputs:
Closes #974
EDIT: I updated the format example above.