[SYSTEMDS-3184] Builtin for computing information gain using entropy and gini #1520
Conversation
Hi @morf1us - thanks a lot for the contribution. 😸
Keep working on finalizing it.
Hi @j143, thanks for reviewing! I added usage instructions and also added some more tests.
Hi @morf1us - just curious: did you notice the impurity measures logic in decisionTree.dml? Is the implementation in this PR the same as that one? Docs are here: https://apache.github.io/systemds/site/algorithms-classification.html#decision-trees (see systemds/scripts/builtin/decisionTree.dml, lines 264 to 348 at b8d4897).
Thank you.
The code looks good, but we need to check whether the implementation is efficient (with edge cases considered) compared to the one implemented in scripts/builtin/decisionTree.dml.
I will have a look at the other code shortly.
Hi @j143, thanks for taking the time. Yes, it is quite similar. I was aware of both the decisionTree and randomForest implementations before starting.
Yes, eventually we need to use the impurity measures inside the scripts.
Thank you, @morf1us - LGTM. 👍 We can work on using these measure functions inside the decisionTree scripts later. Would you like to take that on?
This builtin computes a measure of impurity for the given dataset based on the specified method (entropy or Gini). The current version expects the target vector to contain only 0 or 1 values and categorical features to be encoded as positive integers. Additionally, the builtin expects a row vector R indicating which features are continuous and which are categorical. For continuous features, the current implementation applies equal-width binning, as sketched below.
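As a rough illustration of equal-width binning, here is a minimal NumPy sketch (not the DML code in this PR; the bin count `n_bins` is an assumed parameter for illustration):

```python
import numpy as np

def equal_width_bins(x, n_bins=10):
    """Map a continuous feature vector to bin indices 1..n_bins
    using equal-width intervals over [min(x), max(x)]."""
    edges = np.linspace(x.min(), x.max(), n_bins + 1)
    # digitizing against the interior edges yields indices 0..n_bins-1
    return np.digitize(x, edges[1:-1]) + 1

x = np.array([0.1, 0.4, 0.5, 2.3, 7.9, 8.0])
print(equal_width_bins(x, n_bins=4))  # [1 1 1 2 4 4]
```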
The builtin returns a row vector with the Gini gain or information gain for each feature. In both cases, the higher the gain, the better the split.
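For the gain itself, the following NumPy sketch shows what information gain and Gini gain look like for a 0/1 target and a single categorical (or already binned) feature; it illustrates the measures conceptually and is not the builtin's DML implementation:

```python
import numpy as np

def entropy(y):
    # Shannon entropy of a 0/1 target vector
    p = y.mean()
    return 0.0 if p in (0.0, 1.0) else -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def gini(y):
    # Gini impurity of a 0/1 target vector
    p = y.mean()
    return 1.0 - (p ** 2 + (1 - p) ** 2)

def gain(feature, y, impurity):
    # gain = impurity(parent) - sum_v (|S_v|/|S|) * impurity(S_v),
    # partitioning the rows on the distinct feature values v
    weighted = sum((feature == v).mean() * impurity(y[feature == v])
                   for v in np.unique(feature))
    return impurity(y) - weighted

feature = np.array([1, 1, 2, 2, 3, 3, 3, 1])  # categorical, positive integers
y       = np.array([0, 0, 1, 1, 1, 0, 1, 0])  # binary target

print("information gain:", gain(feature, y, entropy))
print("Gini gain:       ", gain(feature, y, gini))
```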