
[SYSTEMDS-3184] Builtin for computing information gain using entropy and gini #1520

Merged: 4 commits, Feb 12, 2022

Conversation

morf1us (Contributor) commented on Jan 21, 2022

This builtin computes an impurity measure for the given dataset based on the chosen method (entropy or Gini). The current version expects the target vector to contain only 0 or 1 values, and categorical data to be encoded as positive integers. Additionally, the builtin expects a row vector R denoting which features are continuous and which are categorical. For continuous features, the current implementation applies equal-width binning.
It returns a row vector with the Gini gain or information gain for each feature. In both cases, the higher the gain, the better the split.
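
For reference (the PR description does not spell these out, but they are the standard textbook forms): for a node $S$ with class proportions $p_i$, entropy and Gini impurity are

$$H(S) = -\sum_i p_i \log_2 p_i, \qquad G(S) = 1 - \sum_i p_i^2,$$

and the gain of splitting $S$ into subsets $S_v$ is

$$\mathrm{Gain}(S) = I(S) - \sum_v \frac{|S_v|}{|S|}\, I(S_v),$$

with $I = H$ for information gain and $I = G$ for Gini gain.

A minimal DML sketch of equal-width binning for one continuous column (hypothetical variable names, not the PR's actual code):

```
# equal-width binning of a continuous column x into n_bins bins
x = rand(rows=100, cols=1, min=0, max=10)   # stand-in data
n_bins = 5
w = (max(x) - min(x)) / n_bins              # bin width
b = ceil((x - min(x)) / w)                  # bin index per row
b = max(b, 1)                               # map the minimum value to bin 1
```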

j143 (Contributor) commented on Jan 22, 2022

Hi @morf1us - thanks a lot for the contribution. 😸

  1. How about adding usage instructions to builtins-reference.md? You can also add LaTeX formulas generously in the description, if you want to.
  2. Testing seems fine.

Keep working on finalizing the bin-related changes.

morf1us (Contributor, Author) commented on Feb 8, 2022

Hi @j143, thanks for reviewing!

I added usage instructions and also added some more tests.

j143 self-requested a review on February 9, 2022
j143 (Contributor) commented on Feb 10, 2022

Hi @morf1us - just curious: did you notice the impurity-measures logic in decisionTree.dml? Is the implementation in this PR the same as that one?

Docs are here: https://apache.github.io/systemds/site/algorithms-classification.html#decision-trees

```
calcGiniImpurity = function(Double num_true, Double num_false) return (Double impurity) {
  prop_true = num_true / (num_true + num_false)
  prop_false = num_false / (num_true + num_false)
  impurity = 1 - (prop_true ^ 2) - (prop_false ^ 2)
}

calcImpurity = function(
    Matrix[Double] X,
    Matrix[Double] Y,
    Matrix[Double] use_rows_vector,
    Double col,
    Double type,
    int bins) return (Double impurity, Matrix[Double] threshold) {
  is_scalar_type = typeIsScalar(type)
  if (is_scalar_type) {
    possible_thresholds = calcPossibleThresholdsScalar(X, use_rows_vector, col, bins)
  } else {
    possible_thresholds = calcPossibleThresholdsCategory(type)
  }
  len_thresholds = ncol(possible_thresholds)
  impurity = 1
  threshold = matrix(0, rows=1, cols=1)
  for (index in 1:len_thresholds) {
    [false_rows, true_rows] = splitRowsVector(X, use_rows_vector, col, possible_thresholds[, index], type)
    num_true_positive = 0; num_false_positive = 0; num_true_negative = 0; num_false_negative = 0
    len = dataVectorLength(use_rows_vector)
    for (c_row in 1:len) {
      true_row_data = dataVectorGet(true_rows, c_row)
      false_row_data = dataVectorGet(false_rows, c_row)
      if (true_row_data != 0 & false_row_data == 0) { # IT'S POSITIVE!
        if (as.scalar(Y[c_row, 1]) != 0) {
          num_true_positive = num_true_positive + 1
        } else {
          num_false_positive = num_false_positive + 1
        }
      } else if (true_row_data == 0 & false_row_data != 0) { # IT'S NEGATIVE
        if (as.scalar(Y[c_row, 1]) != 0.0) {
          num_false_negative = num_false_negative + 1
        } else {
          num_true_negative = num_true_negative + 1
        }
      }
    }
    impurity_positive_branch = calcGiniImpurity(num_true_positive, num_false_positive)
    impurity_negative_branch = calcGiniImpurity(num_true_negative, num_false_negative)
    num_samples = num_true_positive + num_false_positive + num_true_negative + num_false_negative
    num_negative = num_true_negative + num_false_negative
    num_positive = num_true_positive + num_false_positive
    c_impurity = num_positive / num_samples * impurity_positive_branch + num_negative / num_samples * impurity_negative_branch
    if (c_impurity <= impurity) {
      impurity = c_impurity
      threshold = possible_thresholds[, index]
    }
  }
}

calcBestSplittingCriteria = function(
    Matrix[Double] X,
    Matrix[Double] Y,
    Matrix[Double] R,
    Matrix[Double] use_rows_vector,
    Matrix[Double] use_cols_vector,
    int bins) return (Double impurity, Double used_col, Matrix[Double] threshold, Double type) {
  impurity = 1
  used_col = 1
  threshold = matrix(0, 1, 1)
  type = 1
  # -- user-defined function calls not supported for iterable predicates
  len = dataVectorLength(use_cols_vector)
  for (c_col in 1:len) {
    use_feature = dataVectorGet(use_cols_vector, c_col)
    if (use_feature != 0) {
      c_type = getTypeOfCol(R, c_col)
      [c_impurity, c_threshold] = calcImpurity(X, Y, use_rows_vector, c_col, c_type, bins)
      if (c_impurity <= impurity) {
        impurity = c_impurity
        used_col = c_col
        threshold = c_threshold
        type = c_type
      }
    }
  }
}
```
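
As a quick sanity check, a minimal, hypothetical call to calcGiniImpurity (assuming the function definition above is in scope in the same script):

```
# a node with 30 samples of one class and 10 of the other
impurity = calcGiniImpurity(30, 10)
# prop_true = 0.75, prop_false = 0.25
# impurity  = 1 - 0.75^2 - 0.25^2 = 0.375
print("gini impurity: " + impurity)
```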

j143 (Contributor) left a review comment


Thank you.

The code looks good, but we need to check whether the implementation is efficient (with edge cases considered) compared to the one in scripts/builtin/decisionTree.dml.

I will have a look at the other code shortly.

morf1us (Contributor, Author) commented on Feb 11, 2022

Hi @j143, thanks for taking the time. Yes, it is quite similar. I was aware of both the decisionTree and randomForest implementations before starting.

j143 (Contributor) commented on Feb 11, 2022

> Yes, it is quite similar. I was aware of both the decisionTree and randomForest implementations before starting.

Yes, eventually we need to use the impurity measures inside the scripts.

j143 (Contributor) commented on Feb 12, 2022

Thank you, @morf1us - LGTM. 👍
🎉 🚀

We can work on using these measure functions inside the decisionTree scripts later. Perhaps you would like to take that on?

j143 merged commit 7c3cc82 into apache:main on Feb 12, 2022