
[SYSTEMDS-3184] Builtin for computing information gain using entropy and gini #1520

Merged: 4 commits, Feb 12, 2022

Conversation

morf1us (Contributor) commented on Jan 21, 2022

This builtin computes an impurity measure for the given dataset based on the chosen method (entropy or Gini). The current version expects the target vector to contain only 0 or 1 values, and categorical data to be encoded as positive integers. Additionally, the builtin expects a row vector R denoting which features are continuous and which are categorical. For continuous features, the current implementation applies equal-width binning.
It returns a row vector with the Gini gain or information gain for each feature. In both cases, the higher the gain, the better the split.
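
For reference (the PR description does not spell these out, but they are the standard textbook forms): for a node $S$ with class proportions $p_i$, entropy and Gini impurity are

$$H(S) = -\sum_i p_i \log_2 p_i, \qquad G(S) = 1 - \sum_i p_i^2,$$

and the gain of splitting $S$ into subsets $S_v$ is

$$\mathrm{Gain}(S) = I(S) - \sum_v \frac{|S_v|}{|S|}\, I(S_v),$$

with $I = H$ for information gain and $I = G$ for Gini gain.

A minimal DML sketch of equal-width binning for one continuous column (hypothetical variable names, not the PR's actual code):

```
# equal-width binning of a continuous column x into n_bins bins
x = rand(rows=100, cols=1, min=0, max=10)   # stand-in data
n_bins = 5
w = (max(x) - min(x)) / n_bins              # bin width
b = ceil((x - min(x)) / w)                  # bin index per row
b = max(b, 1)                               # map the minimum value to bin 1
```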

j143 (Contributor) commented on Jan 22, 2022

Hi @morf1us - thanks a lot for the contribution. 😸

  1. How about adding usage instructions to builtins-reference.md? You can also add LaTeX formulas generously in the description, if you want to.
  2. Testing seems fine.

Keep working on finalizing the bin-related changes.

morf1us (Contributor, Author) commented on Feb 8, 2022

Hi @j143, thanks for reviewing!

I added usage instructions and also added some more tests.

j143 self-requested a review on February 9, 2022
j143 (Contributor) commented on Feb 10, 2022

Hi @morf1us - just curious: did you notice the impurity-measures logic in decisionTree.dml? Is the implementation in this PR the same as that one?

Docs are here: https://apache.github.io/systemds/site/algorithms-classification.html#decision-trees

```
calcGiniImpurity = function(Double num_true, Double num_false) return (Double impurity) {
  prop_true = num_true / (num_true + num_false)
  prop_false = num_false / (num_true + num_false)
  impurity = 1 - (prop_true ^ 2) - (prop_false ^ 2)
}

calcImpurity = function(
    Matrix[Double] X,
    Matrix[Double] Y,
    Matrix[Double] use_rows_vector,
    Double col,
    Double type,
    int bins) return (Double impurity, Matrix[Double] threshold) {
  is_scalar_type = typeIsScalar(type)
  if (is_scalar_type) {
    possible_thresholds = calcPossibleThresholdsScalar(X, use_rows_vector, col, bins)
  } else {
    possible_thresholds = calcPossibleThresholdsCategory(type)
  }
  len_thresholds = ncol(possible_thresholds)
  impurity = 1
  threshold = matrix(0, rows=1, cols=1)
  for (index in 1:len_thresholds) {
    [false_rows, true_rows] = splitRowsVector(X, use_rows_vector, col, possible_thresholds[, index], type)
    num_true_positive = 0; num_false_positive = 0; num_true_negative = 0; num_false_negative = 0
    len = dataVectorLength(use_rows_vector)
    for (c_row in 1:len) {
      true_row_data = dataVectorGet(true_rows, c_row)
      false_row_data = dataVectorGet(false_rows, c_row)
      if (true_row_data != 0 & false_row_data == 0) { # IT'S POSITIVE!
        if (as.scalar(Y[c_row, 1]) != 0) {
          num_true_positive = num_true_positive + 1
        } else {
          num_false_positive = num_false_positive + 1
        }
      } else if (true_row_data == 0 & false_row_data != 0) { # IT'S NEGATIVE
        if (as.scalar(Y[c_row, 1]) != 0.0) {
          num_false_negative = num_false_negative + 1
        } else {
          num_true_negative = num_true_negative + 1
        }
      }
    }
    impurity_positive_branch = calcGiniImpurity(num_true_positive, num_false_positive)
    impurity_negative_branch = calcGiniImpurity(num_true_negative, num_false_negative)
    num_samples = num_true_positive + num_false_positive + num_true_negative + num_false_negative
    num_negative = num_true_negative + num_false_negative
    num_positive = num_true_positive + num_false_positive
    c_impurity = num_positive / num_samples * impurity_positive_branch + num_negative / num_samples * impurity_negative_branch
    if (c_impurity <= impurity) {
      impurity = c_impurity
      threshold = possible_thresholds[, index]
    }
  }
}

calcBestSplittingCriteria = function(
    Matrix[Double] X,
    Matrix[Double] Y,
    Matrix[Double] R,
    Matrix[Double] use_rows_vector,
    Matrix[Double] use_cols_vector,
    int bins) return (Double impurity, Double used_col, Matrix[Double] threshold, Double type) {
  impurity = 1
  used_col = 1
  threshold = matrix(0, 1, 1)
  type = 1
  # -- user-defined function calls not supported for iterable predicates
  len = dataVectorLength(use_cols_vector)
  for (c_col in 1:len) {
    use_feature = dataVectorGet(use_cols_vector, c_col)
    if (use_feature != 0) {
      c_type = getTypeOfCol(R, c_col)
      [c_impurity, c_threshold] = calcImpurity(X, Y, use_rows_vector, c_col, c_type, bins)
      if (c_impurity <= impurity) {
        impurity = c_impurity
        used_col = c_col
        threshold = c_threshold
        type = c_type
      }
    }
  }
}
```
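
As a quick sanity check, a minimal, hypothetical call to calcGiniImpurity (assuming the function definition above is in scope in the same script):

```
# a node with 30 samples of one class and 10 of the other
impurity = calcGiniImpurity(30, 10)
# prop_true = 0.75, prop_false = 0.25
# impurity  = 1 - 0.75^2 - 0.25^2 = 0.375
print("gini impurity: " + impurity)
```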

j143 (Contributor) left a review comment


Thank you.

The code looks good, but we need to check whether the implementation is efficient (with edge cases considered) compared to the one in scripts/builtin/decisionTree.dml.

I will have a look at the other code shortly.

morf1us (Contributor, Author) commented on Feb 11, 2022

Hi @j143, thanks for taking the time. Yes, it is quite similar. I was aware of both the decisionTree and randomForest implementations before starting.

j143 (Contributor) commented on Feb 11, 2022

> Yes, it is quite similar. I was aware of both the decisionTree and randomForest implementations before starting.

Yes, eventually we need to use the impurity measures inside the scripts.

j143 (Contributor) commented on Feb 12, 2022

Thank you, @morf1us - LGTM. 👍
🎉 🚀

We can work on using these measure functions inside the decisionTree scripts later. Perhaps you would like to take that on?

j143 merged commit 7c3cc82 into apache:main on Feb 12, 2022