
feat(Label_Encoder): Support Label Encoder in Multi-Target Task #29

Closed · wants to merge 1 commit into main from feature/label_encoder

Conversation

tahalpara

Description: Tried to implement Label Encoder for categorical features in a multi-target classification/regression scenario. (This is an ongoing issue.)

Changes made: three files were edited:

  1. model.py.jinja
  2. model_train.py.jinja
  3. evaluation.py.jinja
In the first two files, I added logic so that whenever a categorical (object-dtype) column is found in the features, we apply the Label Encoder, ensuring all categorical features are encoded. The proposed solution works for the multi-target scenario, but fails when XGBClassifier is used.
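For reference, a minimal sketch of that encoding logic (the function name and shape are my own illustration, not the actual template code):

from sklearn.preprocessing import LabelEncoder
import pandas as pd

def encode_categorical_features(train_df: pd.DataFrame, test_df: pd.DataFrame) -> dict:
    # Label-encode every object-dtype (categorical) column, fitting on train only.
    encoders = {}
    for col in train_df.select_dtypes(include=["object"]).columns:
        enc = LabelEncoder()
        train_df[col] = enc.fit_transform(train_df[col])
        # reuse the fitted encoder so train and test share the same mapping
        # (note: transform raises on categories unseen during fit)
        test_df[col] = enc.transform(test_df[col])
        encoders[col] = enc
    return encoders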

I have created a test experiment script, shown below:

from sapientml import SapientML
import pandas as pd
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

cls = SapientML(
    target_columns=["sex", "survived", "pclass"],  # ,"pclass","sibsp"
    task_type=None,  # suggested automatically from the target columns
    # adaptation_metric="auc"
)

train_data = pd.read_csv("https://github.com/sapientml/sapientml/files/12481088/titanic.csv")
train_data, test_data = train_test_split(train_data)
y_true = test_data[["sex", "survived", "pclass"]].reset_index(drop=True)
test_data.drop(["sex", "survived", "pclass"], axis=1, inplace=True)

cls.fit(train_data, output_dir="./outputs")
y_pred = cls.predict(test_data)

Upon running this, I get the following error:

ValueError: Invalid classes inferred from unique values of y. Expected: [0 1 2], got [1 2 3]

The above error is specific to XGBClassifier. Upon further investigation, I found that Label Encoder does not work well with the latest version of XGBClassifier, hence the issue. In this case, the solution would be to use another encoding type, such as One Hot Encoder, whenever the selected model is XGBClassifier.

Below is the reference link of the version issue:
https://stackoverflow.com/questions/71996617/invalid-classes-inferred-from-unique-values-of-y-expected-0-1-2-3-4-5-got
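For context, a minimal sketch reproducing the failure mode that link describes (my own illustration, assuming xgboost >= 1.6, which removed the built-in label encoding; the data is random and for demonstration only):

import numpy as np
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier

X = np.random.rand(30, 4)
y = np.random.choice([1, 2, 3], size=30)  # class labels not starting at 0

# XGBClassifier (xgboost >= 1.6) requires labels 0..n-1, so this line raises:
# ValueError: Invalid classes inferred from unique values of y. Expected: [0 1 2], got [1 2 3]
# XGBClassifier().fit(X, y)

# Re-encoding the target to 0..n-1 avoids the error:
y_encoded = LabelEncoder().fit_transform(y)
XGBClassifier().fit(X, y_encoded)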

In the second case, when no evaluation metric is specified, SapientML uses the F1 score as the default metric. To support multiple targets, the F1 score evaluation had to be changed: I used a for loop to go through the individual target columns and calculate each one's F1 score. This resolved the evaluation-metric error.
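A minimal sketch of that per-target evaluation (names and the macro-averaging choice are my own illustration, not necessarily what evaluation.py.jinja does):

from sklearn.metrics import f1_score

def multi_target_f1(y_true, y_pred, target_columns):
    # compute an F1 score for each target column separately
    scores = {}
    for col in target_columns:
        # macro-averaging handles multiclass targets such as pclass
        scores[col] = f1_score(y_true[col], y_pred[col], average="macro")
    # report the mean over targets alongside the per-column scores
    return sum(scores.values()) / len(scores), scores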

Further discussion is needed to decide on the course of action.

@tahalpara tahalpara force-pushed the feature/label_encoder branch from 3e08237 to 6bd5119 Compare November 6, 2023 20:57
@kimusaku kimusaku requested review from AkiraUra and kimusaku November 7, 2023 07:04
@kimusaku kimusaku requested a review from a team as a code owner November 8, 2023 00:26
@kimusaku kimusaku enabled auto-merge (squash) November 8, 2023 00:26
@AkiraUra (Contributor) left a comment


Thank you for your PR.
I have several requests. Could you consider them?

{% elif is_multioutput_classification %}
from sklearn.multioutput import MultiOutputClassifier

model = MultiOutputClassifier(model)
{% endif %}
{% set xgbclassifier = "XGBClassifier" %}
{% if model_name == xgbclassifier %}

As you mentioned, when the target is not (0, 1, 2, ...) and the model is XGBClassifier, an error is raised.
So, we want to use LabelEncoder when

  • the targets are categorical (you've implemented in this PR),

or,

  • the model is XGBClassifier.
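In plain Python, that combined check amounts to something like the sketch below (names are illustrative; in practice this would be a jinja condition in the template):

def should_label_encode_target(categorical_target_columns, model_name):
    # encode when the targets are categorical, or when the model is
    # XGBClassifier, which requires class labels to be consecutive
    # integers starting at 0
    return bool(categorical_target_columns) or model_name == "XGBClassifier"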

target_train = pd.DataFrame(label_encoder.fit_transform(target_train), columns=TARGET_COLUMNS)
{% endif %}
{% if pipeline.task.target_columns|length == 1 %}
if target_train.select_dtypes(include=['object']).columns.any():

I don't want to show this if-statement to users. I think that, in pipeline_template.py, you can use pipeline.dataset_summary["columns"][<target_column>]["meta_features"]["feature:str_catg"] ("feature:str_catg_presence" can also be used) to judge whether LabelEncoder should be used (i.e., to build str_columns in the next line).
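A hedged sketch of what that could look like in pipeline_template.py (the dictionary keys come from the comment above; the surrounding names are assumptions):

# Illustrative only: collect the target columns whose meta-features mark them
# as string-categorical, so the template receives a ready-made list.
categorical_target_columns = [
    col
    for col in pipeline.task.target_columns
    if pipeline.dataset_summary["columns"][col]["meta_features"]["feature:str_catg"]
]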

from sklearn.preprocessing import LabelEncoder
if target_train.select_dtypes(include=['object']).columns.any():
    str_columns = target_train.select_dtypes(include=['object']).columns
    label_encoder = LabelEncoder()

I think label_encoder = LabelEncoder() should be moved inside the following for loop, for consistency with the training script. That is,

label_encoders = {}
for col in str_columns:
    label_encoder = LabelEncoder()
    target_train[col] = label_encoder.fit_transform(target_train[col])
    target_test[col] = label_encoder.transform(target_test[col])
    label_encoders[col] = label_encoder

@@ -45,4 +58,4 @@ y_pred = model.classes_[np.argmax(y_pred, axis=1)].reshape(-1, 1)
{% endif %}
{% if model_name == xgbclassifier and (not pipeline.adaptation_metric.startswith("MAP_")) and (not pipeline.adaptation_metric == "LogLoss") and (pipeline.adaptation_metric not in metric_needing_predict_proba) %}
y_pred = label_encoder.inverse_transform(y_pred).reshape(-1, 1)

Could you also implement the inverse operation for multi-column targets? You would need to use the label_encoders defined above.
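A possible sketch of that inverse operation (assuming y_pred is a DataFrame whose columns match the training targets, and label_encoders is the {column: fitted encoder} dict built during training):

# Illustrative only: undo the label encoding per target column on the predictions
for col, encoder in label_encoders.items():
    y_pred[col] = encoder.inverse_transform(y_pred[col].astype(int))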

{% endif %}
{% if pipeline.task.target_columns|length == 1 %}
if target_train.select_dtypes(include=['object']).columns.any():
str_columns = target_train.select_dtypes(include=['object']).columns

str_columns should be renamed so that users can tell these are target columns, for example categorical_target_columns.

Comment on lines +46 to +58
from sklearn.preprocessing import LabelEncoder
flag = 0
if target_train.select_dtypes(include=['object']).columns.any():
    str_columns = target_train.select_dtypes(include=['object']).columns
    label_encoder = LabelEncoder()
    flag = 1
    for col in str_columns:
        target_train[col] = label_encoder.fit_transform(target_train[col])

if flag == 1:
    with open('target_LabelEncoder.pkl', 'wb') as f:
        pickle.dump(label_encoder, f)
    flag = 0

It is not preferable to show the flag variable to users. An implementation example from me is:

{% if categorical_target_columns %}
categorical_target_columns = {{ categorical_target_columns }}  # passed from pipeline_template.py
label_encoders = {}
for col in categorical_target_columns:
    label_encoder = LabelEncoder()
    target_train[col] = label_encoder.fit_transform(target_train[col])
    label_encoders[col] = label_encoder  # store each fitted encoder; otherwise the pickled dict would be empty

with open('target_LabelEncoder.pkl', 'wb') as f:
    pickle.dump(label_encoders, f)
{% endif %}

label_encoder = LabelEncoder()
for col in str_columns:
    target_train[col] = label_encoder.fit_transform(target_train[col])
    target_test[col] = label_encoder.transform(target_test[col])

You need not transform target_test, because inverse_transform should be called on the predictions instead.

@AkiraUra (Contributor) commented Nov 15, 2023

@arima-tsukasa This PR relates to your work. Please keep an eye on this PR.

codecov bot commented Feb 6, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

❗ No coverage uploaded for pull request base (main@cf5bcba).

Additional details and impacted files
@@           Coverage Diff           @@
##             main      #29   +/-   ##
=======================================
  Coverage        ?   63.78%           
=======================================
  Files           ?       36           
  Lines           ?     2850           
  Branches        ?        0           
=======================================
  Hits            ?     1818           
  Misses          ?     1032           
  Partials        ?        0           


auto-merge was automatically disabled February 8, 2024 22:13

Head branch was pushed to by a user without write access

@ihkao ihkao force-pushed the feature/label_encoder branch 2 times, most recently from 1e91358 to a55ddd1 Compare February 9, 2024 17:36
@AkiraUra AkiraUra closed this in #59 Apr 8, 2024