
feat(Label_Encoder): Support Label Encoder in Multi-Target Task #29

Closed · wants to merge 1 commit into main from feature/label_encoder

Conversation

tahalpara

Description: Tried to implement Label Encoder for categorical features in a multi-target classification/regression scenario. (This is an ongoing issue.)

Changes made: three files were edited:

  1. model.py.jinja
  2. model_train.py.jinja
  3. evaluation.py.jinja
In the first two files, I added logic so that whenever a categorical (object-dtype) column is found in the features, we apply the Label Encoder, ensuring all categorical features are encoded. The proposed solution works for the multi-target scenario, but fails when XGBClassifier is used.
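For reference, a minimal sketch of that encoding logic (the function name and shape are my own illustration, not the actual template code):

from sklearn.preprocessing import LabelEncoder
import pandas as pd

def encode_categorical_features(train_df: pd.DataFrame, test_df: pd.DataFrame) -> dict:
    # Label-encode every object-dtype (categorical) column, fitting on train only.
    encoders = {}
    for col in train_df.select_dtypes(include=["object"]).columns:
        enc = LabelEncoder()
        train_df[col] = enc.fit_transform(train_df[col])
        # reuse the fitted encoder so train and test share the same mapping
        # (note: transform raises on categories unseen during fit)
        test_df[col] = enc.transform(test_df[col])
        encoders[col] = enc
    return encoders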

I have created a test experiment script, shown below:

from sapientml import SapientML
import pandas as pd
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

cls = SapientML(
    target_columns=["sex", "survived", "pclass"],  # ,"pclass","sibsp"
    task_type=None,  # suggested automatically from the target columns
    # adaptation_metric="auc"
)

train_data = pd.read_csv("https://github.com/sapientml/sapientml/files/12481088/titanic.csv")
train_data, test_data = train_test_split(train_data)
y_true = test_data[["sex", "survived", "pclass"]].reset_index(drop=True)
test_data.drop(["sex", "survived", "pclass"], axis=1, inplace=True)

cls.fit(train_data, output_dir="./outputs")
y_pred = cls.predict(test_data)

Upon running this, I get the following error:

ValueError: Invalid classes inferred from unique values of y. Expected: [0 1 2], got [1 2 3]

The above error is specific to XGBClassifier. Upon further investigation, I found that Label Encoder does not work well with the latest version of XGBClassifier, hence the issue. In this case, the solution would be to use another encoding type, such as One Hot Encoder, whenever the selected model is XGBClassifier.

Below is the reference link of the version issue:
https://stackoverflow.com/questions/71996617/invalid-classes-inferred-from-unique-values-of-y-expected-0-1-2-3-4-5-got
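For context, a minimal sketch reproducing the failure mode that link describes (my own illustration, assuming xgboost >= 1.6, which removed the built-in label encoding; the data is random and for demonstration only):

import numpy as np
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier

X = np.random.rand(30, 4)
y = np.random.choice([1, 2, 3], size=30)  # class labels not starting at 0

# XGBClassifier (xgboost >= 1.6) requires labels 0..n-1, so this line raises:
# ValueError: Invalid classes inferred from unique values of y. Expected: [0 1 2], got [1 2 3]
# XGBClassifier().fit(X, y)

# Re-encoding the target to 0..n-1 avoids the error:
y_encoded = LabelEncoder().fit_transform(y)
XGBClassifier().fit(X, y_encoded)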

In the second case, when no evaluation metric is specified, SapientML uses the F1 score as the default metric. To support multiple targets, the F1 score evaluation had to be changed: I used a for loop to go through the individual target columns and calculate each one's F1 score. This resolved the evaluation-metric error.
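A minimal sketch of that per-target evaluation (names and the macro-averaging choice are my own illustration, not necessarily what evaluation.py.jinja does):

from sklearn.metrics import f1_score

def multi_target_f1(y_true, y_pred, target_columns):
    # compute an F1 score for each target column separately
    scores = {}
    for col in target_columns:
        # macro-averaging handles multiclass targets such as pclass
        scores[col] = f1_score(y_true[col], y_pred[col], average="macro")
    # report the mean over targets alongside the per-column scores
    return sum(scores.values()) / len(scores), scores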

Further discussion is needed to decide on the course of action.

@tahalpara tahalpara force-pushed the feature/label_encoder branch from 3e08237 to 6bd5119 Compare November 6, 2023 20:57
@kimusaku kimusaku requested review from AkiraUra and kimusaku November 7, 2023 07:04
@kimusaku kimusaku requested a review from a team as a code owner November 8, 2023 00:26
@kimusaku kimusaku enabled auto-merge (squash) November 8, 2023 00:26
@AkiraUra (Contributor) left a comment


Thank you for your PR.
I have several requests. Could you consider them?

{% elif is_multioutput_classification %}
from sklearn.multioutput import MultiOutputClassifier

model = MultiOutputClassifier(model)
{% endif %}
{% set xgbclassifier = "XGBClassifier" %}
{% if model_name == xgbclassifier %}

As you mentioned, when the target is not (0, 1, 2, ...) and the model is XGBClassifier, an error is raised.
So, we want to use LabelEncoder when

  • the targets are categorical (you've implemented in this PR),

or,

  • the model is XGBClassifier.
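In plain Python, that combined check amounts to something like the sketch below (names are illustrative; in practice this would be a jinja condition in the template):

def should_label_encode_target(categorical_target_columns, model_name):
    # encode when the targets are categorical, or when the model is
    # XGBClassifier, which requires class labels to be consecutive
    # integers starting at 0
    return bool(categorical_target_columns) or model_name == "XGBClassifier"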

target_train = pd.DataFrame(label_encoder.fit_transform(target_train), columns=TARGET_COLUMNS)
{% endif %}
{% if pipeline.task.target_columns|length == 1 %}
if target_train.select_dtypes(include=['object']).columns.any():

I don't want to show this if-statement to users. I think that, in pipeline_template.py, you can use pipeline.dataset_summary["columns"][<target_column>]["meta_features"]["feature:str_catg"] ("feature:str_catg_presence" can also be used) to judge whether LabelEncoder should be used (i.e., to build str_columns in the next line).
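A hedged sketch of what that could look like in pipeline_template.py (the dictionary keys come from the comment above; the surrounding names are assumptions):

# Illustrative only: collect the target columns whose meta-features mark them
# as string-categorical, so the template receives a ready-made list.
categorical_target_columns = [
    col
    for col in pipeline.task.target_columns
    if pipeline.dataset_summary["columns"][col]["meta_features"]["feature:str_catg"]
]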

from sklearn.preprocessing import LabelEncoder
if target_train.select_dtypes(include=['object']).columns.any():
    str_columns = target_train.select_dtypes(include=['object']).columns
    label_encoder = LabelEncoder()

I think label_encoder = LabelEncoder() should be moved inside the following for loop, for consistency with the training script. That is,

label_encoders = {}
for col in str_columns:
    label_encoder = LabelEncoder()
    target_train[col] = label_encoder.fit_transform(target_train[col])
    target_test[col] = label_encoder.transform(target_test[col])
    label_encoders[col] = label_encoder

@@ -45,4 +58,4 @@ y_pred = model.classes_[np.argmax(y_pred, axis=1)].reshape(-1, 1)
{% endif %}
{% if model_name == xgbclassifier and (not pipeline.adaptation_metric.startswith("MAP_")) and (not pipeline.adaptation_metric == "LogLoss") and (pipeline.adaptation_metric not in metric_needing_predict_proba) %}
y_pred = label_encoder.inverse_transform(y_pred).reshape(-1, 1)

Could you also implement the inverse operation for multi-column targets? You would need to use the label_encoders defined above.
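A possible sketch of that inverse operation (assuming y_pred is a DataFrame whose columns match the training targets, and label_encoders is the {column: fitted encoder} dict built during training):

# Illustrative only: undo the label encoding per target column on the predictions
for col, encoder in label_encoders.items():
    y_pred[col] = encoder.inverse_transform(y_pred[col].astype(int))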

{% endif %}
{% if pipeline.task.target_columns|length == 1 %}
if target_train.select_dtypes(include=['object']).columns.any():
str_columns = target_train.select_dtypes(include=['object']).columns

str_columns should be renamed so that users can tell these are target columns, for example categorical_target_columns.

Comment on lines +46 to +58
from sklearn.preprocessing import LabelEncoder
flag = 0
if target_train.select_dtypes(include=['object']).columns.any():
    str_columns = target_train.select_dtypes(include=['object']).columns
    label_encoder = LabelEncoder()
    flag = 1
    for col in str_columns:
        target_train[col] = label_encoder.fit_transform(target_train[col])

if flag == 1:
    with open('target_LabelEncoder.pkl', 'wb') as f:
        pickle.dump(label_encoder, f)
    flag = 0

It is not preferable to show the flag variable to users. An implementation example from me is:

{% if categorical_target_columns %}
categorical_target_columns = {{ categorical_target_columns }}  # passed from pipeline_template.py
label_encoders = {}
for col in categorical_target_columns:
    label_encoder = LabelEncoder()
    target_train[col] = label_encoder.fit_transform(target_train[col])
    label_encoders[col] = label_encoder  # store each fitted encoder; otherwise the pickled dict would be empty

with open('target_LabelEncoder.pkl', 'wb') as f:
    pickle.dump(label_encoders, f)
{% endif %}

label_encoder = LabelEncoder()
for col in str_columns:
    target_train[col] = label_encoder.fit_transform(target_train[col])
    target_test[col] = label_encoder.transform(target_test[col])

You need not transform target_test, because inverse_transform should be called on the predictions instead.

@AkiraUra (Contributor) commented Nov 15, 2023

@arima-tsukasa This PR relates to your work. Please keep an eye on this PR.

codecov bot commented Feb 6, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

❗ No coverage uploaded for pull request base (main@cf5bcba).

Additional details and impacted files
@@           Coverage Diff           @@
##             main      #29   +/-   ##
=======================================
  Coverage        ?   63.78%           
=======================================
  Files           ?       36           
  Lines           ?     2850           
  Branches        ?        0           
=======================================
  Hits            ?     1818           
  Misses          ?     1032           
  Partials        ?        0           


auto-merge was automatically disabled February 8, 2024 22:13

Head branch was pushed to by a user without write access

@ihkao ihkao force-pushed the feature/label_encoder branch 2 times, most recently from 1e91358 to a55ddd1 Compare February 9, 2024 17:36
@AkiraUra AkiraUra closed this in #59 Apr 8, 2024