feat(Label_Encoder): Support of Label Encoder in Multi Target Task #29
Conversation
Force-pushed from 3e08237 to 6bd5119
Thank you for your PR.
I have several requests. Could you consider them?
```
{% elif is_multioutput_classification %}
from sklearn.multioutput import MultiOutputClassifier

model = MultiOutputClassifier(model)
{% endif %}
{% set xgbclassifier = "XGBClassifier" %}
{% if model_name == xgbclassifier %}
```
As you mentioned, when the target is not (0, 1, 2, ...) and the model is XGBClassifier, an error is raised. So, we want to use LabelEncoder when
- the targets are categorical (you've implemented this in this PR), or
- the model is XGBClassifier (see the sketch after this list).
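A minimal sketch of that combined condition as it might be computed in pipeline_template.py (the names `categorical_target_columns` and `model_name` are assumptions for illustration, not the actual API):

```python
# Hypothetical sketch: decide whether the generated script should apply a
# LabelEncoder to the targets. XGBClassifier requires class labels encoded as
# 0..n_classes-1, so it needs encoding even for numeric, non-zero-based targets.
needs_label_encoder = bool(categorical_target_columns) or model_name == "XGBClassifier"
```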
```
target_train = pd.DataFrame(label_encoder.fit_transform(target_train), columns=TARGET_COLUMNS)
{% endif %}
{% if pipeline.task.target_columns|length == 1 %}
if target_train.select_dtypes(include=['object']).columns.any():
```
I don't want to show this if-statement to users. I think that, in pipeline_template.py, you can use `pipeline.dataset_summary["columns"][<target_column>]["meta_features"]["feature:str_catg"]` (`"feature:str_catg_presence"` can also be used) to judge whether LabelEncoder should be used or not (i.e., to build `str_columns` in the next line).
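For instance, a sketch of building that list in pipeline_template.py, assuming the meta-feature value is truthy for string-categorical columns:

```python
# Hypothetical sketch: collect the target columns whose meta-features mark them
# as string-categorical; the template then emits LabelEncoder code only for these.
categorical_target_columns = [
    col
    for col in pipeline.task.target_columns
    if pipeline.dataset_summary["columns"][col]["meta_features"].get("feature:str_catg")
]
```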
```
from sklearn.preprocessing import LabelEncoder
if target_train.select_dtypes(include=['object']).columns.any():
    str_columns = target_train.select_dtypes(include=['object']).columns
    label_encoder= LabelEncoder()
```
I think `label_encoder = LabelEncoder()` should be moved into the for loop, for consistency with the training script. That is:

```
label_encoders = {}
for col in str_columns:
    label_encoder = LabelEncoder()
    target_train[col] = label_encoder.fit_transform(target_train[col])
    target_test[col] = label_encoder.transform(target_test[col])
    label_encoders[col] = label_encoder
```
```
@@ -45,4 +58,4 @@
y_pred = model.classes_[np.argmax(y_pred, axis=1)].reshape(-1, 1)
{% endif %}
{% if model_name == xgbclassifier and (not pipeline.adaptation_metric.startswith("MAP_")) and (not pipeline.adaptation_metric == "LogLoss") and (pipeline.adaptation_metric not in metric_needing_predict_proba) %}
y_pred = label_encoder.inverse_transform(y_pred).reshape(-1, 1)
```
Could you also implement the inverse operation for the multi-column targets? You would need to use the `label_encoders` defined above; see the sketch below.
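A sketch of that inverse step, assuming `y_pred` is wrapped as a DataFrame over `TARGET_COLUMNS` and `label_encoders` maps each encoded column to its fitted encoder:

```python
import pandas as pd

# Hypothetical sketch: map each encoded target column back to its original labels.
y_pred = pd.DataFrame(y_pred, columns=TARGET_COLUMNS)
for col, encoder in label_encoders.items():
    y_pred[col] = encoder.inverse_transform(y_pred[col].astype(int))
```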
```
{% endif %}
{% if pipeline.task.target_columns|length == 1 %}
if target_train.select_dtypes(include=['object']).columns.any():
    str_columns = target_train.select_dtypes(include=['object']).columns
```
`str_columns` should be renamed so that users can tell these are target columns. For example, `categorical_target_columns` and so on.
```
from sklearn.preprocessing import LabelEncoder
flag=0
if target_train.select_dtypes(include=['object']).columns.any():
    str_columns = target_train.select_dtypes(include=['object']).columns
    label_encoder= LabelEncoder()
    flag=1
    for col in str_columns:
        target_train[col] = label_encoder.fit_transform(target_train[col])

if flag==1:
    with open('target_LabelEncoder.pkl', 'wb') as f:
        pickle.dump(label_encoder, f)
    flag=0
```
It is not preferable to show the `flag` variable to users. An implementation example from me is:

```
{% if categorical_target_columns %}
categorical_target_columns = {{ categorical_target_columns }}  # passed from pipeline_template.py
label_encoders = {}
for col in categorical_target_columns:
    label_encoder = LabelEncoder()
    target_train[col] = label_encoder.fit_transform(target_train[col])
    label_encoders[col] = label_encoder
with open('target_LabelEncoder.pkl', 'wb') as f:
    pickle.dump(label_encoders, f)
{% endif %}
```
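The prediction script would then need the matching load step; a minimal sketch:

```python
import pickle

# Load the per-column encoders saved by the training script, so predictions
# can be mapped back to the original labels via inverse_transform.
with open('target_LabelEncoder.pkl', 'rb') as f:
    label_encoders = pickle.load(f)
```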
```
label_encoder= LabelEncoder()
for col in str_columns:
    target_train[col] = label_encoder.fit_transform(target_train[col])
    target_test[col] = label_encoder.transform(target_test[col])
```
You need not transform `target_test`, because `inverse_transform` should be called on the predictions instead.
@arima-tsukasa This PR relates to your work. Please follow this PR as much as possible.
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

```
@@ Coverage Diff @@
##             main      #29   +/- ##
=======================================
  Coverage        ?   63.78%
=======================================
  Files           ?       36
  Lines           ?     2850
  Branches        ?        0
=======================================
  Hits            ?     1818
  Misses          ?     1032
  Partials        ?        0
```

☔ View full report in Codecov by Sentry.
Feature Added

Signed-off-by: Trishala Ahalpara <[email protected]>

Head branch was pushed to by a user without write access.

Force-pushed from 1e91358 to a55ddd1
Description: Tried to implement Label Encoder for categorical targets in a multi-target classification/regression scenario. (This is an ongoing issue.)

Changes Made: Three files were edited.

In the first case, I added logic so that whenever a categorical (object) column is found in the targets, the Label Encoder is applied, ensuring all the categorical columns are encoded. The proposed solution works for the multi-target scenario, but fails when XGBClassifier is used.
I have created a test experiment script as below
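A minimal example of this failure mode, assuming a recent xgboost version (the actual script may differ; the data here is synthetic):

```python
import numpy as np
from xgboost import XGBClassifier

X = np.random.rand(9, 4)
y = np.array([1, 2, 3] * 3)  # classes are {1, 2, 3}, deliberately not zero-based

# Recent xgboost versions no longer re-encode labels internally, so fit()
# raises the ValueError quoted below.
XGBClassifier().fit(X, y)
```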
Upon running this, I get the error below:

```
ValueError: Invalid classes inferred from unique values of `y`. Expected: [0 1 2], got [1 2 3]
```
The above error is specific to XGBClassifier. Upon further investigation, I found that Label Encoder does not work well with the latest version of XGBClassifier, hence the issue. In this case, the solution would be to use another encoding type, such as One Hot Encoder, when the selected model is XGBClassifier.

Below is a reference link about the version issue:
https://stackoverflow.com/questions/71996617/invalid-classes-inferred-from-unique-values-of-y-expected-0-1-2-3-4-5-got
In the second case, when the evaluation metric is not specified, SapientML considers the F1 score as the default metric. To support multiple targets, the F1 score evaluation had to be changed. I used a for loop to go through the individual target columns and calculate each one's F1 score. In this way I was able to resolve the evaluation metric error.
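A sketch of that loop (assuming `y_pred` is a 2-D array aligned with `TARGET_COLUMNS`):

```python
from sklearn.metrics import f1_score

# Hypothetical sketch: compute F1 per target column, then average across targets.
per_column_f1 = [
    f1_score(target_test[col], y_pred[:, i], average="macro")
    for i, col in enumerate(TARGET_COLUMNS)
]
f1 = sum(per_column_f1) / len(per_column_f1)
```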
Further discussion and evaluation of the course of action are needed.