While predicting the model doesn't check if data dtypes have changed #3626

Closed
sbushmanov opened this issue Dec 4, 2020 · 6 comments

@sbushmanov

Summary

Suppose we train a model on a pandas DataFrame in which some of the features are defined as categorical. If we then feed it a numpy array at prediction time, the model silently accepts the array but produces different (presumably wrong) results. It would be nice to have:

  1. A check that the input dtypes are the same as at training time.
  2. An error message if the input dtypes have changed.
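To make the request concrete, here is a minimal user-side sketch of such a check. It is not part of LightGBM: `record_dtypes` and `check_dtypes` are hypothetical helper names, and the toy frame only stands in for the training data.

```python
import pandas as pd


def record_dtypes(X: pd.DataFrame) -> dict:
    """Remember column names and dtypes as seen at training time."""
    return {col: str(dtype) for col, dtype in X.dtypes.items()}


def check_dtypes(X, expected: dict) -> None:
    """Raise TypeError unless X is a DataFrame with the recorded dtypes."""
    if not isinstance(X, pd.DataFrame):
        raise TypeError(
            "expected a pandas DataFrame like the training data, "
            f"got {type(X).__name__}"
        )
    actual = record_dtypes(X)
    # Collect every column whose dtype is missing or differs from training.
    changed = {
        col: (expected[col], actual.get(col))
        for col in expected
        if actual.get(col) != expected[col]
    }
    if changed:
        raise TypeError(f"dtypes differ from training time: {changed}")


# Toy stand-in for the training frame (hypothetical data).
train = pd.DataFrame({"age": [22.0, 38.0], "sex": pd.Categorical([1, 0])})
schema = record_dtypes(train)

check_dtypes(train.iloc[:1], schema)  # same dtypes: passes silently
try:
    check_dtypes(train.iloc[:1].values, schema)  # numpy array: rejected
except TypeError as err:
    print(err)
```

Calling `check_dtypes` right before `predict_proba` would turn the silent mismatch into an explicit error, at the cost of refusing inputs that are not DataFrames.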

Train demo:

# Only the imports this demo actually uses
from seaborn import load_dataset
from sklearn.model_selection import train_test_split
from lightgbm import LGBMClassifier

titanic = load_dataset("titanic")
X = titanic.drop(["survived", "alive", "adult_male", "who", "deck"], axis=1)
y = titanic["survived"]

features = X.columns
cat_features = []
for cat in X.select_dtypes(exclude="number"):
    cat_features.append(cat)
    X[cat] = X[cat].astype("category").cat.codes.astype("category")

X_train, X_val, y_train, y_val = train_test_split(X,y,train_size=.8, random_state=42)

clf = LGBMClassifier(max_depth=3, n_estimators=1000, objective="binary")
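# early_stopping_rounds and verbose were fit() arguments when this was reported;
# LightGBM 4.0 moved them to callbacks (lightgbm.early_stopping, lightgbm.log_evaluation).
clf.fit(X_train, y_train, eval_set=(X_val, y_val), early_stopping_rounds=100, verbose=100, categorical_feature=cat_features)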

Predict on df:

clf.predict_proba(X_train[:1])
# array([[0.81781113, 0.18218887]])

Predict on numpy array (result changes):

clf.predict_proba(X_train[:1].values)
# array([[0.83461009, 0.16538991]])
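
One possible user-side workaround, sketched under assumptions: rebuild a DataFrame with the training columns and dtypes before predicting. `restore_frame` is a hypothetical helper, not a LightGBM function, and it assumes the array columns are in the same order as the training frame. The toy data stands in for `X_train`.

```python
import pandas as pd


def restore_frame(values, reference: pd.DataFrame) -> pd.DataFrame:
    """Rebuild a DataFrame with the reference columns and dtypes from a bare array."""
    frame = pd.DataFrame(values, columns=reference.columns)
    # Casting to the reference dtypes restores the categorical dtype,
    # including its list of categories.
    return frame.astype(reference.dtypes.to_dict())


# Toy stand-in for the training frame (hypothetical data).
train = pd.DataFrame(
    {
        "age": [22.0, 38.0, 26.0],
        "embarked": pd.Series([-1, 0, 2]).astype("category"),
    }
)

raw = train.iloc[:1].to_numpy()      # dtype information is lost here
restored = restore_frame(raw, train)  # and recovered here
print(restored.dtypes)
```

With the real model, the call would then be `clf.predict_proba(restore_frame(array, X_train))`, assuming the training frame (or at least its dtypes) is still available at prediction time.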
@sbushmanov sbushmanov changed the title While predicting the model doesn't check if data structure has changed While predicting the model doesn't check if data dtypes has changed Dec 4, 2020
@sbushmanov sbushmanov changed the title While predicting the model doesn't check if data dtypes has changed While predicting the model doesn't check if data dtypes have changed Dec 4, 2020
@guolinke
Collaborator

guolinke commented Dec 4, 2020

For categorical features in a pandas DataFrame, a mapping (from categories to integers) is saved in the model. So if you convert the data to numpy without applying that mapping, it produces the wrong results.
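
The mapping can be illustrated with pandas alone, no model involved. This is only a sketch: the column below is made-up data, and whether this exact mismatch caused the numbers in the report is an assumption. It shows that a categorical column's labels and its integer codes need not coincide, so a plain numpy conversion can hand over different numbers than the codes.

```python
import pandas as pd

# A categorical column whose labels happen to be integers, similar to
# the report, where labels came from an earlier .cat.codes call.
col = pd.Series([-1, 0, 2, 0]).astype("category")

labels = col.to_numpy()            # what a plain numpy conversion exposes
codes = col.cat.codes.to_numpy()   # positions within col.cat.categories

print(list(col.cat.categories))  # [-1, 0, 2]
print(labels.tolist())           # [-1, 0, 2, 0]
print(codes.tolist())            # [0, 1, 2, 1]
```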

@sbushmanov
Author

Thanks for answering. But this is exactly why I'm suggesting this as a feature, not as a bug: feeding a numpy array is accepted, but it silently produces wrong results.

@guolinke
Collaborator

guolinke commented Dec 5, 2020

@sbushmanov I think it is a trade-off. If we only accept the same data type in prediction as in training, the use of a trained model will be limited.
However, because of the mapping for pandas categorical features, I think we should at least check for that, to avoid the mapping being ignored.

@sbushmanov
Author

sbushmanov commented Dec 5, 2020

I think issuing at least a warning is warranted. It took me half an hour to troubleshoot this one without a hint.

@StrikerRUS
Collaborator

Adding this as a sub-issue of the "Check input for prediction" item in the Feature Requests Hub: #2302.

@github-actions

This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023