
Sklearn-API: better description for syntax errors #4875

Closed · cibic89 opened this issue Sep 19, 2019 · 3 comments · Fixed by #4929

cibic89 commented Sep 19, 2019

First reported here.
Windows 2016 Server/Windows 10, Anaconda v2019.07, Jupyter lab v1.0.2, xgboost v0.9
Syntax errors should give better descriptions when using the sklearn API. Let me illustrate:

import pandas as pd
from sklearn.preprocessing import LabelEncoder as le
from xgboost import XGBClassifier

## Make a dummy dataframe
df1 = pd.DataFrame(
    {"x1": [0,4,2,5],
     "x2": [10,14,12,15],
     "x3": [20,24,22,25],
     "y": ["a", "b", "missing", "z"]}
)
for i in range(15):
    df1 = df1.append(df1, ignore_index = True)
print(df1.shape)
display(df1.head()) # the "missing" response example can't be removed: the test set needs it for prediction too, and if NaNs are left in, the label encoder will throw an error

## Make another dataframe encoding object/categorical data
df2 = df1.copy()
resp_var_le = le()
df2["y"] = resp_var_le.fit_transform(df2["y"])
print(resp_var_le.classes_)
display(df2.head())

## Prepare for xgboost
n_trees = 10
verbosity = 1
tree_method = "approx"
learning_rate = 0.3
max_depth = 6
random_state = 123
objective = 'multi:softprob'
eval_metric = "mlogloss"

## Fit a model with nothing missing
X = df2.drop(columns = ["y"])
y = df2["y"]
xgb_model = XGBClassifier(max_depth = max_depth, learning_rate = learning_rate, n_estimators = n_trees,
                          verbosity = verbosity, objective = objective, nthread = -1,
                          random_state = random_state, missing = resp_var_le.transform(["missing"])[0])
xgb_model.fit(X = X, y = y, eval_set = [(X, y)], eval_metric = eval_metric, early_stopping_rounds = 10)

The above code works fine, but make either of the changes below:

xgb_model.fit(X = X, y = y, eval_set = [X, y], eval_metric = eval_metric, early_stopping_rounds = 10)

or

xgb_model.fit(X = X, y = y, eval_metric = eval_metric, early_stopping_rounds = 10)

and both will give you this unintuitive error:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   2896             try:
-> 2897                 return self._engine.get_loc(key)
   2898             except KeyError:

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 0

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-1-abc953dedfab> in <module>
     46 X = df3.drop(columns = ["y"])
     47 y = df3["y"]
---> 48 xgb_model.fit(X = X, y = y, eval_set = [X, y], eval_metric = eval_metric, early_stopping_rounds = 10)

~\Anaconda3\lib\site-packages\xgboost\sklearn.py in fit(self, X, y, sample_weight, eval_set, eval_metric, early_stopping_rounds, verbose, xgb_model, sample_weight_eval_set, callbacks)
    709                         missing=self.missing, weight=sample_weight_eval_set[i],
    710                         nthread=self.n_jobs)
--> 711                 for i in range(len(eval_set))
    712             )
    713             nevals = len(evals)

~\Anaconda3\lib\site-packages\xgboost\sklearn.py in <genexpr>(.0)
    709                         missing=self.missing, weight=sample_weight_eval_set[i],
    710                         nthread=self.n_jobs)
--> 711                 for i in range(len(eval_set))
    712             )
    713             nevals = len(evals)

~\Anaconda3\lib\site-packages\pandas\core\frame.py in __getitem__(self, key)
   2978             if self.columns.nlevels > 1:
   2979                 return self._getitem_multilevel(key)
-> 2980             indexer = self.columns.get_loc(key)
   2981             if is_integer(indexer):
   2982                 indexer = [indexer]

~\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   2897                 return self._engine.get_loc(key)
   2898             except KeyError:
-> 2899                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   2900         indexer = self.get_indexer([key], method=method, tolerance=tolerance)
   2901         if indexer.ndim > 1 or indexer.size > 1:

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 0
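
For the record, the underlying cause is visible in the sklearn.py frames above: the wrapper expects eval_set to be a list of (X, y) tuples and indexes each element as eval_set[i][0] and eval_set[i][1]. With eval_set = [X, y], the first lookup becomes X[0], a pandas column access that fails with the bare KeyError: 0. A minimal sketch of the working vs. failing shapes, reusing the variables from the snippet above:

## eval_set takes (features, labels) pairs, one tuple per evaluation set
xgb_model.fit(X = X, y = y,
              eval_set = [(X, y)],  # correct: a list of (X, y) tuples
              eval_metric = eval_metric, early_stopping_rounds = 10)

## eval_set = [X, y] is treated as two evaluation "pairs": the wrapper then
## evaluates eval_set[0][0], i.e. X[0] -- a column lookup on the DataFrame,
## which raises KeyError: 0 instead of a readable message.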
cibic89 changed the title from "Sklearn-API: better descriptions with syntax errors" to "Sklearn-API: better description for syntax errors" on Sep 19, 2019
trivialfis (Member) commented:

We can add more checks for it, like whether it's a list containing tuples, but that might be too rigid. Or something more general, like accepting any iterable... Do you have any suggestions?
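
For illustration, a minimal sketch of what such a check might look like -- a hypothetical helper, not the actual xgboost code; the fix that eventually landed in #4929 may differ:

def _validate_eval_set(eval_set):
    # Hypothetical early check so the failure mode above becomes a
    # readable message instead of a KeyError deep inside pandas.
    if eval_set is None:
        return
    if not isinstance(eval_set, list):
        raise TypeError("eval_set must be a list of (X, y) tuples")
    for i, pair in enumerate(eval_set):
        if not (isinstance(pair, tuple) and len(pair) == 2):
            raise TypeError(
                "eval_set[{}] must be a (X, y) tuple, got {}".format(
                    i, type(pair).__name__))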

cibic89 (Author) commented Sep 19, 2019 via email

trivialfis (Member) commented:

@cibic89 Thanks. See if we can polish the interface more.

trivialfis added a commit to trivialfis/xgboost that referenced this issue Oct 10, 2019
trivialfis added a commit that referenced this issue Oct 12, 2019
* Remove nthread, seed, silent. Add tree_method, gpu_id, num_parallel_tree. Fix #4909.
* Check data shape. Fix #4896.
* Check element of eval_set is tuple. Fix #4875.
* Add doc for random_state with hogwild. Fixes #4919.