
Sklearn-API: better description for syntax errors #4875

Closed · cibic89 opened this issue Sep 19, 2019 · 3 comments · Fixed by #4929

cibic89 commented Sep 19, 2019

First reported here.
Windows 2016 Server/Windows 10, Anaconda v2019.07, Jupyter lab v1.0.2, xgboost v0.9
Syntax errors should give better descriptions when using the sklearn API. Let me illustrate:

import pandas as pd
from sklearn.preprocessing import LabelEncoder as le
from xgboost import XGBClassifier

## Make a dummy dataframe
df1 = pd.DataFrame(
    {"x1": [0,4,2,5],
     "x2": [10,14,12,15],
     "x3": [20,24,22,25],
     "y": ["a", "b", "missing", "z"]}
)
for i in range(15):
    df1 = df1.append(df1, ignore_index = True)
print(df1.shape)
display(df1.head()) # the "missing" response example can't be removed: the test set needs it for prediction too, and if NaNs are left in, the label encoder will throw an error

## Make another dataframe encoding object/categorical data
df2 = df1.copy()
resp_var_le = le()
df2["y"] = resp_var_le.fit_transform(df2["y"])
print(resp_var_le.classes_)
display(df2.head())

## Prepare for xgboost
n_trees = 10
verbosity = 1
tree_method = "approx"
learning_rate = 0.3
max_depth = 6
random_state = 123
objective = 'multi:softprob'
eval_metric = "mlogloss"

## Fit a model with nothing missing
X = df2.drop(columns = ["y"])
y = df2["y"]
xgb_model = XGBClassifier(max_depth = max_depth, learning_rate = learning_rate, n_estimators = n_trees,
                          verbosity = verbosity, objective = objective, nthread = -1,
                          random_state = random_state, missing = resp_var_le.transform(["missing"])[0])
xgb_model.fit(X = X, y = y, eval_set = [(X, y)], eval_metric = eval_metric, early_stopping_rounds = 10)

The above code works fine, but make either of the changes below:

xgb_model.fit(X = X, y = y, eval_set = [X, y], eval_metric = eval_metric, early_stopping_rounds = 10)

or

xgb_model.fit(X = X, y = y, eval_metric = eval_metric, early_stopping_rounds = 10)

and both will give you this unintuitive error:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   2896             try:
-> 2897                 return self._engine.get_loc(key)
   2898             except KeyError:

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 0

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-1-abc953dedfab> in <module>
     46 X = df3.drop(columns = ["y"])
     47 y = df3["y"]
---> 48 xgb_model.fit(X = X, y = y, eval_set = [X, y], eval_metric = eval_metric, early_stopping_rounds = 10)

~\Anaconda3\lib\site-packages\xgboost\sklearn.py in fit(self, X, y, sample_weight, eval_set, eval_metric, early_stopping_rounds, verbose, xgb_model, sample_weight_eval_set, callbacks)
    709                         missing=self.missing, weight=sample_weight_eval_set[i],
    710                         nthread=self.n_jobs)
--> 711                 for i in range(len(eval_set))
    712             )
    713             nevals = len(evals)

~\Anaconda3\lib\site-packages\xgboost\sklearn.py in <genexpr>(.0)
    709                         missing=self.missing, weight=sample_weight_eval_set[i],
    710                         nthread=self.n_jobs)
--> 711                 for i in range(len(eval_set))
    712             )
    713             nevals = len(evals)

~\Anaconda3\lib\site-packages\pandas\core\frame.py in __getitem__(self, key)
   2978             if self.columns.nlevels > 1:
   2979                 return self._getitem_multilevel(key)
-> 2980             indexer = self.columns.get_loc(key)
   2981             if is_integer(indexer):
   2982                 indexer = [indexer]

~\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   2897                 return self._engine.get_loc(key)
   2898             except KeyError:
-> 2899                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   2900         indexer = self.get_indexer([key], method=method, tolerance=tolerance)
   2901         if indexer.ndim > 1 or indexer.size > 1:

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 0
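
For the record, the underlying cause is visible in the sklearn.py frames above: the wrapper expects eval_set to be a list of (X, y) tuples and indexes each element as eval_set[i][0] and eval_set[i][1]. With eval_set = [X, y], the first lookup becomes X[0], a pandas column access that fails with the bare KeyError: 0. A minimal sketch of the working vs. failing shapes, reusing the variables from the snippet above:

## eval_set takes (features, labels) pairs, one tuple per evaluation set
xgb_model.fit(X = X, y = y,
              eval_set = [(X, y)],  # correct: a list of (X, y) tuples
              eval_metric = eval_metric, early_stopping_rounds = 10)

## eval_set = [X, y] is treated as two evaluation "pairs": the wrapper then
## evaluates eval_set[0][0], i.e. X[0] -- a column lookup on the DataFrame,
## which raises KeyError: 0 instead of a readable message.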
cibic89 changed the title from "Sklearn-API: better descriptions with syntax errors" to "Sklearn-API: better description for syntax errors" on Sep 19, 2019
trivialfis (Member) commented:

We can add more checks for it, like whether it's a list containing tuples, but that might be too rigid. Or something more general, like accepting any iterable... Do you have any suggestions?
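
For illustration, a minimal sketch of what such a check might look like -- a hypothetical helper, not the actual xgboost code; the fix that eventually landed in #4929 may differ:

def _validate_eval_set(eval_set):
    # Hypothetical early check so the failure mode above becomes a
    # readable message instead of a KeyError deep inside pandas.
    if eval_set is None:
        return
    if not isinstance(eval_set, list):
        raise TypeError("eval_set must be a list of (X, y) tuples")
    for i, pair in enumerate(eval_set):
        if not (isinstance(pair, tuple) and len(pair) == 2):
            raise TypeError(
                "eval_set[{}] must be a (X, y) tuple, got {}".format(
                    i, type(pair).__name__))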

cibic89 (Author) commented Sep 19, 2019 via email

trivialfis (Member) commented:

@cibic89 Thanks. See if we can polish the interface more.

trivialfis added a commit to trivialfis/xgboost that referenced this issue Oct 10, 2019
trivialfis added a commit that referenced this issue Oct 12, 2019
* Remove nthread, seed, silent. Add tree_method, gpu_id, num_parallel_tree. Fix #4909.
* Check data shape. Fix #4896.
* Check element of eval_set is tuple. Fix #4875.
* Add doc for random_state with hogwild. Fixes #4919.