Issue with array dimension error in regression models #1297

Closed
PanyiDong opened this issue Nov 9, 2021 · 3 comments · Fixed by #1335
@PanyiDong

Describe the bug

I'm calling some of the regression models provided in auto-sklearn for my project, and an error is raised when using mlp, libsvm_svr, or sgd. The exact error message is below (the printed 1D array is omitted):

~/anaconda3/lib/python3.8/site-packages/autosklearn/pipeline/components/regression/libsvm_svr.py in predict(self, X)
    100             raise NotImplementedError
    101         Y_pred = self.estimator.predict(X)
--> 102         return self.scaler.inverse_transform(Y_pred)
    103 
    104     @staticmethod

~/anaconda3/lib/python3.8/site-packages/sklearn/preprocessing/_data.py in inverse_transform(self, X, copy)
   1014 
   1015         copy = copy if copy is not None else self.copy
-> 1016         X = check_array(
   1017             X,
   1018             accept_sparse="csr",

~/anaconda3/lib/python3.8/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
    759             # If input is 1D raise error
    760             if array.ndim == 1:
--> 761                 raise ValueError(
    762                     "Expected 2D array, got 1D array instead:\narray={}.\n"
    763                     "Reshape your data either using array.reshape(-1, 1) if "

ValueError: Expected 2D array, got 1D array instead:

The same error occurs for autosklearn/pipeline/components/regression/mlp.py, autosklearn/pipeline/components/regression/libsvm_svr.py and autosklearn/pipeline/components/regression/sgd.py.

To Reproduce

Test data: https://www.kaggle.com/tejashvi14/medical-insurance-premium-prediction/download
Using "PremiumPrice" as response/y and other variables as features/X

  1. Call any of the above three models with a fit/predict workflow. The above message appears at the predict stage.
  2. Alternatively, use AutoSklearnRegressor.
    Fit stage (the time limit is only there to save time; I don't expect it to return anything meaningful):
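from autosklearn.regression import AutoSklearnRegressor

# `data` is the test data above loaded as a pandas DataFrame; `features` is the
# list of feature column names and `response` is "PremiumPrice".
reg = AutoSklearnRegressor(
    time_left_for_this_task=360,
    include={"regressor": ["mlp"]},
)
reg.fit(data[features], data[[response]])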

Predict stage:

reg.predict(data[features])

The training stage emits a large number of warnings of the form [WARNING] [2021-11-09 15:14:31,628:Client-AutoMLSMBO(1)::079213e7-41a2-11ec-97c8-00155d1712a6] Configuration 119 not found (with a different number in place of 119 each time).
For AutoSklearnRegressor, predict just returns a (n_sample, ) numpy array whose elements are all identical (close to the mean of the response, but not exactly equal to it), which I don't think is the intended behavior.

Output of the predict stage (only the first few lines are shown; the rest are identical):

array([24110.60546875, 24110.60546875, 24110.60546875, 24110.60546875,
       24110.60546875, 24110.60546875, 24110.60546875, 24110.60546875,
       24110.60546875, 24110.60546875, 24110.60546875, 24110.60546875,
       24110.60546875, 24110.60546875, 24110.60546875, 24110.60546875,

Reason for the Problem

I think the problem is caused by standardization (sklearn.preprocessing.StandardScaler) used in autosklearn/pipeline/components/regression/mlp.py, autosklearn/pipeline/components/regression/libsvm_svr.py and autosklearn/pipeline/components/regression/sgd.py

The code below is extracted from autosklearn/pipeline/components/regression/sgd.py, iterative_fit, lines 92-95:

self.scaler = sklearn.preprocessing.StandardScaler(copy=True)
self.scaler.fit(y.reshape((-1, 1)))
Y_scaled = self.scaler.transform(y.reshape((-1, 1))).ravel()
self.estimator.fit(X, Y_scaled)

And in the predict method, lines 131-132:

Y_pred = self.estimator.predict(X)
return self.scaler.inverse_transform(Y_pred)

Y_pred, as returned by the estimator's predict method, is a (n_sample, ) numpy array, while the inverse_transform of StandardScaler requires a (n_sample, 1) array. The correction should be something like:

Y_pred = self.estimator.predict(X)
return self.scaler.inverse_transform(Y_pred.reshape(-1, 1)).ravel()

I think mlp/libsvm_svr have the same problem.

Environment and installation:

  • OS: Windows 11 Education, OS build 22000.282, WSL version 2 with Ubuntu 20.04.3 LTS (run on WSL)
  • Conda version: 4.10.3
  • Python version: 3.8.8
  • Sklearn version: 1.0.1
  • Auto-sklearn version: 0.14.0
@eddiebergman
Contributor

Hi @PanyiDong,

This seems interesting, and at a glance I'm not sure why it hasn't been an issue before. It makes sense that the estimator predicts a 1D output [1, 2, 3, ...] and that this should be expanded to [[1], [2], [3], ...] before being passed to the inverse transform of the scaler.

For reference, see the StandardScaler docs.

This is further confirmed by checking the source code of inverse_transform of StandardScaler which uses check_array with ensure_2d=True.
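
The validation behavior can be checked directly. The snippet below is a small sketch that calls `sklearn.utils.check_array` on its own, relying on its default `ensure_2d=True`; the input values are arbitrary, and how `inverse_transform` invokes it internally may differ between scikit-learn versions.

```python
import numpy as np
from sklearn.utils import check_array

one_d = np.array([1.0, 2.0, 3.0])

# With the default ensure_2d=True, a 1-D input is rejected.
try:
    check_array(one_d)
    message = None
except ValueError as exc:
    message = str(exc)

# The same data expanded to [[1], [2], [3]] passes validation unchanged.
two_d = check_array(one_d.reshape(-1, 1))
print(message)
print(two_d.shape)
```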

Your solution should work for single output regression but I'll need to test properly to make a solution that also works for multi-output regression. I'll also have to check why the tests have not caught this before.
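
One possible way to cover both cases is sketched below. This is not the change made in #1335; `inverse_scale` is a hypothetical helper, and the sketch assumes that for multi-output targets the scaler is fitted on the 2D target array directly, which differs from the `reshape((-1, 1))` in the excerpt above.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler


def inverse_scale(scaler, y_pred):
    """Hypothetical helper: undo target scaling for 1-D or 2-D predictions."""
    y_pred = np.asarray(y_pred)
    if y_pred.ndim == 1:
        # Single output: expand to a column, invert, then flatten again.
        return scaler.inverse_transform(y_pred.reshape(-1, 1)).ravel()
    # Multi-output: already (n_samples, n_targets), so pass it through.
    return scaler.inverse_transform(y_pred)


# Single-output case: the scaler is fitted on a column vector.
y_single = np.array([1.0, 2.0, 3.0])
scaler_single = StandardScaler().fit(y_single.reshape(-1, 1))
scaled_single = scaler_single.transform(y_single.reshape(-1, 1)).ravel()
restored_single = inverse_scale(scaler_single, scaled_single)

# Multi-output case: two targets per sample, scaler fitted per column.
y_multi = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
scaler_multi = StandardScaler().fit(y_multi)
restored_multi = inverse_scale(scaler_multi, scaler_multi.transform(y_multi))
```

Branching on `ndim` keeps the output shape equal to the input shape, so single-output callers still receive a (n_samples,) array.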

Many thanks,
Eddie

@eddiebergman
Contributor

Hi @PanyiDong,

Sorry for the slow response to this. It turns out that the StandardScaler was indeed causing the issue: it was an artifact of when auto-sklearn was updated to allow for multi-target regression, but the models were not updated to check input dimensions. This has been fixed in #1335.

@eddiebergman
Contributor

Fixed with #1335

This was referenced Jan 24, 2022