Issue with array dimension error in regression models #1297

PanyiDong · 2021-11-09T22:15:01Z

Describe the bug

I'm calling some of the regression methods provided in auto-sklearn for my project, and an error is raised when using mlp, libsvm_svr, or sgd. The exact error message is below (the printed 1D array is omitted):

~/anaconda3/lib/python3.8/site-packages/autosklearn/pipeline/components/regression/libsvm_svr.py in predict(self, X)
    100             raise NotImplementedError
    101         Y_pred = self.estimator.predict(X)
--> 102         return self.scaler.inverse_transform(Y_pred)
    103 
    104     @staticmethod

~/anaconda3/lib/python3.8/site-packages/sklearn/preprocessing/_data.py in inverse_transform(self, X, copy)
   1014 
   1015         copy = copy if copy is not None else self.copy
-> 1016         X = check_array(
   1017             X,
   1018             accept_sparse="csr",

~/anaconda3/lib/python3.8/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
    759             # If input is 1D raise error
    760             if array.ndim == 1:
--> 761                 raise ValueError(
    762                     "Expected 2D array, got 1D array instead:\narray={}.\n"
    763                     "Reshape your data either using array.reshape(-1, 1) if "

ValueError: Expected 2D array, got 1D array instead:

The same error occurs for autosklearn/pipeline/components/regression/mlp.py, autosklearn/pipeline/components/regression/libsvm_svr.py and autosklearn/pipeline/components/regression/sgd.py.

To Reproduce

Test data: https://www.kaggle.com/tejashvi14/medical-insurance-premium-prediction/download
Use "PremiumPrice" as the response (y) and the other variables as features (X).

Call any of the three models above with a fit/predict workflow. The error above appears at the predict stage.
Alternatively, I tried AutoSklearnRegressor.
Fit stage (the time limit is only there to save time; I don't expect it to return anything meaningful):

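from autosklearn.regression import AutoSklearnRegressor

# data is assumed to be the Kaggle dataset loaded as a pandas DataFrame;
# features and response are the column names described above.
reg = AutoSklearnRegressor(
    time_left_for_this_task=360,
    include={'regressor': ['mlp']},
)
reg.fit(data[features], data[[response]])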

Predict Stage

reg.predict(data[features], data[[response]])

The training stage emits a very large number of warnings of the form [WARNING] [2021-11-09 15:14:31,628:Client-AutoMLSMBO(1)::079213e7-41a2-11ec-97c8-00155d1712a6] Configuration 119 not found (with a different number in place of 119 each time).
With AutoSklearnRegressor, predict just returns a (n_samples,) numpy array whose elements are all identical (close to the mean of the response, but not exactly equal to it), which I don't think is the intended behavior.

Output of the predict stage (only the first few lines are shown; the rest are identical):

array([24110.60546875, 24110.60546875, 24110.60546875, 24110.60546875,
       24110.60546875, 24110.60546875, 24110.60546875, 24110.60546875,
       24110.60546875, 24110.60546875, 24110.60546875, 24110.60546875,
       24110.60546875, 24110.60546875, 24110.60546875, 24110.60546875,
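A quick way to confirm that a prediction vector like the one above is constant, and to see how far it sits from the mean of the response, is sketched below. This is only an illustration using NumPy: `y_pred` and `y_true` are made-up stand-ins, not the actual data from this report.

```python
import numpy as np

# Made-up stand-ins: a constant prediction vector like the one shown above,
# and a small made-up response column.
y_pred = np.full(8, 24110.60546875, dtype=np.float32)
y_true = np.array([15000.0, 23000.0, 28000.0, 31000.0,
                   25000.0, 21000.0, 29000.0, 23000.0])

# np.ptp is max - min; zero means every prediction is identical.
is_constant = np.ptp(y_pred) == 0

# Gap between the constant prediction and the mean of the response.
gap_to_mean = abs(float(y_pred[0]) - float(y_true.mean()))

print("constant predictions:", bool(is_constant))
print("distance from response mean:", gap_to_mean)
```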

Reason for the Problem

I think the problem is caused by standardization (sklearn.preprocessing.StandardScaler) used in autosklearn/pipeline/components/regression/mlp.py, autosklearn/pipeline/components/regression/libsvm_svr.py and autosklearn/pipeline/components/regression/sgd.py

The code below is extracted from autosklearn/pipeline/components/regression/sgd.py, iterative_fit, lines 92-95:

self.scaler = sklearn.preprocessing.StandardScaler(copy=True)
self.scaler.fit(y.reshape((-1, 1)))
Y_scaled = self.scaler.transform(y.reshape((-1, 1))).ravel()
self.estimator.fit(X, Y_scaled)

And in the predict method, lines 131-132:

Y_pred = self.estimator.predict(X)
return self.scaler.inverse_transform(Y_pred)

Y_pred is returned by the estimator's predict method as a (n_samples,) numpy array, while inverse_transform of StandardScaler requires a (n_samples, 1) array. The correction should be something like:

Y_pred = self.estimator.predict(X)
return self.scaler.inverse_transform(Y_pred.reshape(-1, 1)).ravel()

I think mlp/libsvm_svr have the same problem.
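The shape mismatch can be reproduced outside auto-sklearn. The following is a minimal sketch, assuming NumPy and a recent scikit-learn; the target values and variable names are made up for illustration, and `y_pred_scaled` stands in for the output of `self.estimator.predict(X)`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up target values standing in for "PremiumPrice".
y = np.array([15000.0, 23000.0, 28000.0, 31000.0, 25000.0])

# Same pattern as iterative_fit: the scaler is fitted on a (n_samples, 1) column.
scaler = StandardScaler(copy=True)
scaler.fit(y.reshape((-1, 1)))
y_scaled = scaler.transform(y.reshape((-1, 1))).ravel()

# A regressor's predict() returns shape (n_samples,); the scaled targets
# themselves are used here as a stand-in for such predictions.
y_pred_scaled = y_scaled

# With scikit-learn 1.0.1, as in this report, passing the 1D array straight
# to inverse_transform raises ValueError. The try/except keeps the sketch
# running whichever way the installed version behaves.
try:
    scaler.inverse_transform(y_pred_scaled)
    raised = False
except ValueError:
    raised = True

# Reshaping to a column first, then flattening, restores the original scale.
y_restored = scaler.inverse_transform(y_pred_scaled.reshape(-1, 1)).ravel()

print("1D input rejected:", raised)
print("round trip matches original targets:", np.allclose(y_restored, y))
```

The reshape/ravel pair keeps the return value at shape (n_samples,), so callers of predict see the same shape as before.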

Environment and installation:

OS: Windows 11 Education, OS build 22000.282, WSL version 2 with Ubuntu 20.04.3 LTS (run on WSL)
Conda version: 4.10.3
Python version: 3.8.8
Sklearn version: 1.0.1
Auto-sklearn version: 0.14.0


eddiebergman · 2021-11-09T23:22:52Z

Hi @PanyiDong,

This seems interesting, and at a glance I'm not sure why it hasn't been an issue before. It would make sense that the estimator predicts a 1D output [1, 2, 3, ...] and that this should be expanded to [[1], [2], [3], ...] before being passed to the inverse transform of the scaler.

For reference, see the StandardScaler docs.

This is further confirmed by checking the source code of inverse_transform of StandardScaler which uses check_array with ensure_2d=True.

Your solution should work for single-output regression, but I'll need to test properly to make a solution that also works for multi-output regression. I'll also have to check why the tests have not caught this before.
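One possible way to cover both the single-output and multi-output cases is sketched below. This is only an illustration, not necessarily how auto-sklearn handles it: `inverse_transform_predictions` is a hypothetical helper name, and the sketch assumes the scaler was fitted on an array with the same number of target columns as the predictions.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler


def inverse_transform_predictions(scaler, y_pred):
    """Hypothetical helper: undo target scaling for 1D or 2D predictions."""
    y_pred = np.asarray(y_pred)
    if y_pred.ndim == 1:
        # Single-output case: lift to a column, invert, then flatten again.
        return scaler.inverse_transform(y_pred.reshape(-1, 1)).ravel()
    # Multi-output case: already (n_samples, n_targets), pass through as-is.
    return scaler.inverse_transform(y_pred)


# Single target: scaler fitted on one column, predictions are 1D.
y_single = np.array([1.0, 2.0, 3.0, 4.0])
scaler_single = StandardScaler().fit(y_single.reshape(-1, 1))
scaled_single = scaler_single.transform(y_single.reshape(-1, 1)).ravel()
restored_single = inverse_transform_predictions(scaler_single, scaled_single)

# Two targets: scaler fitted on a (n_samples, 2) array, predictions are 2D.
y_multi = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
scaler_multi = StandardScaler().fit(y_multi)
scaled_multi = scaler_multi.transform(y_multi)
restored_multi = inverse_transform_predictions(scaler_multi, scaled_multi)

print(restored_single.shape, restored_multi.shape)
```

Branching on `ndim` keeps the output shape equal to the input shape, so single-output callers still receive a (n_samples,) array.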

Many thanks,
Eddie

eddiebergman · 2021-12-07T16:52:39Z

Hi @PanyiDong,

Sorry for the slow response to this. It turns out that the StandardScaler was indeed causing the issue. It was an artifact of auto-sklearn being updated to allow multi-target regression without the models being updated to check input dimensions. This has been fixed in #1335.

eddiebergman · 2021-12-13T13:50:12Z

Fixed with #1335

eddiebergman added the bug label Nov 9, 2021

eddiebergman mentioned this issue Dec 7, 2021

Fix regression algorithms to give correct output dimensions #1335

Merged

eddiebergman linked a pull request Dec 7, 2021 that will close this issue

Fix regression algorithms to give correct output dimensions #1335

Merged

eddiebergman closed this as completed Dec 13, 2021

This was referenced Jan 24, 2022

V0.14.4 #1378

Merged

V0.14.4 #1379

Merged
