[BUG] NGBoostRegressor failing when dist="TDistribution" #291

Open
ShreeshaM07 opened this issue May 2, 2024 · 4 comments
Labels
bug module:regression probabilistic regression module

Comments

@ShreeshaM07
Contributor

ShreeshaM07 commented May 2, 2024

Describe the bug

In the gradient_boosting module, which exposes ngboost's NGBRegressor in skpro as NGBoostRegressor, fitting with dist="TDistribution" fails. It raises errors like

    raise LinAlgError("Singular matrix")
numpy.linalg.LinAlgError: Singular matrix
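For context, here is a minimal sketch of how NumPy produces this error class. It is not the ngboost code path, which this report does not pin down. The matrix `A` and vector `b` are made-up illustration values: an exact solve on a rank-deficient matrix raises `LinAlgError`, while a least-squares solve tolerates it.

```python
import numpy as np

# Hypothetical rank-deficient matrix: the second row is exactly twice the first.
A = np.array([[1.0, 2.0], [2.0, 4.0]])
b = np.array([1.0, 2.0])

# An exact solve needs an invertible matrix, so this raises LinAlgError.
try:
    np.linalg.solve(A, b)
    raised = False
    message = ""
except np.linalg.LinAlgError as err:
    raised = True
    message = str(err)

# A least-squares solve accepts rank-deficient input and returns
# the minimum-norm solution together with the detected rank.
x, _, rank, _ = np.linalg.lstsq(A, b, rcond=None)

print(raised, message, rank)
```

If the failing step inside the fit is a matrix inversion or exact solve of this kind, collinear or degenerate inputs would be one plausible trigger, but that is an assumption to verify against the full traceback.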

To Reproduce

Both sklearn's diabetes dataset and its breast_cancer dataset produce the same "Singular matrix" error. To reproduce:

from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from skpro.regression.gradient_boosting import NGBoostRegressor


# step 1: data specification
X, y = load_diabetes(return_X_y=True, as_frame=True)
X_train, X_test, Y_train, Y_test = train_test_split(X, y)
ngb = NGBoostRegressor(dist="TDistribution")._fit(X_train, Y_train)
Y_preds = ngb._predict(X_test)

Y_dists = ngb._pred_dist(X_test)

print(Y_dists)
Y_pred_proba = ngb.predict_proba(X_test)
print(Y_pred_proba)

# test Mean Squared Error
test_MSE = mean_squared_error(Y_preds, Y_test)
print('Test MSE', test_MSE)

# test Negative Log Likelihood
test_NLL = -Y_dists.logpdf(Y_test).mean()
print('Test NLL', test_NLL)

Expected behavior

The expected output should look something like this:

[iter 0] loss=5.7260 val_loss=0.0000 scale=1.0000 norm=62.6096
[iter 100] loss=5.3862 val_loss=0.0000 scale=1.0000 norm=44.7994
[iter 200] loss=5.1347 val_loss=0.0000 scale=2.0000 norm=70.8354
[iter 300] loss=4.9709 val_loss=0.0000 scale=1.0000 norm=31.4283
[iter 400] loss=4.8448 val_loss=0.0000 scale=2.0000 norm=57.8725
<ngboost.distns.t.TDistribution object at 0x7a306649f010>
TDistribution(columns=Index(['target'], dtype='object'),
       index=Index([394,  76, 398, 154, 164, 409,  86,  57, 248, 252,
       ...
       337,  16, 115, 134, 158, 256, 315,   7, 292, 119],
      dtype='int64', length=111),
       mu=              0
0    204.242902
1    159.767290
2    180.299182
3    157.156834
4    132.029658
..          ...
106  207.598136
107  111.282266
108  142.690431
109   82.266164
110  144.789344

[111 rows x 1 columns],
       sigma=             0
0    22.784403
1    26.722443
2    41.334656
3    32.130065
4    23.862477
..         ...
106  31.425179
107  33.441920
108  24.632183
109  26.791969
110  34.908296

[111 rows x 1 columns])
Test MSE 4077.414567879142
Test NLL 6.473540253400317

Environment

Python 3.11.8
ngboost 0.5.1

Additional context

The open question is whether the problem lies in the interfacing, i.e. the skpro API, or is genuinely a bug in the ngboost TDistribution itself.

@ShreeshaM07 ShreeshaM07 added the bug label May 2, 2024
@fkiraly fkiraly added the module:regression probabilistic regression module label May 2, 2024
@julian-fong
Contributor

I am encountering "Singular matrix" errors in CI checks for other PRs and am wondering whether they are related. These are the tests failing in #370:

FAILED skpro/tests/test_all_estimators.py::TestAllEstimators::test_fit_does_not_overwrite_hyper_params[RandomizedSearchCV-2-ProbaRegressorSurvival] - numpy.linalg.LinAlgError: Singular matrix
FAILED skpro/tests/test_all_estimators.py::TestAllEstimators::test_fit_updates_state[GridSearchCV-2-ProbaRegressorSurvival] - numpy.linalg.LinAlgError: Singular matrix
FAILED skpro/tests/test_all_estimators.py::TestAllEstimators::test_fit_returns_self[RandomizedSearchCV-2-ProbaRegressorSurvival] - numpy.linalg.LinAlgError: Singular matrix
FAILED skpro/tests/test_all_estimators.py::TestAllEstimators::test_fit_does_not_overwrite_hyper_params[GridSearchCV-2-ProbaRegressorSurvival] - numpy.linalg.LinAlgError: Singular matrix
FAILED skpro/tests/test_all_estimators.py::TestAllEstimators::test_fit_updates_state[RandomizedSearchCV-2-ProbaRegressorSurvival] - numpy.linalg.LinAlgError: Singular matrix
FAILED skpro/tests/test_all_estimators.py::TestAllEstimators::test_fit_returns_self[GridSearchCV-2-ProbaRegressorSurvival] - numpy.linalg.LinAlgError: Singular matrix

@fkiraly
Collaborator

fkiraly commented Jun 13, 2024

Hm, I think this is due to the CoxPH used in parameter set 2, which is not robust when fitted on a small dataset.

We could:

  • use another estimator from one of the other deps
  • try to replace with a survival model from skpro without soft dependencies. Currently, the only such models are composites, using, say, ConditionUncensored wrapping ResidualDouble or EnbPI.

@julian-fong
Copy link
Contributor

Do you have a particular preference? I'm not too familiar with survival models, so recommendations would be helpful here.

fkiraly pushed a commit that referenced this issue Jun 14, 2024
…` instead of `CoxPH` (#388)

#### Reference Issues/PRs

Fixes #387. Changed paramset3 to use `ConditionUncensored` instead of
`CoxPH`, since `CoxPH` doesn't seem stable on smaller datasets.

Discussion thread on #291
@fkiraly
Collaborator

fkiraly commented Jun 14, 2024

Summarizing the earlier discussion today: any survival model without soft deps that is numerically stable on small data should do for the purpose of smooth testing, e.g. ResidualDouble with LinearRegression or similar.


3 participants