Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ENH] interface to ngboost #215

Merged
merged 16 commits into from
May 2, 2024
Merged

[ENH] interface to ngboost #215

merged 16 commits into from
May 2, 2024

Conversation

ShreeshaM07
Copy link
Contributor

@ShreeshaM07 ShreeshaM07 commented Mar 19, 2024

Reference Issues/PRs

Resolves #135

What does this implement/fix? Explain your changes.

This PR has added a way to use ngboost regressor in skpro.
Tested it using the following piece of code

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

#Load Boston housing dataset
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
X = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
Y = raw_df.values[1::2, 2]

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)

ngb = NGBoostRegressor()._fit(X_train, Y_train)
Y_preds = ngb._predict(X_test)
Y_dists = ngb._pred_dist(X_test)

# test Mean Squared Error
test_MSE = mean_squared_error(Y_preds, Y_test)
print('Test MSE', test_MSE)

# test Negative Log Likelihood
test_NLL = -Y_dists.logpdf(Y_test).mean()
print('Test NLL', test_NLL)

Does your contribution introduce a new dependency? If yes, which one?

NGBRegressor has been added from the ngboost package.

What should a reviewer concentrate their feedback on?

What modifications need to be done and some help on how to add get_test_params function and the parameters to be sent via FITTED_PARAMS_TO_FORWARD.

Did you add any tests for the change?

No

Any other comments?

PR checklist

For all contributions
  • I've added myself to the list of contributors with any new badges I've earned :-)
    How to: add yourself to the all-contributors file in the skpro root directory (not the CONTRIBUTORS.md). Common badges: code - fixing a bug, or adding code logic. doc - writing or improving documentation or docstrings. bug - reporting or diagnosing a bug (get this plus code if you also fixed the bug in the PR).maintenance - CI, test framework, release.
    See here for full badge reference
  • The PR title starts with either [ENH], [MNT], [DOC], or [BUG]. [BUG] - bugfix, [MNT] - CI, test framework, [ENH] - adding or improving code, [DOC] - writing or improving documentation or docstrings.
For new estimators
  • I've added the estimator to the API reference - in docs/source/api_reference/taskname.rst, follow the pattern.
  • I've added one or more illustrative usage examples to the docstring, in a pydocstyle compliant Examples section.
  • If the estimator relies on a soft dependency, I've set the python_dependencies tag and ensured
    dependency isolation, see the estimator dependencies guide.

@ShreeshaM07 ShreeshaM07 marked this pull request as draft March 19, 2024 10:53
@ShreeshaM07
Copy link
Contributor Author

I have implemented the _predict_proba after converting the return type of ngboost's normal distribution to skpro Normal distribution it must now be extended to other distributions that ngboost supports and supported by skpro. I will also work on implementing predict_interval and predict_quantile next. In the meantime I will also try to make the conversion of the return types into a function.

The current output upon calling
Y_pred_proba = ngb._predict_proba(X_test)
print(Y_pred_proba) is

[iter 0] loss=3.6479 val_loss=0.0000 scale=1.0000 norm=6.7178
[iter 100] loss=2.6977 val_loss=0.0000 scale=2.0000 norm=4.9421
[iter 200] loss=2.1456 val_loss=0.0000 scale=2.0000 norm=3.3556
[iter 300] loss=1.8761 val_loss=0.0000 scale=1.0000 norm=1.4400
[iter 400] loss=1.7477 val_loss=0.0000 scale=1.0000 norm=1.3337
<ngboost.distns.normal.Normal object at 0x736de3d40f50>
Normal(columns=RangeIndex(start=0, stop=1, step=1),
       index=RangeIndex(start=0, stop=102, step=1),
       mu=             0
0    16.962767
1    21.572000
2    16.014035
3    30.217625
4    19.818455
..         ...
97   11.209679
98   25.100160
99    7.723768
100  19.720101
101  24.292480

[102 rows x 1 columns],
       sigma=            0
0    1.206229
1    1.079737
2    1.241355
3    1.223454
4    1.187693
..        ...
97   1.292326
98   1.216184
99   1.304442
100  1.106799
101  1.176687

[102 rows x 1 columns])
Test MSE 16.048500328511995
Test NLL 4.412747994731287

Please let me know if it looks good or something needs to be changed.

Copy link
Collaborator

@fkiraly fkiraly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot!

Some comments:

  • please look at the extension template, extension_templates/regression.py - you should not override predict_interval etc, there is no need to copy-paste everything from the base class.
  • please implement _predict_proba only, this should start with an underscore.
  • in it, it is sufficient to start at the line # Convert NGBoost distibution to skpro BaseDistribution and supply the code below. That is all you need to do. No need to do copy-paste anything from the base class.
    • of course, for all distributions that ngboost could return. I.e., you need to check which distribution is being returned, and convert to the same skpro distribution. All ngboost distributions should have a counterpart in skpro.

@fkiraly fkiraly added enhancement module:regression probabilistic regression module interfacing algorithms Interfacing existing algorithms/estimators from third party packages labels Apr 22, 2024
@fkiraly
Copy link
Collaborator

fkiraly commented Apr 22, 2024

Also, you should add ngboost as a soft dependency to pyproject.toml, and via the python_dependeices tag to the estimator.

import numpy as np
import pandas as pd
from ngboost import NGBRegressor
from ngboost.distns import Normal
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

imports should happen only in the estimator


def __init__(
self,
Dist=Normal,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These need to change to strings.

variables should also be all lower_snake_case.
The "base learner" I would init with None, and internally replace with the sklearn default learner.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please guide me as to which sklearn default learner do I use instead of ngboost default_tree_ learner.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's an sklearn learner - I would advise to hard-code it directly in __init__, i.e., if estimator is None: self.estimator_ = etc, with the precise learner spec found here: https://github.com/stanfordmlgroup/ngboost/blob/7da1ca94a8857b865cf39fca71b2564bad0c3e15/ngboost/learners.py

(I would also advise to rename the arg base to estimator, to be in line with sklearn here)

pyproject.toml Outdated
@@ -47,6 +47,7 @@ dependencies = [
"scikit-base>=0.6.1,<0.8.0",
"scikit-learn>=0.24.0,<1.5.0",
"scipy<2.0.0,>=1.2.0",
"ngboost",
Copy link
Collaborator

@fkiraly fkiraly Apr 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not to the core dependencies please, to the optional dependencies.

please also include an upper bound, this should be current MINOR plus one

An NGBRegressor object that can be fit.
"""

_tags = {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should include other tags as well: "authors" (you) and "maintainers"

from ngboost.distns import Laplace, LogNormal, Normal, Poisson, T

dist_ngboost = Normal
if self.dist == "Normal":
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, very nice, that's what I meant.

I would recommend to use a dict here, that makes the logic easier to maintain and extend.
Let me know if you want pointers how ot concretely use a dict.

Converted skpro distribution object.
"""
skpro_dist = None
if self.dist == "Normal":
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as above, using a dict would help.

X = self._check_X(X)

# Convert NGBoost distribution to skpro BaseDistribution
pred_mean = self._predict(X=X)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The general idea makes sense, though I would do it differently:

  1. just call pred_dist from ngboost. This is a full parameteric distirbution.
  2. in the _ngb_dist_to_skpro, extract the parameters from the ngboost distribution and plug them into the skpro distribution.

No need to take detours via the mean or _predict, especially since some of the distributions are not directly parameterized in terms of the mean.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have now taken care as to what the value actually returns from the ngboost and converted it to the respective value in skpro. It has also been explained using the comments as to what the distribution is and what the parameters indicate.

self._is_fitted = True
return self

def _predict(self, X):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would recommend, remove _predict and rely on the default behaviour entirely.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be useful to keep _predict in case for just predicting the values. I have not used this anymore in _predict_proba. Please do let me know if you think otherwise.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm, I suppose it makes sense - not a blocker from my side.

early_stopping_rounds=self.early_stopping_rounds,
)
self.ngb.fit(X, y)
self._is_fitted = True
Copy link
Collaborator

@fkiraly fkiraly Apr 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need to set _is_fitted to True, the boilerplate layer in fit already does this.

Copy link
Collaborator

@fkiraly fkiraly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great progress!

I left some comments above for the next iteration.

@ShreeshaM07
Copy link
Contributor Author

@fkiraly could you please review the updates I have made to the code and if anything else needs to be modified.

return [params1, params2]


# #Load Boston housing dataset
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kindly remove the commented code - you can also use check_estimator for testing

`create_test_instance` uses the first (or only) dictionary in `params`
"""
params1 = {"dist": "Normal"}
params2 = {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add more settings to cover all parameters?

It should also cover the default case, empty param set.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ShreeshaM07, while we are checking, could you kindly move the empty param set to the first position? Not strictly needed, but easier to read.

self.ngb = NGBRegressor(
Dist=dist_ngboost,
Score=score,
Base=self.estimator,
Copy link
Collaborator

@fkiraly fkiraly Apr 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the estimator needs to be cloned - otherwise it will be mutated, and you violate the sklearn extension contract (as the failing test tells you)

Copy link
Collaborator

@fkiraly fkiraly May 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ShreeshaM07, I think one of the failures might relate to this comment which you have not addressed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have addressed the clone issue it is not related to that. It is with the handling of the estimator which is DecisionTreeRegressor but init was None.

@fkiraly fkiraly changed the title [ENH] Adding an interface of ngboost to skpro [ENH] interface to ngboost Apr 29, 2024
"LogNormal": LogNormal,
}

kwargs["index"] = kwargs["mu"].index
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be from X.index (from predict_proba), and y.index (from fit), not the index that ngboost returns.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that makes sense. I have now made it X.index inplace of kwargs["mu"].index however I am still unable to remove the

                kwargs[skp_param] = self._check_y(y=kwargs[skp_param])
                # returns a tuple so taking only first index of the tuple
                kwargs[skp_param] = kwargs[skp_param][0]

As it is causing a shape mismatch error upon removing it.
ValueError: shape mismatch: objects cannot be broadcast to a single shape. Mismatch is between arg 0 with shape (111,) and arg 3 with shape (10,).
Where 111 expected Index but arg3 would be the number of columns which seems to be 10.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you have an explanation why this happens?

@ShreeshaM07
Copy link
Contributor Author

Nice, thanks! Really great addition!

I've started the CI tests to see whether they pass - I assume you have checked with check_estimator already.

Could you please help me with the check_estimator. Upon running

from sklearn.utils.estimator_checks import check_estimator
check_estimator(NGBoostRegressor())

It is throwing a TypeError: input must be a pd.DataFrame.

@fkiraly
Copy link
Collaborator

fkiraly commented Apr 29, 2024

You need to use the check_estimator from skpro.utils, not sklearn.utils


Parameters
----------
dist : string , default = "Normal"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for updating the docstring. Please check how the docs render in the readthedocs CI job.

Indentation matters! And this will render funnily, so kindly look at the other regressors and indent similarly.

You can check the result, how it looks, in the readthedocs CI job (click "details").


class NGBoostRegressor(BaseProbaRegressor):
"""
Constructor for all NGBoost models.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here you probably want to describe the regressor instead, i.e., it is an interface to NGBRegressor.

@ShreeshaM07
Copy link
Contributor Author

I have resolved the shape mismatch error using ravel() which the check_estimator suggested to apply on y. I have also changed the kwargs["columns"] to be y.columns now .

But some tests are failing in the check_estimator although it is running it without any errors.

FAILED: test_constructor[NGBoostRegressor]
FAILED: test_set_params_sklearn[NGBoostRegressor]
FAILED: test_fit_does_not_overwrite_hyper_params[NGBoostRegressor-ProbaRegressorBasic]
FAILED: test_fit_does_not_overwrite_hyper_params[NGBoostRegressor-ProbaRegressorXcolMixIxYnp]
FAILED: test_fit_does_not_overwrite_hyper_params[NGBoostRegressor-ProbaRegressorSurvival]
FAILED: test_input_output_contract[NGBoostRegressor]

Could you please help out as to why these are failing.

@fkiraly
Copy link
Collaborator

fkiraly commented May 1, 2024

Could you please help out as to why these are failing.

Of course, will look at it later today.

Can you also describe what you have found out in debugging these failures? You can use raise_exceptions=True to get an error raise for debugging.

@ShreeshaM07
Copy link
Contributor Author

Can you also describe what you have found out in debugging these failures? You can use raise_exceptions=True to get an error raise for debugging.

I have now resolved all the errors except test_fit_does_not_overwrite_hyper_params.
And I have found it is being caused due to initialisation of the base learner/estimator to None in init but i am replacing it with DecisionTreeRegressor later on if estimator is default None.

Can you please suggest what i can do to solve this issue. I tried doing the initialisation to DecisionTreeRegressor which is being done in _fit in init itself if it is None but it then gives test_constructor and test_sklearn_params asssertion error.

@fkiraly
Copy link
Collaborator

fkiraly commented May 1, 2024

See above - look at the error messages, they should tell you precisely what the problem is.
The parameters written to self must always remain unmutated and identical to what was passed in __init__.

@ShreeshaM07
Copy link
Contributor Author

ShreeshaM07 commented May 1, 2024

See above - look at the error messages, they should tell you precisely what the problem is. The parameters written to self must always remain unmutated and identical to what was passed in __init__.

Yes I have now made it self.estimator_ to be passed to NGBRegressor so it is keeping the __init__ params intact and check_estimator is giving all tests passed!

Could you please run the CI jobs again.

@ShreeshaM07 ShreeshaM07 marked this pull request as ready for review May 1, 2024 08:43
@fkiraly
Copy link
Collaborator

fkiraly commented May 1, 2024

Great! Looks excellent, I will give it a review once the tests pass. From superficial inspection, looks ready to merge!

@fkiraly
Copy link
Collaborator

fkiraly commented May 1, 2024

Hm, it looks like check_estimator tests are indeed passing! Very nice!

Well, with exception of the sixth (index 5) parameter set - do you have an idea why that parameter set specifically is failing? Seems nans are being produced.


params6 = {
"dist": "TDistribution",
"natural_gradient": False,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this one seems to be the failing example. Any ideas?
I cannot spot any issue. Is this perhaps a bug in ngboost itself?

Copy link
Collaborator

@fkiraly fkiraly May 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

t-distribution does have non-degenerate natrual gradient afaik, so that might not be it?
But perhaps it's worth trying a different combination here, e.g., normal with natural gradient, and tdistribution plain.

Copy link
Contributor Author

@ShreeshaM07 ShreeshaM07 May 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried running with the ngboost's regressor itself it still gave Singular Matrix error for the load_diabetes dataset for T-Distribution. For all other distributions it works as expected. So just in case it was actually due to some mathematical working involving the determinant of the matrix to not be equal to 0 in case of T-Distribution, I tried with load_breast_cancer dataset ,even for this it is giving Singular Matrix error. So it might actually be the case that there's a bug in ngboost T-Distribution itself.

t-distribution does have non-degenerate natrual gradient afaik, so that might not be it? But perhaps it's worth trying a different combination here, e.g., normal with natural gradient, and tdistribution plain.

The error is not related to the natural_gradient it works for all other distributions when it is False.

Copy link
Collaborator

@fkiraly fkiraly May 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great investigation!

What we could do, remove the parameter set in this PR, but file a bug report on skpro, with code that uses these parameters and the diabetes dataset. In the issue, we then try to remove the skpro interface and use pure ngboost code, that allows us to check whether it's sth in the skpro API, or genuinely ngboost. If the issue remains, we can report on the ngboost repo as a bug, with the pure ngboost code.

Copy link
Collaborator

@fkiraly fkiraly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, thanks!

This is ready to merge, could you open the issue on the skpro tracker to investigate the TDistribution failure? With code if you have it, otherwise just a quick note.

@ShreeshaM07
Copy link
Contributor Author

Nice, thanks!

This is ready to merge, could you open the issue on the skpro tracker to investigate the TDistribution failure? With code if you have it, otherwise just a quick note.

I have now opened #291 along with the code causing the issue.

@fkiraly fkiraly merged commit 4441e55 into sktime:main May 2, 2024
36 checks passed
@ShreeshaM07 ShreeshaM07 deleted the dev branch May 6, 2024 06:24
fkiraly pushed a commit that referenced this pull request May 10, 2024
See #301 , #215 

This PR ensures `NGBoostRegressor` inherits the common conversions from
the `ngboost` distribution to `skpro` distribution from the
`NGBoostAdapter` class. This adapter is common to both the
`NGBoostSurvival` and `NGBoostRegressor`.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement interfacing algorithms Interfacing existing algorithms/estimators from third party packages module:regression probabilistic regression module
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[ENH] interface probabilistic regressors from ngboost package
2 participants