-
Notifications
You must be signed in to change notification settings - Fork 51
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ENH] interface to ngboost
#215
Conversation
… to skpro Normal dist
I have implemented the The current output upon calling
Please let me know if it looks good or something needs to be changed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot!
Some comments:
- please look at the extension template,
extension_templates/regression.py
- you should not overridepredict_interval
etc, there is no need to copy-paste everything from the base class.- here is a longer guide on how-to: https://www.sktime.net/en/latest/developer_guide/add_estimators.html
- please implement
_predict_proba
only, this should start with an underscore. - in it, it is sufficient to start at the line
# Convert NGBoost distibution to skpro BaseDistribution
and supply the code below. That is all you need to do. No need to do copy-paste anything from the base class.- of course, for all distributions that
ngboost
could return. I.e., you need to check which distribution is being returned, and convert to the sameskpro
distribution. Allngboost
distributions should have a counterpart inskpro
.
- of course, for all distributions that
Also, you should add |
import numpy as np | ||
import pandas as pd | ||
from ngboost import NGBRegressor | ||
from ngboost.distns import Normal |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
imports should happen only in the estimator
|
||
def __init__( | ||
self, | ||
Dist=Normal, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These need to change to strings.
variables should also be all lower_snake_case
.
The "base learner" I would init with None
, and internally replace with the sklearn
default learner.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please guide me as to which sklearn
default learner do I use instead of ngboost default_tree_ learner
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that's an sklearn
learner - I would advise to hard-code it directly in __init__
, i.e., if estimator is None: self.estimator_ = etc
, with the precise learner spec found here: https://github.com/stanfordmlgroup/ngboost/blob/7da1ca94a8857b865cf39fca71b2564bad0c3e15/ngboost/learners.py
(I would also advise to rename the arg base
to estimator
, to be in line with sklearn
here)
pyproject.toml
Outdated
@@ -47,6 +47,7 @@ dependencies = [ | |||
"scikit-base>=0.6.1,<0.8.0", | |||
"scikit-learn>=0.24.0,<1.5.0", | |||
"scipy<2.0.0,>=1.2.0", | |||
"ngboost", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not to the core dependencies please, to the optional dependencies.
please also include an upper bound, this should be current MINOR plus one
An NGBRegressor object that can be fit. | ||
""" | ||
|
||
_tags = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should include other tags as well: "authors"
(you) and "maintainers"
from ngboost.distns import Laplace, LogNormal, Normal, Poisson, T | ||
|
||
dist_ngboost = Normal | ||
if self.dist == "Normal": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, very nice, that's what I meant.
I would recommend to use a dict
here, that makes the logic easier to maintain and extend.
Let me know if you want pointers how ot concretely use a dict
.
Converted skpro distribution object. | ||
""" | ||
skpro_dist = None | ||
if self.dist == "Normal": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as above, using a dict
would help.
X = self._check_X(X) | ||
|
||
# Convert NGBoost distribution to skpro BaseDistribution | ||
pred_mean = self._predict(X=X) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The general idea makes sense, though I would do it differently:
- just call
pred_dist
fromngboost
. This is a full parameteric distirbution. - in the
_ngb_dist_to_skpro
, extract the parameters from thengboost
distribution and plug them into theskpro
distribution.
No need to take detours via the mean or _predict
, especially since some of the distributions are not directly parameterized in terms of the mean.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have now taken care as to what the value actually returns from the ngboost
and converted it to the respective value in skpro
. It has also been explained using the comments as to what the distribution
is and what the parameters
indicate.
self._is_fitted = True | ||
return self | ||
|
||
def _predict(self, X): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would recommend, remove _predict
and rely on the default behaviour entirely.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be useful to keep _predict
in case for just predicting the values. I have not used this anymore in _predict_proba
. Please do let me know if you think otherwise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hm, I suppose it makes sense - not a blocker from my side.
early_stopping_rounds=self.early_stopping_rounds, | ||
) | ||
self.ngb.fit(X, y) | ||
self._is_fitted = True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no need to set _is_fitted
to True
, the boilerplate layer in fit
already does this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great progress!
I left some comments above for the next iteration.
@fkiraly could you please review the updates I have made to the code and if anything else needs to be modified. |
return [params1, params2] | ||
|
||
|
||
# #Load Boston housing dataset |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
kindly remove the commented code - you can also use check_estimator
for testing
`create_test_instance` uses the first (or only) dictionary in `params` | ||
""" | ||
params1 = {"dist": "Normal"} | ||
params2 = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add more settings to cover all parameters?
It should also cover the default case, empty param set.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ShreeshaM07, while we are checking, could you kindly move the empty param set to the first position? Not strictly needed, but easier to read.
self.ngb = NGBRegressor( | ||
Dist=dist_ngboost, | ||
Score=score, | ||
Base=self.estimator, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the estimator needs to be cloned - otherwise it will be mutated, and you violate the sklearn
extension contract (as the failing test tells you)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ShreeshaM07, I think one of the failures might relate to this comment which you have not addressed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have addressed the clone issue it is not related to that. It is with the handling of the estimator
which is DecisionTreeRegressor
but init
was None
.
ngboost
to skpro
ngboost
"LogNormal": LogNormal, | ||
} | ||
|
||
kwargs["index"] = kwargs["mu"].index |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be from X.index
(from predict_proba
), and y.index
(from fit
), not the index that ngboost returns.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes that makes sense. I have now made it X.index
inplace of kwargs["mu"].index
however I am still unable to remove the
kwargs[skp_param] = self._check_y(y=kwargs[skp_param])
# returns a tuple so taking only first index of the tuple
kwargs[skp_param] = kwargs[skp_param][0]
As it is causing a shape mismatch error upon removing it.
ValueError: shape mismatch: objects cannot be broadcast to a single shape. Mismatch is between arg 0 with shape (111,) and arg 3 with shape (10,)
.
Where 111 expected Index but arg3 would be the number of columns which seems to be 10.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you have an explanation why this happens?
Could you please help me with the check_estimator. Upon running
It is throwing a |
You need to use the |
|
||
Parameters | ||
---------- | ||
dist : string , default = "Normal" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for updating the docstring. Please check how the docs render in the readthedocs CI job.
Indentation matters! And this will render funnily, so kindly look at the other regressors and indent similarly.
You can check the result, how it looks, in the readthedocs CI job (click "details").
|
||
class NGBoostRegressor(BaseProbaRegressor): | ||
""" | ||
Constructor for all NGBoost models. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here you probably want to describe the regressor instead, i.e., it is an interface to NGBRegressor
.
I have resolved the shape mismatch error using But some tests are failing in the
Could you please help out as to why these are failing. |
Of course, will look at it later today. Can you also describe what you have found out in debugging these failures? You can use |
I have now resolved all the errors except Can you please suggest what i can do to solve this issue. I tried doing the initialisation to DecisionTreeRegressor which is being done in |
…e to DecisionTreeRegressor
See above - look at the error messages, they should tell you precisely what the problem is. |
Yes I have now made it Could you please run the CI jobs again. |
Great! Looks excellent, I will give it a review once the tests pass. From superficial inspection, looks ready to merge! |
Hm, it looks like Well, with exception of the sixth (index 5) parameter set - do you have an idea why that parameter set specifically is failing? Seems nans are being produced. |
|
||
params6 = { | ||
"dist": "TDistribution", | ||
"natural_gradient": False, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this one seems to be the failing example. Any ideas?
I cannot spot any issue. Is this perhaps a bug in ngboost
itself?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
t-distribution does have non-degenerate natrual gradient afaik, so that might not be it?
But perhaps it's worth trying a different combination here, e.g., normal with natural gradient, and tdistribution plain.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried running with the ngboost
's regressor itself it still gave Singular Matrix
error for the load_diabetes
dataset for T-Distribution
. For all other distributions it works as expected. So just in case it was actually due to some mathematical working involving the determinant of the matrix to not be equal to 0 in case of T-Distribution, I tried with load_breast_cancer
dataset ,even for this it is giving Singular Matrix error. So it might actually be the case that there's a bug in ngboost
T-Distribution itself.
t-distribution does have non-degenerate natrual gradient afaik, so that might not be it? But perhaps it's worth trying a different combination here, e.g., normal with natural gradient, and tdistribution plain.
The error is not related to the natural_gradient
it works for all other distributions when it is False
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great investigation!
What we could do, remove the parameter set in this PR, but file a bug report on skpro
, with code that uses these parameters and the diabetes dataset. In the issue, we then try to remove the skpro
interface and use pure ngboost
code, that allows us to check whether it's sth in the skpro
API, or genuinely ngboost
. If the issue remains, we can report on the ngboost
repo as a bug, with the pure ngboost
code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, thanks!
This is ready to merge, could you open the issue on the skpro
tracker to investigate the TDistribution
failure? With code if you have it, otherwise just a quick note.
I have now opened #291 along with the code causing the issue. |
Reference Issues/PRs
Resolves #135
What does this implement/fix? Explain your changes.
This PR has added a way to use
ngboost regressor
inskpro
.Tested it using the following piece of code
Does your contribution introduce a new dependency? If yes, which one?
NGBRegressor
has been added from thengboost
package.What should a reviewer concentrate their feedback on?
What modifications need to be done and some help on how to add
get_test_params
function and the parameters to be sent viaFITTED_PARAMS_TO_FORWARD
.Did you add any tests for the change?
No
Any other comments?
PR checklist
For all contributions
How to: add yourself to the all-contributors file in the
skpro
root directory (not theCONTRIBUTORS.md
). Common badges:code
- fixing a bug, or adding code logic.doc
- writing or improving documentation or docstrings.bug
- reporting or diagnosing a bug (get this pluscode
if you also fixed the bug in the PR).maintenance
- CI, test framework, release.See here for full badge reference
For new estimators
docs/source/api_reference/taskname.rst
, follow the pattern.Examples
section.python_dependencies
tag and ensureddependency isolation, see the estimator dependencies guide.