[Bug]: type casting outcome_variable and treatment_variable(s) #232

hjk612 · 2024-03-12T23:57:50Z

Describe the bug

This is more of a nitpick :) I think there is an implicit assumption that the outcome_variable and treatment_variable(s) are of type float. If we pass DoubleMLData a dataframe in which those variables are of type Decimal, the partialling out step fails with the error shown below. This comes up especially when reading parquet files.

TypeError                                 Traceback (most recent call last)
Cell In[36], line 1
----> 1 dml_plr.fit(n_jobs_cv = -1)

File ~/anaconda3/envs/python3/lib/python3.10/site-packages/doubleml/double_ml.py:605, in DoubleML.fit(self, n_jobs_cv, store_predictions, external_predictions, store_models)
    602         ext_prediction_dict[learner] = None
    604 # ml estimation of nuisance models and computation of score elements
--> 605 score_elements, preds = self._nuisance_est(self.__smpls, n_jobs_cv,
    606                                            external_predictions=ext_prediction_dict,
    607                                            return_models=store_models)
    609 self._set_score_elements(score_elements, self._i_rep, self._i_treat)
    611 # calculate rmses and store predictions and targets of the nuisance models

File ~/anaconda3/envs/python3/lib/python3.10/site-packages/doubleml/double_ml_plr.py:231, in DoubleMLPLR._nuisance_est(self, smpls, n_jobs_cv, external_predictions, return_models)
    226     g_hat = {'preds': external_predictions['ml_g'],
    227              'targets': None,
    228              'models': None}
    229 else:
    230     # get an initial estimate for theta using the partialling out score
--> 231     psi_a = -np.multiply(d - m_hat['preds'], d - m_hat['preds'])
    232     psi_b = np.multiply(d - m_hat['preds'], y - l_hat['preds'])
    233     theta_initial = -np.nanmean(psi_b) / np.nanmean(psi_a)

TypeError: unsupported operand type(s) for -: 'decimal.Decimal' and 'float'
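
The failing operation can be isolated from DoubleML. The following is a minimal sketch, assuming the Decimal column reaches NumPy as an object-dtype array (which is how pandas typically holds Decimal values) and the learner predictions are a float array. The names `d` and `m_hat_preds` only mirror the traceback; the values are made up for illustration.

```python
from decimal import Decimal

import numpy as np

# Treatment values as they might arrive from a Decimal-typed column
# (object-dtype array holding decimal.Decimal instances).
d = np.array([Decimal("1.0"), Decimal("0.0"), Decimal("1.0")], dtype=object)

# Stand-in for learner predictions, which are plain floats.
m_hat_preds = np.array([0.7, 0.2, 0.9])

try:
    residual = d - m_hat_preds  # Decimal - float is not defined in Python
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```

Python's `decimal.Decimal` deliberately refuses mixed arithmetic with `float`, so the elementwise subtraction raises rather than silently converting.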

Minimum reproducible code snippet

import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LassoCV
from doubleml import DoubleMLData, DoubleMLPLR

# Parquet file in which the outcome/treatment columns are Decimal-typed
df = pd.read_parquet("/...")

x_cols = [x for x in df.columns if "pre_" in x]
d_col = "event_action"
y_col = "post_outcome"

dml_data = DoubleMLData(df, y_col=y_col, d_cols=d_col, x_cols=x_cols)

learner = RandomForestRegressor(n_jobs=-1)
lasso = LassoCV()
dml_plr = DoubleMLPLR(dml_data, ml_l=learner, ml_g=learner, ml_m=lasso, score="IV-type", n_folds=2)
# Raises the TypeError shown under "Actual Result"
dml_plr.fit(n_jobs_cv=-1)

Expected Result

Model should fit successfully.

Actual Result

TypeError                                 Traceback (most recent call last)
Cell In[36], line 1
----> 1 dml_plr.fit(n_jobs_cv = -1)

File ~/anaconda3/envs/python3/lib/python3.10/site-packages/doubleml/double_ml.py:605, in DoubleML.fit(self, n_jobs_cv, store_predictions, external_predictions, store_models)
    602         ext_prediction_dict[learner] = None
    604 # ml estimation of nuisance models and computation of score elements
--> 605 score_elements, preds = self._nuisance_est(self.__smpls, n_jobs_cv,
    606                                            external_predictions=ext_prediction_dict,
    607                                            return_models=store_models)
    609 self._set_score_elements(score_elements, self._i_rep, self._i_treat)
    611 # calculate rmses and store predictions and targets of the nuisance models

File ~/anaconda3/envs/python3/lib/python3.10/site-packages/doubleml/double_ml_plr.py:231, in DoubleMLPLR._nuisance_est(self, smpls, n_jobs_cv, external_predictions, return_models)
    226     g_hat = {'preds': external_predictions['ml_g'],
    227              'targets': None,
    228              'models': None}
    229 else:
    230     # get an initial estimate for theta using the partialling out score
--> 231     psi_a = -np.multiply(d - m_hat['preds'], d - m_hat['preds'])
    232     psi_b = np.multiply(d - m_hat['preds'], y - l_hat['preds'])
    233     theta_initial = -np.nanmean(psi_b) / np.nanmean(psi_a)

TypeError: unsupported operand type(s) for -: 'decimal.Decimal' and 'float'

Versions

Linux-5.10.205-195.807.amzn2.x86_64-x86_64-with-glibc2.26
Python 3.10.13 | packaged by conda-forge | (main, Oct 26 2023, 18:07:37) [GCC 12.3.0]
DoubleML 0.7.1
Scikit-Learn 1.3.2


SvenKlaassen · 2024-03-13T07:48:43Z

Thank you for highlighting this.
The predictions produced by sklearn are of type float, so subtracting them from Decimal-typed values fails in the partialling out step.
I will try to add a cast of the outcome and treatment variables.
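
Until such a cast exists in the library, one possible user-side workaround is to convert the affected columns to float before building the data object. This is a sketch under assumptions, not the maintainers' fix: the small dataframe below is made up, and only the column names `post_outcome` and `event_action` come from the report.

```python
from decimal import Decimal

import pandas as pd

# Illustrative stand-in for a dataframe read from a parquet file
# with Decimal-typed outcome and treatment columns.
df = pd.DataFrame({
    "post_outcome": [Decimal("1.5"), Decimal("2.0"), Decimal("0.5")],
    "event_action": [Decimal("1"), Decimal("0"), Decimal("1")],
    "pre_x1": [0.1, 0.2, 0.3],
})

y_col = "post_outcome"
d_col = "event_action"

# Cast outcome and treatment to float; other columns are left unchanged.
df = df.astype({y_col: float, d_col: float})

print(df.dtypes)
```

The cast dataframe can then be passed to DoubleMLData as in the snippet from the report. Note that converting Decimal to float can lose precision for values that have no exact binary representation.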

hjk612 added the bug label Mar 12, 2024

hjk612 assigned MalteKurz Mar 12, 2024
