
Introduce honest sampling into sklearn RandomForestClassifier #387

Closed
EYezerets opened this issue Dec 9, 2020 · 3 comments · Fixed by #466
Assignees: EYezerets
Labels: feature (a feature we want and know how to do), sklearn (will try to merge into sklearn)

Comments

EYezerets (Collaborator) commented Dec 9, 2020

Is your feature request related to a problem? Please describe.

UncertaintyForest from ProgLearn implements both honest sampling (partitioning the feature space on one subset of the training data and estimating posteriors on a disjoint subset, i.e. the structure/estimation point split of Denil et al.) and finite-sample correction; neither is currently available in sklearn's RandomForestClassifier.

Implementing these features in sklearn would streamline ProgLearn (removing the need to define a separate UncertaintyForest class) and let other users control these aspects of their random forests, rather than only toggling bootstrapping on and off, which is the control RandomForestClassifier currently offers.
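For context, the structure/estimation split can be emulated outside sklearn internals: fit a standard DecisionTreeClassifier on one half of the data, then recompute each leaf's posterior from the disjoint half via `tree.apply`. A minimal sketch, where `honest_predict_proba` and the fall-back-to-prior rule for unseen leaves are illustrative assumptions, not part of any library:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, random_state=0)
n_classes = 2

# Disjoint "structure" and "estimation" sets (the honest split)
X_str, X_est, y_str, y_est = train_test_split(X, y, test_size=0.5, random_state=0)

# Partition the feature space using only the structure set
tree = DecisionTreeClassifier(random_state=0).fit(X_str, y_str)

# Re-estimate each leaf's posterior from the held-out estimation set
leaves = tree.apply(X_est)
posteriors = {}
for leaf in np.unique(leaves):
    counts = np.bincount(y_est[leaves == leaf], minlength=n_classes)
    posteriors[leaf] = counts / counts.sum()

def honest_predict_proba(X_new):
    """Honest prediction: look up the leaf, return its honest posterior;
    fall back to the class prior for leaves with no estimation points."""
    prior = np.bincount(y_est, minlength=n_classes) / len(y_est)
    return np.array([posteriors.get(leaf, prior) for leaf in tree.apply(X_new)])

proba = honest_predict_proba(X_est[:5])
```

Note that `predict_proba` on the fitted tree would instead return posteriors computed from the structure points themselves, which is exactly the "dishonesty" this issue is about.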

From Wager and Athey 2018 (section 2.4: Honest Trees and Forests):

In our discussion so far, we have emphasized the flexible nature of our results: for a wide variety of causal forests that can be tailored to the application area, we achieve both consistency and centered asymptotic normality, provided the subsample size s scales at an appropriate rate. Our results do, however, require the individual trees to satisfy a fairly strong condition, which we call honesty: a tree is honest if, for each training example i, it only uses the response Yi to estimate the within-leaf treatment effect τ using (5) or to decide where to place the splits, but not both.

In addition, Wager and Athey note that subsampling does not "waste" training data:

However, in our case, the forest subsampling mechanism enables us to achieve honesty without wasting any data in this sense, because we rerandomize the I/J-data splits over each subsample. Thus, although no data point can be used for split selection and leaf estimation in a single tree, each data point will participate in both I and J samples of some trees, and so will be used for both specifying the structure and treatment effect estimates of the forest. Although our original motivation for considering double-sample trees was to eliminate bias and thus enable centered confidence intervals, we find that in practice, double-sample trees can improve upon standard random forests in terms of mean-squared error as well.

Key references:

Guo R, Mehta R, Arroyo J, Helm H, Shen C, Vogelstein JT. Estimating Information-Theoretic Quantities with Uncertainty Forests. 2019; 1–19. Available: http://arxiv.org/abs/1907.00325

Wager S, Athey S. Estimation and Inference of Heterogeneous Treatment Effects using Random Forests. J Am Stat Assoc. 2018;113: 1228–1242. doi:10.1080/01621459.2017.1319839

Denil M, Matheson D, De Freitas N. Narrowing the Gap: Random Forests In Theory and In Practice. Proc 31st Int Conf Mach Learn. 2014; 665–673. Available: http://jmlr.org/proceedings/papers/v32/denil14.html

Describe the solution you'd like

After talking with the EconML team and Randal Burns, our next step is to analyze how EconML was implemented for honest regressors and adapt the tree implementation (in Cython) for ProgLearn and sklearn honest classification trees.

Update (2/23): I made a fork of the sklearn repository and will update DecisionTreeClassifier in _classes.py, as well as _tree.pyx and _splitter.pyx. Then I will figure out how to call them from forest.py when building an UncertaintyForest in ProgLearn. If this is successful, I will draft an issue in sklearn.

Describe alternatives you've considered

One possibility is to use the bootstrap = False condition with sklearn DecisionTreeClassifier to ensure that the whole dataset is used, and then specify which (disjoint) subsets of that dataset are used for partitioning vs. estimating posteriors.

Another possibility is to set bootstrap = True and modify the implementation of max_samples to make sure sampling is done without replacement and in specified proportions, disjointly.
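The second alternative amounts to a per-tree index split, sampled without replacement in a specified proportion and rerandomized for every tree, as in Wager and Athey's double-sample trees. A sketch of that sampling scheme (`honest_subsample` is an illustrative helper name, not an existing sklearn parameter):

```python
import numpy as np

def honest_subsample(n_samples, frac_structure=0.5, seed=None):
    """Return disjoint structure/estimation index sets for one tree,
    drawn without replacement in the given proportion."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_samples)
    cut = int(frac_structure * n_samples)
    return perm[:cut], perm[cut:]

# Rerandomize the I/J split for every tree in the forest, so each point
# serves as a structure point in some trees and an estimation point in others.
splits = [honest_subsample(100, seed=t) for t in range(10)]
```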

Additional context (e.g. screenshots)

@EYezerets added the feature and sklearn labels Dec 9, 2020
@EYezerets EYezerets self-assigned this Dec 9, 2020
PSSF23 (Member) commented Dec 9, 2020

Is it an extension or milestone towards #9?

EYezerets (Collaborator, Author) commented:

Oh @PSSF23 So sorry, I didn't see that earlier issue! I guess it's just a more detailed description/extension?

EYezerets (Collaborator, Author) commented:

New text of issue in sklearn:

Is your feature request related to a problem? Please describe.

Honest sampling was first outlined by Breiman et al. 1984 in Classification and Regression Trees, in which it was suggested that trees improve performance when the feature space is partitioned on one subset of samples and the posteriors are estimated on a disjoint subset of the samples. This idea was revived by Denil et al. as structure vs. estimation points, and clarified and implemented by Wager and Athey, who identified several benefits of honest sampling: reduced bias, centered confidence intervals, reduced mean squared error, and the possibility of building causal forests.

From Wager and Athey 2018 (section 2.4: Honest Trees and Forests):

In our discussion so far, we have emphasized the flexible nature of our results: for a wide variety of causal forests that can be tailored to the application area, we achieve both consistency and centered asymptotic normality, provided the subsample size s scales at an appropriate rate. Our results do, however, require the individual trees to satisfy a fairly strong condition, which we call honesty: a tree is honest if, for each training example i, it only uses the response Yi to estimate the within-leaf treatment effect τ using (5) or to decide where to place the splits, but not both.

In addition, Wager and Athey note that subsampling does not "waste" training data:

However, in our case, the forest subsampling mechanism enables us to achieve honesty without wasting any data in this sense, because we rerandomize the I/J-data splits over each subsample. Thus, although no data point can be used for split selection and leaf estimation in a single tree, each data point will participate in both I and J samples of some trees, and so will be used for both specifying the structure and treatment effect estimates of the forest. Although our original motivation for considering double-sample trees was to eliminate bias and thus enable centered confidence intervals, we find that in practice, double-sample trees can improve upon standard random forests in terms of mean-squared error as well.

Describe the solution you’d like

EconML has forked scikit-learn to create honest trees and generalized random forests for causal questions. We intend, instead, to merge honest trees back into scikit-learn, drawing on EconML's insights for regression trees while also building honest classification trees, with the ability to accept both dense and sparse data.

Key references:

Breiman L, Friedman JH, Olshen RA, Stone CJ. Classification and Regression Trees. Wadsworth; 1984.

Denil M, Matheson D, De Freitas N. Narrowing the Gap: Random Forests In Theory and In Practice. Proc 31st Int Conf Mach Learn. 2014; 665–673. Available: http://jmlr.org/proceedings/papers/v32/denil14.html

Wager S, Athey S. Estimation and Inference of Heterogeneous Treatment Effects using Random Forests. J Am Stat Assoc. 2018;113: 1228–1242. doi:10.1080/01621459.2017.1319839

Guo R, Mehta R, Arroyo J, Helm H, Shen C, Vogelstein JT. Estimating Information-Theoretic Quantities with Uncertainty Forests. 2019; 1–19. Available: http://arxiv.org/abs/1907.00325

@PSSF23 PSSF23 linked a pull request May 6, 2021 that will close this issue
@PSSF23 PSSF23 closed this as completed May 6, 2021