Introduce honest sampling into sklearn RandomForestClassifier #387
Comments
Is it an extension or milestone towards #9?
Oh @PSSF23, so sorry, I didn't see that earlier issue! I guess it's just a more detailed description/extension?
New text of issue in sklearn:
Is your feature request related to a problem? Please describe.
Honest sampling in forests was first outlined by Breiman et al. (1984) in Classification and Regression Trees, which suggested that performance improves when the feature space is partitioned on one subset of the samples and the posteriors are estimated on a disjoint subset. This idea was revived by Denil et al. as structure vs. estimation points, and clarified and implemented by Wager and Athey. They identify several benefits of honest sampling: reduced bias, centered confidence intervals, reduced mean squared error, and the possibility of building causal forests. From Wager and Athey 2018 (section 2.4: Honest Trees and Forests):
In addition, Wager and Athey note that subsampling does not "waste" training data:
Describe the solution you'd like
EconML has forked scikit-learn to create honest trees and generalized random forests for causal questions. We intend, instead, to merge back into scikit-learn based on the insights from EconML in regression trees, while also building classification trees, with the ability to accept both dense and sparse data.
Key references:
Breiman L, Friedman J, Stone C, Olshen R. Classification and Regression Trees. 1984.
Denil M, Matheson D, De Freitas N. Narrowing the Gap: Random Forests In Theory and In Practice. Proc 31st Int Conf Mach Learn. 2014; 665–673. Available: http://jmlr.org/proceedings/papers/v32/denil14.html
Wager S, Athey S. Estimation and Inference of Heterogeneous Treatment Effects using Random Forests. J Am Stat Assoc. 2018;113: 1228–1242. doi:10.1080/01621459.2017.1319839
Guo R, Mehta R, Arroyo J, Helm H, Shen C, Vogelstein JT. Estimating Information-Theoretic Quantities with Uncertainty Forests. 2019; 1–19. Available: http://arxiv.org/abs/1907.00325
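For concreteness, here is a minimal sketch of the idea on top of the existing sklearn API: the tree structure is learned on one half of the data, and the leaf posteriors are then recomputed from the disjoint other half. The dataset, the 50/50 split, and the `honest_predict_proba` helper are illustrative assumptions, not part of any existing API.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Disjoint "structure" and "estimation" subsets (Denil et al.'s structure vs. estimation points).
X_struct, X_est, y_struct, y_est = train_test_split(X, y, test_size=0.5, random_state=0)

# 1. Learn the partition of the feature space on the structure points only.
tree = DecisionTreeClassifier(random_state=0).fit(X_struct, y_struct)

# 2. Re-estimate leaf posteriors from the held-out estimation points.
n_classes = len(np.unique(y))
leaf_ids = tree.apply(X_est)  # leaf index for each estimation sample
posteriors = {}
for leaf in np.unique(leaf_ids):
    counts = np.bincount(y_est[leaf_ids == leaf], minlength=n_classes)
    posteriors[leaf] = counts / counts.sum()

# 3. Predict by routing new points to leaves and reading off the honest posteriors;
#    leaves that received no estimation points fall back to a uniform posterior.
def honest_predict_proba(X_new):
    leaves = tree.apply(X_new)
    uniform = np.full(n_classes, 1.0 / n_classes)
    return np.array([posteriors.get(leaf, uniform) for leaf in leaves])
```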
Is your feature request related to a problem? Please describe.
UncertaintyForest from ProgLearn implements both honest sampling (partitioning the feature space on one subset of the samples and estimating posteriors on a disjoint subset, i.e. the structure vs. estimation points of Denil et al.) and a finite sample correction, neither of which is currently available in sklearn RandomForestClassifier.
Implementing these features in sklearn would streamline ProgLearn (removing the need to define a separate UncertaintyForest class) and would let other users control these aspects of their random forests, rather than only being able to turn bootstrapping on and off, as RandomForestClassifier currently allows.
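As a small illustration of the finite sample correction idea, the snippet below applies additive (Laplace) smoothing so that no class in a leaf is assigned probability exactly 0 or 1. This is only one common form of such a correction; the exact correction used in UncertaintyForest may differ.

```python
import numpy as np

def corrected_posterior(counts, alpha=1.0):
    """Smooth raw class counts from a leaf so no class gets probability exactly 0 or 1.

    Additive (Laplace) smoothing is shown here as an illustrative stand-in for the
    finite sample correction; it is not the exact UncertaintyForest formula.
    """
    counts = np.asarray(counts, dtype=float)
    return (counts + alpha) / (counts.sum() + alpha * counts.size)

print(corrected_posterior([0, 5]))  # -> [0.143, 0.857] instead of [0.0, 1.0]
```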
From Wager and Athey 2018 (section 2.4: Honest Trees and Forests):
In addition, Wager and Athey note that subsampling does not "waste" training data:
Key references:
Guo R, Mehta R, Arroyo J, Helm H, Shen C, Vogelstein JT. Estimating Information-Theoretic Quantities with Uncertainty Forests. 2019; 1–19. Available: http://arxiv.org/abs/1907.00325
Wager S, Athey S. Estimation and Inference of Heterogeneous Treatment Effects using Random Forests. J Am Stat Assoc. 2018;113: 1228–1242. doi:10.1080/01621459.2017.1319839
Denil M, Matheson D, De Freitas N. Narrowing the Gap: Random Forests In Theory and In Practice. Proc 31st Int Conf Mach Learn. 2014; 665–673. Available: http://jmlr.org/proceedings/papers/v32/denil14.html
Describe the solution you'd like
After talking with the EconML team and Randal Burns, our next step is to analyze how EconML implemented honest regressors and adapt that tree implementation (in Cython) for ProgLearn and for honest classification trees in sklearn.
Update (2/23): I made a fork of the sklearn repository and will update DecisionTreeClassifier in _classes.py, as well as _tree.pyx and _splitter.pyx. Then I will figure out how to call them from forest.py when building an UncertaintyForest in ProgLearn. If this is successful, I will draft an issue in sklearn.
Describe alternatives you've considered
One possibility (sketched under "Additional context" below) is to use the `bootstrap = False` condition with sklearn DecisionTreeClassifier to ensure that the whole dataset is used, and then specify which (disjoint) subsets of that dataset are used for partitioning vs. estimating posteriors.
Another possibility is to set `bootstrap = True` and modify the implementation of `max_samples` to make sure sampling is done without replacement and in specified, disjoint proportions.
Additional context (e.g. screenshots)
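As additional context, here is a rough sketch of the first alternative: `bootstrap = False` plus an explicit structure/estimation split. Every tree is grown on the same structure subset (randomness comes only from feature subsampling), and each tree's leaf posteriors are re-estimated on the disjoint estimation subset and averaged. The `honest_proba` helper and the 50/50 split are hypothetical choices for illustration; the second alternative would additionally give each tree its own disjoint subsample via `max_samples`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_struct, X_est, y_struct, y_est = train_test_split(X, y, test_size=0.5, random_state=0)
n_classes = len(np.unique(y))

# Every tree sees the full structure set (no bootstrapping); diversity comes from max_features.
forest = RandomForestClassifier(n_estimators=100, bootstrap=False, random_state=0)
forest.fit(X_struct, y_struct)

def honest_proba(X_new):
    """Average, over trees, the leaf posteriors re-estimated on the disjoint estimation set."""
    proba = np.zeros((X_new.shape[0], n_classes))
    for tree in forest.estimators_:
        est_leaves = tree.apply(X_est)
        new_leaves = tree.apply(X_new)
        for leaf in np.unique(new_leaves):
            mask = est_leaves == leaf
            if mask.any():
                counts = np.bincount(y_est[mask], minlength=n_classes)
                proba[new_leaves == leaf] += counts / counts.sum()
            else:
                proba[new_leaves == leaf] += 1.0 / n_classes  # uniform fallback for empty leaves
    return proba / len(forest.estimators_)
```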