scikit-learn-intelex integration #1316

Open
ethanglaser opened this issue Aug 15, 2023 · 4 comments


@ethanglaser
ethanglaser commented Aug 15, 2023

Context

The Intel(R) Extension for Scikit-learn (sklearnex) accelerates popular classical machine learning algorithms on both CPU and GPU. Given TPOT's heavy usage of scikit-learn algorithms, we believe there are compelling reasons for an integration of some sort with sklearnex's optimized regression and classification algorithms. Initial experimentation has shown potential for significant performance improvements - see this Jupyter notebook for further detail.


Proposal

There are a few directions that this could go:

  1. Integrate sklearnex into the TPOT backend and let users set a use_sklearnex flag when initializing their TPOT classifier or regressor. With the flag set, their config would use the sklearnex implementation of each algorithm instead of the default sklearn implementation (where possible). See an example of what this might look like in the backend code here: fork, and how it could translate into performance improvements in the notebook.
  • Pros: any config could be accelerated here, no exclusion of algorithms (would use default sklearn implementation if an algorithm is not supported in sklearnex), relatively clean integration as shown in the branch above
  • Cons: configs and circumstances that do not lead to heavy usage of sklearnex-supported algorithms would not get significant performance improvements (e.g. sklearnex does not have an implementation of neural_network.MLPClassifier)
  2. Create a separate sklearnex config classifier and regressor, which would yield a pipeline with sklearnex-supported algorithms (possibly something like this regressor_config_dict_sklearnex)
  • Pros: all or most algorithms included in this config would be accelerated by sklearnex, yielding optimal performance improvements
  • Cons: not all algorithms that a user might be interested in comparing would be covered by this config, and similarly - use of the existing TPOT configs that users are familiar with would not be accelerated
  3. A combination of 1 and 2. Integrate into the TPOT backend with a flag for users who want to accelerate existing configs, as well as a separate config focused on the sklearnex-accelerated algorithms.
  • Pros: provides users with the most flexibility - can use the new config (option 2), accelerate existing configs (option 1), or use the original configs without accelerations as usual - fully backwards compatible
  • Cons: none other than it would be the most involved integration (but still fairly simple)
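
To make the difference between options 1 and 2 concrete, here is a minimal sketch that assumes TPOT-style config dictionaries keyed by estimator import path. The `accelerate_config` helper, its `keep_unsupported` parameter, and the `SKLEARNEX_SUPPORTED` set are hypothetical names for illustration only; they are not taken from the fork linked above, and the supported set is an illustrative subset rather than the full sklearnex support list.

```python
from typing import Dict, Set

# Illustrative subset only (an assumption for this sketch); the sklearnex
# documentation has the authoritative list of supported algorithms.
SKLEARNEX_SUPPORTED: Set[str] = {
    "sklearn.linear_model.LogisticRegression",
    "sklearn.ensemble.RandomForestClassifier",
    "sklearn.neighbors.KNeighborsClassifier",
    "sklearn.svm.SVC",
}


def accelerate_config(
    config: Dict[str, dict],
    supported: Set[str] = SKLEARNEX_SUPPORTED,
    keep_unsupported: bool = True,
) -> Dict[str, dict]:
    """Rewrite 'sklearn.*' keys to 'sklearnex.*' where an accelerated version exists.

    keep_unsupported=True  -> option 1: unsupported entries stay on stock sklearn.
    keep_unsupported=False -> option 2: only sklearnex-backed entries remain.
    """
    accelerated: Dict[str, dict] = {}
    for path, params in config.items():
        if path in supported:
            # Swap only the top-level package name; hyperparameters are unchanged.
            accelerated["sklearnex." + path[len("sklearn."):]] = params
        elif keep_unsupported:
            accelerated[path] = params
    return accelerated


# Hypothetical two-entry config, far smaller than TPOT's defaults.
example_config = {
    "sklearn.ensemble.RandomForestClassifier": {"n_estimators": [100]},
    "sklearn.neural_network.MLPClassifier": {"alpha": [1e-4, 1e-3]},
}

option_1 = accelerate_config(example_config)
option_2 = accelerate_config(example_config, keep_unsupported=False)
```

Under these assumptions, `option_1` keeps MLPClassifier on the default sklearn implementation, while `option_2` drops it and leaves only the sklearnex-backed entry. Option 3 would simply expose both behaviours.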

In any case, there would be corresponding docs/tests updates and an additional tutorial created for a smooth integration, as well as any other additions you feel would be necessary.

Thank you for your consideration. I look forward to continuing this discussion.

@ethanglaser
Author

Just following up on this to see if it would be of interest.

@perib
Contributor

perib commented Sep 20, 2023

We have shifted development to TPOT2, which is a refactored version of TPOT1 that is hopefully easier to work with (we will pin something about it to the issues page soon). You can find it here: https://github.com/EpistasisLab/tpot2

But yes, I would be interested in exploring this. I think option 2 makes the most sense. There are other similar accelerated packages we were considering, such as cuML. Option 2 would give them all the same interface.

@ethanglaser
Author

ethanglaser commented Oct 3, 2023

Great, I can open a PR reflecting the integration described in option 2 to continue the discussion here. I see that in TPOT2 the configs are in a somewhat different format than in the original library, and I am not seeing the cuML config or other similar custom ones. Any suggestions on the approach for this?

@perib
Contributor

perib commented Oct 3, 2023

The configuration setup is different in TPOT2. Rather than a single configuration dictionary, TPOT2 takes in three: one each for the leaves, roots, and inner nodes. Additionally, we allow multiple configurations to be selected simultaneously and have broken up the configuration dictionary into modular pieces (selection, transformers, classifiers, regressors, etc.). Some configurations are also not fixed and depend on the shape of your dataset. More information on how to set this up can be found in tutorial 2 here.

To add a custom configuration to TPOT2, a file defining the search space can be added to the configs folder here. Then an option can be added to this function to allow it as an option for the TPOTEstimator.

This approach could be used to add cuML support or sklearnex.
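
As a rough illustration of that structure (not TPOT2's actual code), the sketch below registers named, modular search spaces, lets several be selected at once, and lets a space depend on the dataset shape. Every name here (`CONFIG_BUILDERS`, `build_search_space`, the two builder functions, and the plain-dict search-space format) is a hypothetical stand-in; the real configs folder and lookup function linked above may look quite different.

```python
from typing import Callable, Dict, List

# Assumed format for this sketch: estimator import path -> hyperparameter choices.
SearchSpace = Dict[str, dict]


def sklearnex_classifiers(n_samples: int, n_features: int) -> SearchSpace:
    """Hypothetical classifier search space that depends on the dataset shape."""
    return {
        "sklearnex.svm.SVC": {"C": [0.1, 1.0, 10.0]},
        "sklearnex.neighbors.KNeighborsClassifier": {
            # n_neighbors cannot exceed the number of samples; capped at 10 here.
            "n_neighbors": list(range(1, min(n_samples, 10) + 1)),
        },
    }


def sklearnex_transformers(n_samples: int, n_features: int) -> SearchSpace:
    """Hypothetical transformer search space."""
    return {
        "sklearnex.decomposition.PCA": {
            # PCA allows at most min(n_samples, n_features) components.
            "n_components": list(range(1, min(n_samples, n_features) + 1)),
        },
    }


# Hypothetical registry: option name -> builder function.
CONFIG_BUILDERS: Dict[str, Callable[[int, int], SearchSpace]] = {
    "sklearnex_classifiers": sklearnex_classifiers,
    "sklearnex_transformers": sklearnex_transformers,
}


def build_search_space(names: List[str], n_samples: int, n_features: int) -> SearchSpace:
    """Merge the search spaces of all selected config names."""
    space: SearchSpace = {}
    for name in names:
        if name not in CONFIG_BUILDERS:
            raise ValueError(f"Unknown config option: {name!r}")
        space.update(CONFIG_BUILDERS[name](n_samples, n_features))
    return space


# One search space per pipeline position, for a toy 5-sample, 3-feature dataset.
pipeline_spaces = {
    "root": build_search_space(["sklearnex_classifiers"], 5, 3),
    "inner": build_search_space(["sklearnex_transformers"], 5, 3),
    "leaf": build_search_space(["sklearnex_transformers"], 5, 3),
}
```

A cuML config could plug into the same kind of registry under its own option name, which is the shared interface that option 2 is meant to provide.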

We still need to add cuML to TPOT2, which is on the to-do list.
