Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Alternative method to set PySRRegressor parameters #216

Draft
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

MilesCranmer
Copy link
Owner

PySRRegressor currently has 74 parameters. That is way too many for a single method. It can be hard to figure out what a single parameter from the long list actually does, especially for new users - who are likely overwhelmed by the large number of options.

What this work-in-progress PR does is create an alternative way to initialize a PySRRegressor object. Rather than passing all parameters at initialization (which would still be allowed), you could instead call a series of different functions to configure various parts of the regressor.

Here's an example of what this might look like:

model = (
    PySRRegressor()
        .set_space(binary_operators=["+", "-"], unary_operators=["cos"], maxsize=30)
        .set_size(niterations=100, populations=50)
        .set_parallelism(procs=8)
)

Each of these functions would define only a subset of parameters according to the groups shown in the new API page: https://astroautomata.com/PySR/api/.

This makes for much more readable code, and hopefully a model that is easier for new users to explore, rather than having all parameters dumped in a flat list. It also presents an opportunity for better documentation: these set_... functions can have more specific explanations of, e.g., what an "equation space" is (for set_space), rather than giving a general overview in the __init__ function.


There are a couple issues with this:

  1. Re-used documentation strings. You would have the docstring both at __init__ and at the set_... functions. Or, we could remove the docstring from __init__ altogether, and simply point the user to the various set_... functions they can use.
  2. Some default parameters are None. How do we also make parameters optional in the set_... functions? My solution to this was to define a global NONE="none" constant that is sort of like an alternative definition of None, so we can differentiate between the user setting a variable to None or not.
  3. A bit different than normal Scikit-learn initializations. However, I think most scikit-learn style APIs only have max ~15 parameters, so we are in a different regime.

Thanks @dfm for the idea! Would be interested in hearing your comments when you have time.

@patrick-kidger @tttc3 @JayWadekar I am eager to hear what you think as well.

@patrick-kidger
Copy link

patrick-kidger commented Nov 2, 2022

This sounds like a good idea to me.

Another option here is to define intermediate helper objects:

@dataclass
class Sizes:
    niterations: int
    populations: int

@dataclass
class Space:
    binary_operations: Sequence[Callable]
    unary_operations: Sequence[Callable]
    maxsize: int

class PySRRegressor:
    def __init__(self, sizes: Sizes, space: Space, ...)

Which can likewise help to split things up by grouping less-frequently-used parameters.

This is what I do for diffrax.diffeqsolve, for example. This has a very similar feel: differential equation solvers also take a lot of parameters, many of which are only used in particular use cases (nonlinear solvers only used for implicit problems; delays only used for DDEs; tolerances only used if you're stepping adaptively/etc.etc.)

@MilesCranmer
Copy link
Owner Author

Thanks for the ideas! I am on the fence about which way to go: passing a hierarchy of objects, or declaratively setting parameters.

One one hand, I think that if a user wants to figure out the API just by using hover-help and tab completion in their IDE, then the functional method is better, and the typing method would require they keep flipping back to the docs page, and importing various objects.

But on the other hand, the typing method seems more readable. I can definitely see the advantage of a hierarchy of types when only certain parameter combinations are compatible with eachother. And it's also more modular in case a user wants to add their own configuration, or share optimal parameter configs. (Though perhaps this would be better suited for multiple-dispatch)

Maybe you could rule out certain param combos with the declarative method too, and have the functions return different subclasses of the main regressor object. Sort of like how ggplot or [altair](https://altair-viz.github.io/) creates plots. e.g., maybe .sequence_regressor() would return a subclass that could take sequences as input, rather than tabular data. And similar for tensor operations in the future. But for that perhaps it's just better to have a separate class suited for the task... Not sure!

@tttc3
Copy link
Contributor

tttc3 commented Nov 13, 2022

I don't have any other ideas, but personally, I think the hierarchy of types is more straightforward. It avoids the duplicated docstrings, and I think is easier to understand/follow if the user hasn't read the docs beforehand. Whichever option you decide, I think it will be a nice improvement!

@MilesCranmer
Copy link
Owner Author

Thanks for the feedback @tttc3!

I’m also wondering if it might be possible to get tab completion to work by exposing a “config” module that acts to store references to each type. Then you could write, for example, pysr.config.<tab> and have it show you what types are available. @patrick-kidger what is your experience with tab completions/parameter discoverability with the type-based API?

Maybe I should just mock up a few designs and conduct a Twitter poll…

@patrick-kidger
Copy link

I've not tried using a config namespace to hold them - but otherwise yup, tab completion works well.

@MilesCranmer
Copy link
Owner Author

Thanks! How do you use tab completion? i.e., what string of text are you actually hitting tab on?

@patrick-kidger
Copy link

Typically:

-type-> diffrax.Kv
-tab-> diffrax.Kvaerno5
-type-> diffrax.Kvaerno5(no
-tab-> diffrax.Kvaerno5(nonlinear_solver=

As I have it set to prompt after two characters. (FWIW my current editors are Helix and NvChad.)

If you're thinking in terms of discoverability then that should work just the same, I think. Probably easier if you configure prompting to happen after zero characters though.

@MilesCranmer
Copy link
Owner Author

Thanks. I guess if I do <tab> after typing diffrax, there are too many options for me to know what to select, right? i.e., if I don't know to type Kv, then one might have to go to the docs?

I wonder if it would be enough to have a prefix on all config parameters, like ConfigSpace, ConfigSize, etc. Then when you hit Config<tab> , it would only show you the small set of available configs, so you could step through more easily (since Config is easy enough to memorize). Or just have a pysr.config submodule...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants